CVPR2026

Abstract:
Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to separately capture both aleatoric and epistemic uncertainties. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process, enabling direct quantification of both aleatoric and epistemic uncertainties from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality. Code is available at https://github.com/KerryDRX/EvidentialNeRF.

Abstract:
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available on GitHub and Argoverse.

Abstract:
Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present Mesh Splatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, Mesh Splatting creates end-to-end smooth, high-fidelity meshes that render efficiently in real-time engines. On Mip-NeRF360 and Tanks&Temples, it boosts PSNR by +0.69dB, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction. The project page is available at https://meshsplatting.github.io/.

Abstract:
Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U^ 2 Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U^ 2 Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at https://github.com/sunzunyi/U2FLOW.

Abstract:
Superpixels partition an image into perceptually coherent regions, reducing the cost of downstream vision tasks. Modern deep learning methods excel at superpixel generation but often yield irregular boundaries and isolated pixels, necessitating non-differentiable post-processing to enforce connectivity. This undermines the end-to-end learning capabilities. We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. The loss is model-agnostic and can be seamlessly integrated into the training of existing architectures to improve the quality of superpixels. In addition, we introduce two novel metrics, the average stray pixel count and excess component count, to measure the quality of superpixels. We demonstrate both qualitative and quantitative improvements over state-of-the-art methods with and without enforced connectivity. Our approach represents a significant step toward eliminating non-differentiable post-processing. Project code: https://github.com/jeremyJJB/Differentiable-Laplacian-Matrix-Guided-Superpixel-Segmentation.

Abstract:
The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its k nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The code and trained models are available at https://github.com/weitong8591/global_edge_prior.

Abstract:
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.

Abstract:
Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Project page: https://zhenglinzhou.github.io/AnchorFlow/.

Abstract:
Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.

Abstract:
Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a novel editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Our project webpage is https://mparelli.github.io/3d-latte.

Abstract:
Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm.

Abstract:
Anchor-based multi-view clustering methods have gained significant attention for their effectiveness in handling large-scale datasets in recent years. The performance of these methods is highly dependent on anchor quality. However, current methods neglect the interactive relationships among cross-view anchors, failing to effectively discover and exploit consistent and complementary information, leading to noisy or suboptimal anchor representations. In this paper, we propose a novel scalable multi-view subspace clustering method with tensorized anchor guidance, which directly couples anchors across views to improve clustering performance. Specifically, we construct a third-order anchor tensor from view-specific anchors in a low-dimensional latent space. By imposing a tensor Schatten p-norm constraint on the anchor tensor, we can explicitly capture cross-view low-rank structure and jointly exploit consistent and complementary information among anchors. Moreover, the tensorized anchor regularizer is independent of the number of samples, which reduces both time and space complexity. Experimental results on seven datasets demonstrate that SMVS-TAG achieves superior effectiveness and stability compared to state-of-the-art large-scale MVC methods. Our code is available at https://github.com/Jiamiao2024/SMVS-TAG.

Abstract:
Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at this \href https://github.com/vivoCameraResearch/MagicBokeh url .

Abstract:
We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision. As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case. In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. As a theoretical contribution, we offer a completely elementary analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first linear solver for the minimal problem associated to this special configuration. Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints. When combined with early non-minimal fitting, the runtime and accuracy of our solver is competitive with the state-of-the-art (SOTA) on multiple benchmarks. Code is available at https://github.com/CIVA-Lab/v-umlaut.

Abstract:
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning. More details on our project page: https://denghilbert.github.io/selfi

Abstract:
Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpus collected by human experts. To complement for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +8.6 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code have been released at https://github.com/OpenDriveLab/SimScale.

Abstract:
Infrared and visible video fusion is pivotal for robust perceptual systems, aiming to synthesize a comprehensive video stream that leverages both thermal resilience and textured details. However, prevailing methods, by treating videos as sequences of independent frames, inherently introduce temporal incoherence, such as flickering and ghosting artifacts. While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. Our key insight is to exploit the generative prior of a pre-trained diffusion model into a one-step sampling framework, while explicitly modeling temporal dynamics. We design a memory-augmented latent pipeline where a temporal aggregation adapter aligns and propagates cross-frame features to ensure coherence, supported by a dedicated temporal consistency loss. This approach effectively decouples the challenge of achieving high fidelity from maintaining temporal stability. Extensive experiments on four benchmarks demonstrate that our method establishes a new state-of-the-art, generating fused videos with exceptional spatio-temporal consistency at a speed suitable for real-time application. The code is available at https://github.com/DandanYoung/SDMFusion.

Abstract:
Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art keypoint detection and description methods. The code will be available at https://github.com/xiaomi-research/traqpoint.

Abstract:
The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code is available at https://github.com/CVL-UESTC/RDVQ.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show unsatisfying performance against the rapidly evolving MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability: MLLMs can robustly understand content regardless of visual style, yet their defense mechanisms can be easily bypassed by specific stylistic triggers. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image, using a Group Relative Policy Optimization (GRPO) agent guided by a Structurally-Tiered Reward Function that combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks, demonstrating that stylistic biases are a scalable vector for red-teaming MLLMs. Our code is available at https://github.com/bingjunluo/ASO.

Abstract:
Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) a LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcased the usefulness of InfiniBench, by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking. The source codes of InfiniBench are available at https://github.com/pittisl/infinibench, and some sample benchmarks being generated by InfiniBench are available at https://huggingface.co/datasets/Haoming645/infinibench.

Abstract:
Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information, for instance, sound strongly influences how we perceive visual scenes. To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception. Project Page: https://jianxgao.github.io/CineBrain.

Abstract:
Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on-demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility. Our code will be available at https://github.com/ZeroNLP/ARGUS.

Abstract:
In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to a 93% data leakage.Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.

Abstract:
Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SotA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available. The source code is available at: https://github.com/nv-tlabs/ppisp

Abstract:
High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF's spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-based 3DGS lacks controllable sampling and thus cannot support progressive pose refinement. To address these challenges, we redesign the optimization strategy of Gaussian primitives and introduce an image-energy-guided constraint that encourages progressive alignment of camera poses. Experiments on both synthetic and real-world datasets show that Energy-GS can effectively optimize the scene reconstruction and resolve camera pose misalignment at the same time. Benefiting from reliance on only RGB images, we believe this work provides promising insights for visual localization and dense mapping applications such as SLAM. The Code is available at \href https://github.com/SkylerGao/ENGS https://github.com/SkylerGao/ENGS

Abstract:
Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.

Abstract:
Semantic Scene Completion (SSC) is essential for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial settings like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors are the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR point clouds from elevated viewpoints. To address these limitations, we propose a LiDAR-free, camera-based data generation framework. By leveraging classical 3D reconstruction, our framework automates semantic label transfer by lifting <10% of annotated images into the reconstructed point cloud, substantially minimizing manual 3D annotation effort. Based on this framework, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured across multiple altitudes and all seasons. OccuFly provides over 20,000 samples of images, semantic voxel grids, and metric depth maps across 21 semantic classes in urban, industrial, and rural environments, and follows established data organization for seamless integration. We benchmark both SSC and metric monocular depth estimation on OccuFly, revealing fundamental limitations of current vision foundation models in aerial settings and establishing new challenges for robust 3D scene understanding in the aerial domain. Visit https://github.com/markus-42/occufly.

Abstract:
The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability. Codes are available at https://github.com/ZhuWenjie98/ANTS.

Abstract:
Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.

Abstract:
We introduce VGGT-O, a feed-forward model for 3D reconstruction that improves accuracy, efficiency, and capabilities for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction is, in many cases, competitive with traditional optimization-based methods. Here, we show that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce architectural changes that improve training efficiency and scalability, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We significantly simplify VGGT's architecture by using a single dense prediction head with multi-task supervision, removing expensive high-resolution convolutional layers, and introducing efficient scene tokens for feature aggregation in lieu of global attention. These changes allow us to train VGGT-O with 15 x more supervised data than prior work and to leverage vast amounts of unlabeled videos, while requiring only ~30% of VGGT's training memory. VGGT-O achieves strong results for 3D reconstruction of static and dynamic scenes across multiple benchmarks, e.g., improving over the previous best camera estimation accuracy by 77% on Sintel.

Abstract:
The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and resolution of overfitting. We address this by introducing the Mapping Networks which replace the high dimensional weight space by a compact, trainable latent vector based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. Henceforth, the Mapping Theorem enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable to better performance than target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection etc., with 99.5%, i.e., around 500x reduction in trainable parameters.

Abstract:
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io.

Abstract:
Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego-Exo and Exo-Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.

Abstract:
We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware and visibility-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. https://mamma.is.tue.mpg.de/

Abstract:
We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D.. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.

Abstract:
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Abstract:
Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes still remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization. We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrates that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses exisiting state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by 2.37% 4.83%) and the fine-tuning-from-pretrained setting (by 2.42% 4.17%) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.

Abstract:
We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases, that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.

Abstract:
This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications.The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity.Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decomposition.Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects ( 2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.

Abstract:
Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. We introduce Perfection Gap Factor (PGF), a normalized metric that measures change in performance as a result of task transfer. We utilize PGF to compute Task Transferability, which captures both the breadth and the magnitude of transfer induced by a source task. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

Abstract:
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual examples. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5 % mIoU, while using 3x fewer parameters and without any mask or category-level supervision.

Abstract:
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT^3) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT^3 across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT^3 consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT^3 baseline can facilitate future work on visual TTT models. Code: github.com/LeapLabTHU/ViTTT.

Abstract:
This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.

Abstract:
Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.

Abstract:
Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of 'unknown' under the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution (ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.

Abstract:
Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow-guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.

Abstract:
Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with an over 80% attack success rate, a significant boost from the prior best result of 57%.

Abstract:
Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.

Abstract:
Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.

Abstract:
Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available.

Abstract:
Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) -- a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.

Abstract:
In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover n cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of n. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.

Abstract:
Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

Abstract:
The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, facilitating the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight and precise edits, finally achieving true real-time editing on these challenging models.

Abstract:
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024^ 2 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Abstract:
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.

Abstract:
We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.

Abstract:
Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.

Abstract:
Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.

Abstract:
Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinders typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.

Abstract:
Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. We are the first to explore and propose a framework for continuous, fine-grained instruction control in image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

Abstract:
We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is a analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.

Abstract:
Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.

Abstract:
Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.

Abstract:
While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

Abstract:
Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model (\theta_0) or the task vectors (\tau_t) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates (\Delta W) that constitute \tau_t during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods.

Abstract:
Articulated 3D objects are critical for embodied AI, robotics, and scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given the image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we explore a VLM-based reprediction strategy for discrete attribute refinement and a differentiable forward kinematics module for continuous parameter optimization under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.

Abstract:
Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision.Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views.To address these challenges, we propose S^2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training.Extensive experiments demonstrate that S^2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

Abstract:
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

Abstract:
Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To this end, we propose a Text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache-based voltage refreshment mechanism are introduced into the simulator to generate event data.To narrow the sim-to-real gap, we also introduce background activity noise injection and dense time stamp reconstruction operations. Extensive experiments demonstrate Texvent's superior computational efficiency and its ability to generate more realistic event data than existing simulators.

Abstract:
Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence [??] finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L [??] due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED [??] reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.

Abstract:
Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK@0.01), strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches.

Abstract:
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

Abstract:
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges.However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game.This paper aims to propose a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification).Concretely, we make the following contributions in this paper:(i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via supervised multi-task pretraining;(ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed as SoccerFactory, to produce scalable multi-task training annotations;and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority.The data, code, and model will be publicly available to the research community.

Abstract:
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, data-efficient supervision, as well as LLM-based Q&A. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians' centers, colors, estimated normals. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9xfewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model is released at this codebase.

Abstract:
Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware FlowExploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.

Abstract:
Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the 10^ -5 level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.

Abstract:
Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

Abstract:
Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human studies, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.

Abstract:
Text-to-image (T2I) diffusion models effectively produce semantically aligned images, but their reliance on training distributions constrains their capacity for synthesizing truly novel, out-of-distribution concepts. Existing methods attempt to enhance creativity through semantic exploration, such as fusing known concept pairs, but the resulting images remain linguistically describable and confined to familiar semantic spaces. Inspired by the soft probabilistic outputs of classifiers on novel or out-of-distribution inputs, we propose Distribution-Conditional Generation, a paradigm that models novel concepts as image synthesis conditioned on class distributions, enabling controllable yet semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that unifies conditional and unconditional creative generation by decoding latent representations--either randomly sampled or mapped from conditions (e.g., class distributions)--into tokens representing novel concepts. DisTok is trained by iteratively sampling and fusing concept pairs from a dynamic pool to model progressively complex distributions, while enforcing semantic consistency through a vision-language model that aligns the class distributions of generated images with the input distributions. Extensive experiments demonstrate that DisTok enables efficient and flexible semantic exploration for token-level creative synthesis, achieving state-of-the-art text-image alignment and human preference.

Abstract:
Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and the lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause lies in that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream application, such as VR and Robotics.

Abstract:
Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems' perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on a few hyperparameters, and eliminates the need for data-driven training, enabling the usage in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis-critical for stereo accuracy-while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previous published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis 20 times to 0.025 deg and its depth accuracy 50 times. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.

Abstract:
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights and it achieves competitive performance with state-of-the-art results on several benchmarks. Project Page: alberto-rota.github.io/UnReflectAnything

Abstract:
Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.

Abstract:
We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We scale embodied agents through three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. When fine-tuned, pre-training transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.

Abstract:
We present 4DPM, a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving through the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.

Abstract:
Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss.

Abstract:
Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. Given the highly dynamic and complex human motion covering various styles and genres, dance generation requires satisfying diverse conditions beyond just music (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.

Abstract:
Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems, etc. In such applications, it is commonly needed to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintains competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method.

Abstract:
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.

Abstract:
Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects.To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness.Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

Abstract:
Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for dense multi-projector systems with overlapping projection regions, such as high-brightness stacking, super-resolution, light-field, and shadow-suppression displays.

Abstract:
Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.

Abstract:
Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging using only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically-aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, and interpretable hyperspectral imaging.

Abstract:
Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose a Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctoMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5% improvement in success rate over existing methods, validating the effectiveness of our design.

Abstract:
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. OLATverse is publicly available at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.

Abstract:
Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., \pi^3) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3) diffusion refinement framework to correct FUSER's estimates through a denoising process over the joint SE(3)^N space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)^N variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.

Abstract:
Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials.However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces.In this work, We propose a multi-layer depth estimation method, SeeGroup, consisting of novel recurrent decomposition module design and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34% to 70.09%.

Abstract:
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects - Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference - jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity - standardizing how future models are judged not only by how real they look, but by how real they behave.

Abstract:
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced--high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image kNN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.

Abstract:
Modern optical flow estimation, though empowered by recent deep neural architectures, remains rooted in the discrete correspondence paradigm inherited from classical vision. Most networks infer frame-to-frame displacements, capturing where pixels move but not how motion evolves continuously through time. Yet physical motion in the real world follows smooth dynamics governed by underlying velocity fields, as long established in fluid mechanics and transport theory. To bridge this gap, we introduce Optical Flow Matching (OFM), a continuous formulation that learns a time-dependent velocity field to transport pixel coordinates along motion distribution coherent trajectories. A key component of our OFM is Triangle Velocities Synergy (TVS), a neat yet pragmatic geometric mechanism that provides a stable and physically meaningful velocity construction, ensuring that continuous transport remains well-defined. Combined with an Euler-based ODE solver, OFM yields flow fields that are temporally smooth, geometrically consistent, and process-interpretable. Experiments on Sintel, KITTI, and Spring demonstrate that OFM achieves state-of-the-art accuracy, enhanced temporal stability, and notably stronger cross-dataset generalization, advancing optical flow estimation from correspondence inference to continuous dynamical reasoning.

Abstract:
One of the most exciting applications of vision models involve pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps--depth and motion--estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

Abstract:
Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level \varepsilon-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.

Abstract:
Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present "CURE", an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance emphasizing on harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability.

Abstract:
Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.

Abstract:
Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based Spike-Driven Tracking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61M parameters and 8.16mJ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.

Abstract:
6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.

Abstract:
We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel's dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10--100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.

Abstract:
The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

Abstract:
Sign Language Translation (SLT) converts continuous sign videos into spoken language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality. We present BoostSLT, a context-aware framework for enhancing semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an Energy-Aware Temporal Segmentation (EAT-Seg) module that dynamically partitions videos into semantically coherent fragments, followed by a Diffusion-based Semantic Reconstruction (DSR) module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, seamlessly integrating with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and ROUGE-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT. Code is available at github.com/K1sna/BoostSLT.

Abstract:
Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm--which alternates between global temporal reasoning and local frame examination--has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.

Abstract:
Ensuring the authenticity and ownership of digital images is increasingly challenging as modern editing tools enable highly realistic forgeries. Existing image protection systems mainly rely on digital watermarking, which is susceptible to sophisticated digital attacks. To address this limitation, we propose a hybrid optical-digital framework that incorporates physical authentication cues during image formation and preserves them through a learned reconstruction process. At the optical level, a phase mask in the camera aperture produces a Null-space Optical Watermark (NOWA) that lies in the Null Space of the imaging operator and therefore remains invisible in the captured image. Then, a Null-Space Network (NSN) performs measurement-consistent reconstruction that delivers high-quality protected images while preserving the NOWA signature. The proposed design enables tamper localization by projecting the image onto the camera's null space and detecting pixel-level inconsistencies. Our design preserves perceptual quality, resists common degradations such as compression, and establishes a structural security asymmetry: without access to the optical or NSN parameters, adversaries cannot forge the NOWA signature. Experiments with simulations and a prototype camera demonstrate competitive performance in terms of image quality preservation and tamper localization accuracy compared to state-of-the-art digital watermarking and learning-based authentication methods.

Abstract:
Large-scale foundation models pretrained on massive datasets have demonstrated strong generalization capabilities in medical image analysis. However, they are typically trained on static datasets and struggle to cope with the continuously evolving nature of clinical data, where new imaging devices, institutions, and disease subtypes constantly emerge. While domain-incremental learning (DIL) provides a solution for sequential adaptation without revisiting historical data, existing methods typically assume fixed label spaces and limited domain heterogeneity, restricting their applicability to real-world clinical scenarios. To address these challenges, we propose DK-DDIL, a rehearsal-free framework for dynamic DIL that integrates two synergistic modules: a Dynamic Adaptation Module (DAM) employing dynamic rank selection and adaptive regularization to flexibly allocate model capacity under domain shifts, and a Knowledge Inheritance and Refinement (KIR) module that stabilizes cross-domain knowledge transfer through selective adapter fusion and prototype-level contrastive refinement. Experiments on the Skin Pathology Diagnosis dataset, the Cyst-X 3D MRI cohort, and the OfficeHome benchmark demonstrate that DK-DDIL consistently outperforms state-of-the-art DIL approaches, highlighting its effectiveness and versatility across dynamic 2D medical, 3D medical, and natural image domains.

Abstract:
Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes new state of the art in open-world promptable 3D segmentation.

Abstract:
Vision-Language-Action (VLA) models have significantly advanced robotic agents capable of executing diverse tasks; however, they remain limited in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretrain stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment.To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations.Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time close-loop responses within 0.04 s.Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks.

Abstract:
Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM's mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

Abstract:
Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

Abstract:
YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than 1.43x training speedup for YOLO-series detectors while also improving accuracy.

Abstract:
Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce SGDIR, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.

Abstract:
Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.

Abstract:
Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R^2-Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R^2-Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.

Abstract:
Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.

Abstract:
We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method's superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.

Abstract:
Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

Abstract:
Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors, reflections of moving people), demonstrate robust separation and recovery of temperature fields.

Abstract:
Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues--velocity and acceleration--which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.

Abstract:
Today's strongest video-language models (VLMs) remain proprietary, and the strongest open-weight models often rely on synthetic data from proprietary VLMs and do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require grounding--either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking). Our model weights, new datasets, and source code are available at https://allenai.org/blog/molmo2.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as an efficient approach for photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To reduce the computational cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced to reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that uses a proxy representation to introduce occlusion awareness for Gaussians from arbitrary views. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at a resolution of 1000 x 1000 in under 1 ms. This proxy serves two roles. First, it guides the culling of anchors and Gaussians to accelerate rendering. Second, it guides densification toward scene surfaces during training, reducing inconsistencies in occluded regions and thereby improving rendering quality. In heavily occluded scenarios such as the MatrixCity Streets dataset, Proxy-GS achieves more than a 2.5x speedup over Octree-GS while also improving rendering quality.

Abstract:
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable to wild animals. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in 46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing multi-animal tracking in the wild. The dataset is available at conservationxlabs.com/SA-FARI.

Abstract:
Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces still remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data consist of more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms--including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.

Abstract:
Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.

Abstract:
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.

Abstract:
Multi-modal learning approaches that integrate pathological images with genomic profiles have significantly enhanced the accuracy of survival prediction tasks. However, previous methods often struggle to effectively process long-range gigapixel whole slide images (WSIs) and sparse genomic profiles due to the limitations of conventional scanning strategies to serialize data and the complex and heterogeneous nature of the modalities. Inspired by recent advancements in Mamba and mixture of experts (MoE), we propose a novel multi-directional composite scanning strategy with mixture of attention and Mamba experts (MDCS-MoAME) for cancer survival prediction. Specifically, we introduce a multi-directional composite scanning (MDCS) strategy to both WSIs and genomic profiles, and use the Mamba encoder to process intra-modal representations at the region, patch, and gene level, ensuring sufficient utilization of the intrinsic information within each modality. To further capture heterogeneous inter-modal representations, we introduce mixture of attention and Mamba experts (MoAME), which dynamically selects tailored experts to model complex inter-modal correlations, flexibly focusing on the interactions between modalities. Finally, we introduce alignment constraints to recalibrate inter-modal interactions and reduce intra- and inter-modal representation redundancy, enhancing its discriminative power for comprehensive survival analysis. Experimental results on five publicly available datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance.

Abstract:
Recently, iris recognition is regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 546 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods, the reliance on a pre-processing, normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior arts, indicating a promising direction for robust iris recognition.

Abstract:
Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods had focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant).In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where it performs poorly.

Abstract:
Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.

Abstract:
Logit-based Knowledge Distillation (KD) has emerged as a lightweight alternative to feature-based KD. Recent logit-based methods often rely on multi-knowledge alignment and relational modeling. These methods are often inefficient due to redundant objectives, suboptimal transformations, and poorly designed loss functions. Motivated by these issues, we propose Streamlined Knowledge Distillation (SKD), a simple yet effective logit-based method that transfers only two essential forms of knowledge without requiring additional alignment or relational modeling. Specifically, SKD transfers instance-wise knowledge via Kullback-Leibler divergence and direction-wise knowledge by aligning the Gramian matrix of normalized logits. For the latter, we introduce a Mahalanobis distance-based direction-wise loss stabilized through Tikhonov regularization and Cholesky decomposition. This direction-wise loss accounts for variance and correlation in the output space and, as we formally show, is equivalent to the L2-norm in a covariance-whitened space. Extensive experiments demonstrate that SKD consistently outperforms existing logit-based methods and even surpasses feature-based methods, despite its simpler design. Code is available at https://github.com/HyunJunSik/StreamLined.

Abstract:
We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

Abstract:
Understanding egocentric videos plays a vital role for embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph based curation ensure both the data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct - a large scale training dataset of 3M samples, and EgoAVU-Bench - a manually verified evaluation split of 3K samples. EgoAVU-Bench clearly reveals the limitation of existing MLLMs: they bias heavily towards visual signals, often neglecting audio cues or failing to correspond audio with the visual source. Finetuning MLLMs on EgoAVU-Instruct effectively solves this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefit can also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Project page: https://cs20s030.github.io/EgoAVU/

Abstract:
Humans learn by observing, interacting with environments, and internalizing physics and causality. We explore whether agents can similarly acquire human-like reasoning through interaction and experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle with underlying physics and causality due to overfitting on visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and details can be found at https://mybearyzhang.github.io/ipr-1.

Abstract:
Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9xfaster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D scene understanding. Code is available at github.com/hujiecpp/PE3R.

Abstract:
Handle-based mesh deformation is a classic paradigm in computer graphics which enables intuitive edits from sparse controls. Classical techniques are fast and precise, but require users to know ideal handle placement apriori, which can be unintuitive and inconsistent. Handle sets cannot be adjusted easily, as weights are typically optimized through energies defined by the handles. Modern data-driven methods, on the other hand, provide semantic edits but sacrifice fine-grained control and speed. We propose a technique that achieves the best of both worlds: deep feature proximity yields smooth, visual-aware deformation weights with no additional regularization. Importantly, these weights are computed in real-time for any surface point, unlike prior methods which require expensive optimization. We introduce barycentric feature distillation, an improved feature distillation pipeline which leverages the full visual signal from shape renders to make distillation complexity robust to mesh resolution. This enables high resolution meshes to be processed in minutes versus potentially hours for prior methods. We preserve and extend classical properties through feature space constraints and locality weighting. Our field representation enables automatic visual symmetry detection, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine. Project page at https://threedle.github.io/dfd.

Abstract:
Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving M clients with distinct data distributions is inherently a multi-objective optimization problem, where achieving optimal personalization ideally requires M distinct models on the Pareto front. However, maintaining M separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few-for-many optimization problem that maintains only K shared server models (K << M) to collectively serve all M clients. We prove that this framework achieves near-optimal personalization: the approximation error diminishes as K increases and each client's model converges to each client's optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the K server models through efficient gradient-based updates. Unlike clustering-based approaches that require manual client partitioning or interpolation-based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real-world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state-of-the-art approaches. Code is available at https://github.com/pgg3/FedFew.

Abstract:
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point-to-horosphere distance interpretation, batch-efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers. The code is available at https://github.com/GitZH-Chen/HBNN.

Abstract:
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes, and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible representation of 3D Gaussians enables voxelization at arbitrary user-specified resolutions, while the strong generalization capability of VFMs supports accurate perception and open-vocabulary recognition without requiring any network training or fine-tuning. To validate the generality and effectiveness of our framework, we present two variants: a LiDAR-based version and a vision-centric version, and conduct extensive experiments on the Occ3D-nuScenes and nuCraft benchmarks under varying voxel resolutions. Experimental results show that TT-Occ significantly outperforms existing computationally expensive pretrained self-supervised counterparts. Code is available at https://github.com/Xian-Bei/TT-Occ.

Abstract:
Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. The proposed framework has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG.

Abstract:
In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at https://github.com/sspBIT/CDA-VSR.

Abstract:
While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

Abstract:
Residual diffusion models have achieved remarkable progress in image restoration tasks. In near-domain transformations such as image deraining, however, we observe that the residuals exhibit an inherent low-rank structure. Motivated by this property, we propose the Low-Rank Residual Diffusion Model (LRDM), which performs diffusion within a compact low-rank residual subspace for efficient and structure-preserving restoration. We formalize this observation as the Low-Rank Residual Assumption and show that the variational lower bound becomes tighter when residuals lie in a low-rank space. Building on this insight, we introduce an Asymmetric Residual Diffusion Process that constrains the forward process in the low-rank domain while maintaining full-rank flexibility in the reverse process. To accommodate the varying complexity of residuals across diffusion timesteps, we further introduce an Adaptive Rank Selection mechanism that dynamically adjusts the rank during the diffusion process. Experiments on deraining, deblurring, and deshading benchmarks show that LRDM surpasses full-rank diffusion baselines and achieves state-of-the-art performance, validating the advantage of modeling diffusion in a low-rank residual space. Project Page: \urlstyle same \hypersetup allcolors=magenta https://github.com/JF-Tan/LRDM .

Abstract:
We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (per-step gradients or training trajectories), Influence Matching (Inf-Match) aligns the final outcome of training: it learns a compact synthetic set whose effect on the converged parameters matches that of the full dataset. Concretely, we introduce a fully differentiable, sample-level influence estimator that quantifies parameter shifts from adding or removing data, without time-consuming inverse-Hessian products or convexity assumptions. The estimator runs in linear time by unrolling the optimization dynamics and applying a first-order Taylor approximation. We then learn the synthetic set by minimizing the mismatch between its influence and that of the real dataset, yielding outcome alignment rather than heuristic process imitation. Inf-Match delivers the best accuracy across standard classification benchmarks. For instance, on Tiny-ImageNet (IPC=10), Inf-Match attains 31.5%, a +4.7% improvement over NCFM. Beyond classification, Inf-Match scales to vision-language distillation on Flickr30K, outperforming strong process-matching baselines. For instance, with 200 to 1000 synthetic samples, our method achieved a leading impressive average on image/text retrieval tasks, higher than NCFM by 2.5%. The code will be released via https://github.com/hrtan/infmatch.

Abstract:
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video.

Abstract:
Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.

Abstract:
We train a language model to generate LEGO(r)-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet

Abstract:
Multimodal generative models can process instructions in various modalities and demonstrate outstanding performance across a wide range of image generation tasks. However, their robustness in complex real-world scenarios remains limited due to insufficient generalized instruction alignment. We introduce OmniGen2, a unified multimodal generator designed to follow complex, fine-grained instructions. Our core contribution is a two-stage design that first builds a strong, world-knowledge-grounded foundation model and then aligns it using a progressive, multi-task instruction tuning strategy. The foundation model features a streamlined architecture with decoupled decoding for versatile multimodal generation and a novel positional encoding scheme to improve learning efficiency. We ground this model in real-world knowledge using large-scale data construction pipelines. Building on this foundation, we propose a progressive, reinforcement-based alignment process. This phase carefully schedules training tasks and reward signals to foster cross-task knowledge transfer, significantly improving the model's instruction-following capabilities. Our models demonstrate competitive performance on standard benchmarks and our dedicated in-context generation benchmark, OmniContext. We have released our models, code, benchmark, and training datasets at https://github.com/VectorSpaceLab/OmniGen2.

Abstract:
Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.

Abstract:
Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at \href https://chushanzhang.github.io/3D-IDE/ \textcolor[HTML] E8477A \small\texttt github.com/ChushanZhang/3D-IDE .

Abstract:
Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.

Abstract:
Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A3-Law), a dataset (A3-Dataset), a multimodal large language model (A3-Align), and a benchmark (A3-Bench). Central to A3 is a theory-driven paradigm, A3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A3-Law, we construct A3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A3-Align, trained under A3-Law with CoT-guided learning on A3-Dataset. Extensive experiments on A3-Bench demonstrate that A3-Align achieves superior alignment with A3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align

Abstract:
While Parameter-Efficient Fine-Tuning using Mixture-of-Experts (MoE) is a promising solution for continual learning (CL), it suffers from two critical failure modes: structural interference, where expert updates interfere, and compositional forgetting, where the model's routing policy drifts. To address these issues, we introduce Spectral MoE, a novel framework built for CL from three core components. First, Spectral Experts are parameterized using unique, disjoint spectral masks to confine their learnable parameters to distinct frequency subspaces, ensuring a priori orthogonal updates that prevent structural interference. Second, a Dual-Router mechanism decouples online routing that learns new tasks from an offline memory that archives historical expert importance. Finally, this offline memory enables a Dynamic Consistency Projection, a geometric constraint that suppresses router drift and adaptively shields experts based on their past contributions, mitigating compositional forgetting. Validated on a strict cross-domain CL benchmark, our framework significantly outperforms existing methods, demonstrating superior knowledge retention and plasticity for new tasks. Code is available at: https://github.com/ouycc/Spectral_MoE.

Abstract:
Multi-image diffusion models can generate images like multi-views or videos to describe static or dynamic scenes, yet texture and structure drift persist, severely undermining the spatiotemporal consistency. Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. Specifically, CorrAdapter designs a bypass branch for transformer blocks in the multi-image diffusion model, encompassing a native correspondence constructor that builds reliable correspondences from the diffusion model's intermediate features, and an aligned area aggregator that integrates messages from only matching regions to avoid ambiguous information interactions. Given the native correspondences as guidance, CorrAdapter can enhance spatiotemporal consistency without any auxiliary inputs, and remains training-free and baseline-agnostic, which enables it to generalize seamlessly to various generation tasks. Additionally, we provide an optional training scheme to explore further-improved possibilities. Experiments on both static multi-view generation and dynamic video generation show that CorrAdapter consistently improves spatiotemporal consistency and perceptual quality over strong baselines, offering a simple yet versatile drop-in approach to geometrically faithful multi-image diffusion. Code is available at https://github.com/SuhZhang/CorrAdapter.

Abstract:
The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

Abstract:
Current visual grounding research remains limited for practical applications, because existing techniques primarily focus on direct visual queriesCurrent visual grounding research remains limited for practical applications, because existing tasks primarily focus on direct visual queries (e.g., "find the red car") or reading visible text (e.g., "what is the title of this book?"), rather than supporting general questions about objects (e.g., "how comfortable are these earbuds?"). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous tasks that ground only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this task, we develop two automated data generation techniques, which serve to train a lightweight visual grounding model, and to evaluate visual grounding approaches on the resulting synthetic benchmarks, ABO-VGOQ and VizWiz-VGOQ. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SoTA visual grounding performance decreases from 52% gIoU to 37% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (segmentation of visual evidence). On our new benchmarks, our lightweight model outperforms prior models while being much smaller. Project page: https://martin-ev.github.io/vgoq.

Abstract:
The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-exposure camera system. Extensive evaluations demonstrate the effectiveness of the model across diverse conditions. Codes Available: https://github.com/MyNiuuu/MAD-Avatar

Abstract:
Synthesizing novel spatiotemporal views of dynamic scenes is challenging due to object and camera motion, and sparse observations. While recent Neural Radiance Field (NeRF) and Gaussian Splatting (GS) methods enable 4D dynamic scene reconstruction, they predominantly assume well-lit inputs. Existing low-light reconstruction approaches are limited to static scenes and mainly focus on brightness enhancement while overlooking underlying scene structure. Reconstructing well-lit dynamic scenes from low-light inputs is particularly challenging due to motion-induced shadows, occlusions, and disocclusions, making the problem highly ambiguous. We propose L^ 2 DGS (Low-Light Dynamic Gaussian Splatting), a self-supervised 4D GS framework that directly reconstructs well-lit dynamic scenes from low-light videos. The method decomposes the scene into view- and time-dependent illumination and view-time-invariant reflectance components. We introduce an Occlusion-Disocclusion Network (OCD-Net) to model temporal intensity variations and Brightness Attenuation Features (BAFs) with a BAF Enhancement Network (BAFE-Net) to enable geometry- and photometry-aware transformation between well-lit and low-light observations for self-supervision. L^ 2 DGS operates on standard sRGB inputs without requiring camera metadata. Experiments on simulated and proposed real Low-Light Dynamic Video (L^ 2 DyV) datasets demonstrate superior qualitative and quantitative performance over prior methods. The dataset is available at: \href https://github.com/akumar005/L2DGS https://github.com/akumar005/L2DGS .

Abstract:
Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C. Project page is publicly available at https://github.com/Cevaaa/NCTTA.

Abstract:
Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.

Abstract:
Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at https://github.com/BNU-IVC/MMGait.

Abstract:
Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code is publicly available at: https://github.com/TaylorJocelyn/Sampling-aware-Quantization.

Abstract:
Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/

Abstract:
Backdoor attacks pose a significant threat to deep neural networks. Recent studies have shown that model compression, such as quantization and pruning, can be exploited by attackers to implant conditional backdoors. Such backdoors remain dormant in the original model but are activated after the model undergoes specific operations, making them highly stealthy and difficult to detect. Traditional defense methods struggle to counter this type of attack, while defenses specifically designed for conditional backdoors also have difficulty handling traditional backdoor attacks. To address these challenges, we propose a universal defense method, termed Logit Margin Repulsion (LMR). LMR uses a small set of clean samples and combines selective cross-entropy with a logit-margin constraint to enlarge the gap between the backdoor class and benign classes. It then removes channels associated with backdoor behavior through selective pruning, thereby achieving strong backdoor purification. Extensive experiments on a variety of CNNs and Vision Transformers demonstrate that, even with an extremely limited amount of clean data (0.1%), LMR can effectively mitigate both traditional and conditional backdoor attacks. The implementation is publicly available on https://github.com/Trusted-LLM/LMR.

Abstract:
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry.In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code, data, and trained models will be released.

Abstract:
Stereo foundation models achieve strong zero-shotgeneralization but remain computationally prohibitive forreal-time applications. Efficient stereo architectures, on the other hand, sacrificerobustness for speed and require costly per-domain fine-tuning.To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo

Abstract:
Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.

Abstract:
Shooting video with handheld shooting devices often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the deblurring ability of the model, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct synthetic and real-world handheld video datasets for handheld video deblurring. Extensive experiments on these and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://cshonglei.github.io/SelfHVD.

Abstract:
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems. Our project page is available at https://comp-bench.github.io/.

Abstract:
Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.

Abstract:
Reported face verification accuracy has reached a plateau on current well-known test sets. As a result, some difficult test sets have been assembled by reducing the image quality or adding artifacts to the image. However, we argue that test sets can be challenging without artificially reducing the image quality because the face recognition (FR) models suffer from correctly recognizing 1) the pairs from the same identity (i.e., genuine pairs) with a large face attribute difference, 2) the pairs from different identities (i.e., impostor pairs) with a small face attribute difference, and 3) the pairs of similar-looking identities (e.g., twins and relatives). We propose three challenging test sets to reveal important but ignored weaknesses of the existing FR algorithms. To challenge models on variation of facial attributes, we propose Hadrian and Eclipse to address facial hair differences and face exposure differences. The images in both test sets are high-quality and collected in a controlled environment. To challenge FR models on similar-looking persons, we propose ND-Twins, which contains images from a dedicated twins dataset. The LFW test protocol is used to structure the proposed test sets. Moreover, we introduce additional rules to assemble "Goldilocks1" level test sets, including 1) restricted number of occurrence of hard samples, 2) equal chance evaluation across demographic groups, and 3) constrained identity overlap across validation folds. Quantitatively, without further processing the images, the proposed test sets have on-par or higher difficulties than the existing test sets that add artifacts to the images. The datasets are available at: https://github.com/HaiyuWu/SOTA-Face-Recognition-Train-and-Test.

Abstract:
We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead. Project page: https://rewis3d-mpi.github.io/Rewis3d/

Abstract:
Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C\textsuperscript 2 GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.

Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus only on the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. In this paper, we introduce Interaction-Aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on explicit interaction modeling by requiring separate segmentation of actor and target objects.This formulation enables fine-grained understanding of object relationships, as many video events are defined by such interactions rather than individual objects. We present InterRVOS-127K, a large-scale dataset of over 127K automatically annotated expressions with distinct actor-target mask pairs, and propose ReVIOSa, a MLLM-based architecture that introduces interaction-aware special tokens and attention mask loss (AML) to enhance interaction-aware segmentation. We also propose a new evaluation protocol that separately evaluates actor and target segmentation for more accurate role distinction. Comprehensive experiments demonstrate that ReVIOSa outperforms existing baselines on the proposed InterRVOS-127K benchmark, with further analyses validating the necessity and effectiveness of both ReVIOSa and InterRVOS-127K.

Abstract:
Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model. Project page is available at https://cvlab-kaist.github.io/CAMEO/.

Abstract:
Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity might lead to poor reconstruction, whereas insufficient sparsity harms interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that adaptive sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the number of concepts in a data-dependent manner.Specifically, we first explore a new class of SAEs based on the cross-attention architecture with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of concepts according to the complexity of each neuron, resulting in a more flexible and efficient activation function. Through comprehensive evaluation and visualization, we show that our approach successfully achieves lower reconstruction loss while producing high-quality concepts. Moreover, the sparsity level automatically determined by our approach can serve as tuning guidance to improve existing SAEs. The code is available at https://github.com/qyj-bkjx/Sparsemax-SAE.

Abstract:
We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments. We demonstrate three key tasks enabled by COinCO: (1) a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique level semantics, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision, including image forensics. Code and dataset are available at https://co-in-co.github.io/.

Abstract:
Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals, and holds substantial promise for applications such as film dubbing and voice restoration for individuals with aphonia. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalisation and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at https://github.com/Jiaxin-Ye/HiCoDiT.

Abstract:
Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project is available at https://dof-gaussian.github.io/.

Abstract:
Out-of-distribution (OOD) detection is critical to ensure the safe deployment of deep learning models in critical applications. Deep learning models can often misidentify OOD samples as in-distribution (ID) samples. This vulnerability worsens in the presence of spurious correlation in the training set. Likewise, in fine-grained classification settings, detection of fine-grained OOD samples becomes inherently challenging due to their high similarity to ID samples. However, current research on OOD detection has focused instead largely on relatively easier (conventional) cases. Even the few recent works addressing these challenging cases rely on carefully curated or synthesized outliers, ultimately requiring external data. This motivates our central research question: "Can we innovate OOD detection training framework for fine-grained and spurious settings without requiring any external data at all?" In this work, we present a unified Approach to Spurious, fine-grained, and Conventional OOD Detection (\ASCOOD) that eliminates the reliance on external data. First, we synthesize virtual outliers from ID data by approximating the destruction of invariant features. Specifically, we propose to add gradient attribution values to ID inputs to disrupt invariant features while amplifying true-class logit, thereby synthesizing challenging near-manifold virtual outliers. Then, we simultaneously incentivize ID classification and predictive uncertainty towards virtual outliers. For this, we further propose to leverage standardized features with z-score normalization. ASCOOD effectively mitigates impact of spurious correlations and encourages capturing fine-grained attributes. Extensive experiments across 7 datasets and comparisons with 30+ methods demonstrate merit of ASCOOD in spurious, fine-grained and conventional settings.

Abstract:
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to model inter-clip relationships. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated T2V generation toward text-and-video-to-video (T&V2V) generation. CI-VID contains over 340,000 samples, each comprising a semantically coherent video sequence with interleaved text captions that capture both clip-level content and inter-clip relationships. To validate its effectiveness, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID significantly improve both accuracy and content consistency in multi-clip video generation. This enables the creation of story-driven content with smooth transitions and strong semantic coherence.

Abstract:
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets. The code is available at https://github.com/Yuanjiayii/VGGT-360.

Abstract:
Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures--OSTrack, ODTrack, and LoRAT--and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k. The source code is available at https://github.com/NorahGreen/DTPTrack.

Abstract:
Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks. Code is available at https://github.com/auihund/SIGMA.

Abstract:
Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to singleview settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond singleview generation, ORV natively supports multiview consistent synthesis and enables simulation-to-real transfer despite significant domain gaps. Code, models, and data will be released upon acceptance.

Abstract:
People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multi-modal systems. Code is available at: https://github.com/cvlab-stonybrook/Personalized-Image-Description

Abstract:
Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments. Our project page is available at \href https://dgalor.github.io/hist2style/ dgalor.github.io/hist2style/ .

Abstract:
Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes GPERT, a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real world datasets and competitive performance on the synthetic dataset. Also, the proposed method works with out prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. https://github.com/e3ai/gpert

Abstract:
Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation address this problem by generating additional data, while head-to-tail transfer further mitigate the generator bias inherit from long-tailed dataset. However, we show that while head-to-tail transfer helps balance the decision space of the classifier, it also induces latent non-local feature mixing that entangles inter-class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose Decision Boundary-aware Generation (DBG) framework, which promotes near-boundary representation learning by generating informative near-boundary samples. Overall, DBG rebalances the long-tailed dataset while yielding more separable decision space for long-tailed learning. Across standard long-tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter-class overlap. The code of DBG is available at https://github.com/keepdigitalabc-svg/DBG.

Abstract:
Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.

Abstract:
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.

Abstract:
Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models.To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution.Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs.Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. The code is available at https://fedespu.github.io/Video-Panels.

Abstract:
Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.

Abstract:
Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, neglecting fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the affect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages, while in each stage we estimate the priory of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration, in order to achieve a stable pruning process and avoid local optimum. Extensive experiments across various vision-language tasks and models demonstrate that our method effectively promotes the performance under high pruning ratios by comparing to the state-of-the-art approaches. The source code is available at https://github.com/Wuzimeng/CoMP.git.

Abstract:
In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.

Abstract:
Image-to-Video (I2V) generation represents a frontier in content creation, where models synthesize dynamic visual sequences by jointly reasoning from both image and text prompts. This multimodal grounding enables diverse controllability over video attributes. However, it is precisely this capability that introduces a critical security blind spot: by exploiting the interplay between visual and textual cues, attackers can launch multimodal jailbreak attacks that severely compromise output security. Despite the increasing implementation of security mechanisms in real-world I2V systems, such cross-modal threats remain unexplored. Existing attack methods remain confined to single-modal settings, relying solely on isolated text or image perturbations, which severely limits their effectiveness. To bridge this gap, we propose Runaway Evil, the first multimodal jailbreaking framework for I2V models with dynamic evolutionary capability. Built on a Strategy-Tactic-Action paradigm, our framework exhibits self-amplifying attack through three core components: (1) a strategy-aware command unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and large language model (LLM)-based strategy exploration; (2) a multimodal tactical planning unit that generates synergistic text jailbreak instructions and image tampering guidelines based on the selected strategies; and (3) an tactical action Unit executes and evaluates the coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate that Runaway Evil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. This work provides a critical tool for probing and mitigating multimodal vulnerabilities, laying a foundation for building more robust video generation systems. The code is available at: https://github.com/DeepSota/RunawayEvil.

Abstract:
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. Our model and dataset are at https://myangwu.github.io/ConsID-Gen.

Abstract:
Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.

Abstract:
Vision-Language-Action models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model embeddings. Recent advancements have introduced explicit intermediary reasoning--such as sub-task prediction (language) or goal image synthesis (vision)--to guide action generation. However, these intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method. Code is available at: https://github.com/AgibotTech/ACoT-VLA.

Abstract:
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

Abstract:
Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. And our project page is available at https://infinityhuman.github.io/.

Abstract:
Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g 2048^2) outputs often remains extremely poor, primarily due to two factors we consider: the image upsampling ratio (e.g x8) exceeding the model's native-supported upsampling ratio (e.g x4), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it hard on limited-resource equipment. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at R-resolution, and the second introduces a looped chunk-based training strategy at NR-resolution. Each stage adapts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of 1024^2 and even 2048^2, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

Abstract:
Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii)Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .

Abstract:
Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view object geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area. Our dataset and code will be released at \texttt https://github.com/LV-BO001/MOGeo .

Abstract:
Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI- optimized, or single-cell hematology models, these approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and linguistic representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, mask language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance compared to training on a single task and single-dataset models, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code is available at https://github.com/intelligentMachines-ITU/Uni-Hema

Abstract:
Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we study VLMs through the lens of neural topology, representing each layer as a within-layer correlation graph derived from neuron-neuron co-activations. This view allows us to ask whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention. We show that correlation topology carries recoverable behavioral signal; moreover, cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation materially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Code is publicly available at https://github.com/he-h/vlm-graph-probing.

Abstract:
Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/

Abstract:
Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations. Project page: https://changwoonchoi.github.io/GeoStyle

Abstract:
Deep learning models continue to scale, with some requiring more storage than many large-scale datasets. Thus, we introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains not present in the training data, while known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF) when transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code and implementation for this work are publicly available at: https://github.com/Nicolas1203/continual_distillation

Abstract:
Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches largely rely on visual-only distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision-language models can serve as a stable semantic anchor across domains. To decouple the roles of vision and text, we propose Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework for semantic alignment and cross-domain generalization. On the textual side, we distill prompts to preserve vision-text alignment under a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal. On the visual side, an EMA-based teacher with an adaptive prompt pool enables domain-wise adaptation by allocating new slots while freezing past ones. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across seen and unseen domains, achieving a strong balance between stability and plasticity. Project page: https://github.com/zu-zi/PAD.

Abstract:
State-of-the-art text-to-image models suffer from a persistent identity crisis when generating scenes with multiple humans: producing duplicate faces, merging identities, and miscounting individuals. We present DisCo (Reinforcement with Diversity Constraints), a reinforcement learning framework that directly optimizes identity diversity both within images and across groups of generated samples. DisCo fine-tunes flow-matching models using Group-Relative Policy Optimization (GRPO), guided by a compositional reward that: (i) penalizes facial similarity within images, (ii) discourages identity repetition across samples, (iii) enforces accurate person counts, and (iv) preserves visual fidelity and prompt alignment via human preference scores. A single-stage curriculum stabilizes training as prompt complexity increases. Importantly, this method does not require any real data. On the DiverseHumans Testset, DisCo achieves 98.6% Unique Face Accuracy and near-perfect Global Identity Spread, outperforming open-source and proprietary models (e.g., Gemini, GPT-Image) while maintaining perceptual quality. Our results establish cross-sample diversity as a critical axis for resolving identity collapse, positioning DisCo as a scalable, annotation-free solution for multi-human image synthesis. Project page: https://qualcomm-ai-research.github.io/disco/.

Abstract:
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce texual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) to maintain a dynamic knowledge base and employ a Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.

Abstract:
Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.

Abstract:
3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at https://sjj118.github.io/RT-Splatting.

Abstract:
In this paper, we propose an extremely efficient, training-free method to extract token-level reward signals directly from an existing deep reward model. Our core idea is to attribute the overall process reward to individual tokens by estimating each token's influence. This influence is defined as the change in the final macroscopic reward (e.g., the process reward) when a token is replaced with a semantically null token. Naively calculating this influence is computationally infeasible, requiring N forward passes through the PRM for an N-token sequence. We overcome this bottleneck by proposing a highly efficient gradient-based estimator. Specifically, we use a first-order Taylor approximation, which simplifies the influence calculation to the inner product of the difference between the token embedding and the null token embedding, and the gradient of the reward with respect to the token embedding. This requires only a single forward and backward pass. The resulting token-level rewards enable standard RL algorithms to perform precise credit assignment without requiring additional reward model training. Experiments on challenging reasoning benchmarks demonstrate that our method substantially improves policy optimization efficiency and enhances the generalization of LLM reasoning capabilities. Our P2T outperforms the outcome reward by +4.9% on MathVista for Qwen2.5-VL-7B-Instruct, and +11.5% on AIME24 for Qwen2.5-Math-7B. The code is available at: https://github.com/JIA-Lab-research/P2T.

Abstract:
The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/.

Abstract:
We propose Delta Rectified Flow Sampling (DRFS), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DRFS is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that disabling this shift recovers Delta Denoising Score (DDS), bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, under rectified-flow dynamics, a linear shift schedule recovers the inversion-free method FlowEdit as a strict special case, yielding a unifying view of optimization and ODE editing. We conduct an analysis to guide the design of our shift term, and experimental results on the widely used PIE Benchmark indicate that DRFS achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications. Code is available at https://github.com/Harvard-AI-and-Robotics-Lab/DeltaRectifiedFlowSampling.

Abstract:
Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks - rendering, recognition, instruction-following, and question-answering - and four 3D datasets, synthetic and real-world. We show our model's effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

Abstract:
The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.

Abstract:
Text-to-multiview (T2MV) diffusion models have shown great promise in generating multiple views of a scene from a single text prompt. While few-step backbones enable real-time T2MV generation, they often compromise key aspects of generation quality, such as per-view fidelity and cross-view consistency. Reinforcement learning (RL) finetuning offers a potential solution, yet existing approaches designed for single-image diffusion do not readily extend to the few-step T2MV setting, as they neglect cross-view coordination and suffer from weak learning signals in few-step regimes. To address this, we propose MVC-ZigAL, a tailored RL finetuning framework for few-step T2MV diffusion models. Specifically, its core insights are: (1) a new MDP formulation that jointly models all generated views and assesses their collective quality via a joint-view reward; (2) a novel advantage learning strategy that exploits the performance gains of a self-refinement sampling scheme over standard sampling, yielding stronger learning signals for effective RL finetuning; and (3) a unified RL framework that extends advantage learning with a Lagrangian dual formulation for multiview-constrained optimization, balancing single-view and joint-view objectives through adaptive primal-dual updates under a self-paced threshold curriculum that harmonizes exploration and constraint enforcement. Collectively, these designs enable robust and balanced RL finetuning for few-step T2MV diffusion models, yielding substantial gains in both per-view fidelity and cross-view consistency. Code is available at https://github.com/ZiyiZhang27/MVC-ZigAL.

Abstract:
Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: https://davidea97.github.io/paparazzo-page/

Abstract:
Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign-task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available at: https://github.com/TanqiuJiang/DTR.

Abstract:
Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the R esidual D iffusion B ridge M odel (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Additionally, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at https://github.com/MiliLab/RDBM.

Abstract:
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "seeing, reasoning, then editing" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at https://github.com/knightyxp/VideoCoF.

Abstract:
Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at https://github.com/bupt-ai-cz/RE-VLM.

Abstract:
Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative, well-known methods in this emerging area on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further results suggest that the memorization potentially resulted from limited data, overparameterized models, and the underuse of structural priors specific to weight data. Our findings highlight the need for more careful design and evaluation of generative models in new domains.

Abstract:
Human mesh recovery (HMR) models 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.

Abstract:
Open-set semi-supervised learning (OSSL) has achieved notable progress in exploiting unlabeled data, yet most existing methods overlook the distinct sensitivities of in-distribution (ID) and out-of-distribution (OOD) samples to semantic-preserving perturbations, resulting in suboptimal model performance. To address this limitation, we propose Perturbation-Aware Filtering (PAF), which leverages the behavioral difference between ID and OOD samples under perturbations and extends it into a representation-level signal for reliable OOD filtering. Specifically, PAF identifies OOD samples by measuring the representation instability under semantic-preserving perturbations. We then integrate PAF into a carefully designed two-stage training framework, allowing the model to exploit abundant unlabeled data in the first stage and gradually adapt to the open-set setting with limited labels in the second stage. Extensive experimental results on widely-used OSSL benchmarks demonstrate that our proposed PAF approach achieves superior performance compared to state-of-the-art (SOTA) OSSL methods. Our code is available at https://github.com/njustkmg/CVPR26-PAF.

Abstract:
Event-based video reconstruction seeks to recover high-speed, high-dynamic-range videos from event streams. While existing approaches rely exclusively on motion-triggered events, these events are inherently sparse and primarily capture dynamic regions. Therefore, they often suffer from error accumulation and degraded quality in regions with few events. In this work, we introduce aperture-modulation-triggered events as a complementary mechanism to enrich the captured scene information. Specifically, we periodically modulate the aperture to actively generate dense event signals, thereby encoding intensity cues even in static or low-motion regions. Building upon this idea, we design an AE2VID framework that jointly leverages aperture-modulation-triggered and motion-triggered events to enhance the fidelity of predictions. The proposed framework consists of two subnetworks for the dedicated processing of both event types. We further collect a real dataset and validate the effectiveness of our method. Extensive experiments show our superiority over state-of-the-art methods. Code and data will be available at https://github.com/a1henu/AE2VID/.

Abstract:
Safety-critical deep learning systems must be robust against real-world corruptions combining spatially correlated distortions and independent noise.Current deep neural network verification methods handle these perturbations separately, either checking independent pixel-wise perturbations or restricted convolutional transformations using predefined patterns.This gap prevents assessing robustness under realistic conditions where both perturbation types occur simultaneously.To address these limitations, we propose VeriDou, a framework that introduces:(i) universal convolutional perturbations that enable verification across continuous spatial distortion spaces, and(ii) dual perturbations that capture both convolutional distortions and independent pixel-level variations.Our evaluation on a set of diverse benchmarks with 14340 instances shows VeriDou's dual perturbations approach found substantially more adversarial examples on networks that existing methods claimed to be highly robust.This shows that VeriDou is able to explore a broader range of unsafe regions and thus enhances formal assessment of robustness. VeriDou is available at https://github.com/dynaroars/VeriDou.

Abstract:
Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we extend Object Retexture into video domain, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability due to two reasons: conditioning on the raw reference image introduces unwanted structural information, and this method fails to disentangle visual texture and structure information of the source. To address this problem, we proposed a method, namely Refacade, that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving geometry and motion of source videos. Second, we disrupt the reference's global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than global layout of object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations.

Abstract:
Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360 field of view driver attention dataset, containing 1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://dfki-av.github.io/drivergaze360/.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5x faster training while maintaining visual quality, establishing a new cost-effective and resource efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization. Source code available at: https://github.com/nerficg-project/faster-gaussian-splatting

Abstract:
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce FLow-matching ACoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint Acoustic-GeometRy EmbEdding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis. Project page: https://amandinebtto.github.io/FLAC/

Abstract:
Unaligned RGBT tracking aims to achieve robust target localization across spatially misaligned RGB and thermal infrared (TIR) videos, a crucial challenge for deploying RGBT tracking in real-world scenarios. Existing methods often calculate all cross-modal alignment parameters (i.e., spatial shift and scale change) simultaneously, but suffer from two major limitations. 1) They are difficult to adapt to different degrees of unaligned difficulty during tracking. 2) They usually require complex models to handle challenging scenarios, resulting in a large computational burden. To overcome these limitations, we propose a novel Progressive Multi-cue Alignment framework called PMATrack, which disentangles the calculation of cross-modal alignment parameters in a progressive manner and dynamically selects appropriate cues to handle different challenges, thereby enabling robust and efficient unaligned RGBT tracking. In particular, PMATrack divides the cross-modal alignment parameter estimation into three stages to progressively perform center offset computation, scale transformation estimation, and global refinement. At each stage, we design a difficulty-aware router to adaptively select the appropriate alignment expert based on the cross-modal alignment complexity, thereby reducing computational redundancy. In addition, we build a high-quality video benchmark called MUART244 to facilitate the comprehensive evaluation of different unaligned RGBT tracking algorithms. Extensive experiments demonstrate the outstanding performance of PMATrack against existing state-of-the-art methods. The code and dataset will be available at https://github.com/NOP1224/Unaligned_RGBT_Tracking.

Abstract:
Electromagnetic Inverse Scattering Problems (EISP) seek to reconstruct relative permittivity from scattered fields and are fundamental to applications like medical imaging. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. While recent machine learning-based approaches have shown promising results, they often rely on time-consuming, case-specific optimization and perform poorly under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Accordingly, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the incomplete information from sparse measurements. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, we demonstrate, for the first time, high-quality reconstruction from a single transmitter. This work advances practical electromagnetic imaging by providing a new, cost-effective paradigm to inverse scattering. Code and models are released at https://gomenei.github.io/SingleTX-EISP/.

Abstract:
Vision Transformers (ViTs) achieve state-of-the-art performance on a wide range of vision tasks, yet they remain highly vulnerable to adversarial perturbations due to the lack of explicit region-level semantic modeling. Adversarial perturbations are typically local and spatially structured, whereas the globally coupled self-attention and spatially uniform feed-forward networks in ViTs propagate local corruptions across the whole image without enforcing consistency within semantically coherent regions. To mitigate this mismatch, we propose Region-aware Mixture-of-Experts, namely "ReMoE", a plug-and-play module that replaces the standard feed-forward network (FFN) with a region-aware expert layer. Specifically, our ReMoE strategically introduces multi-granularity experts (i.e., global, center, and regional) and couples them with an attention-guided routing mechanism that operates on patch-to-region (P2R) and region-to-patch (R2P) transformations. This mechanism adaptively activates the most relevant experts for each spatial location according to its attention profile, enabling the model to capture region-level semantics and local context while preserving global consistency, thereby providing a stronger inductive bias for adversarially robust ViT representations. Extensive experiments demonstrate that our ReMoE substantially improves the adversarial robustness of ViTs with only marginal additional computational cost. Our code is available at https://github.com/zhongskr0114/ReMoE.

Abstract:
Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%). Our code is available at https://github.com/niejiahao1998/MPA.

Abstract:
3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are two fold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization and Gaussian map initialization, enhancing localization accuracy under exposure-varying scenarios. Extensive experiments on public datasets and our self-collected real-world dataset demonstrate that AERGS-SLAM outperforms baselines in both localization performance and photorealistic mapping quality. Our code is available at: https://github.com/zzy-2021/AERGS-SLAM.

Abstract:
In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entan- gling shadows with surface appearance. This limits their ability to accurately separate lighting effects from mate- rial properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements-- regions of the scene that undergo motion--as a supervisory signal for inverse rendering. Motion reveals the same sur- faces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This the- sis is supported by our experimental results which show we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that em- ploys a set of novel constraints which encourage the dy- namic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting.

Abstract:
The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to p-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. Our implementation achieves up to 89% speedup and 7% tighter bounds over prior work. We evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, addressing a key challenge in the certification of learned models. Data and code are available at https://github.com/jeangud/homography-verification.

Abstract:
Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones, raising concerns about their potential misuse and posing substantial challenges for detecting them. Many existing detectors rely on reconstruction error -- the difference between the input image and its reconstructed version -- as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error -- a second-order difference -- for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI. Code is available at https://github.com/Qixinyi1122-lucky/DID.

Abstract:
Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable and achromatic lighting conditions . In contrast, scenes recorded under temporally varying color light, as in "disco lights" commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end-to-end framework that does not rely on any prior knowledge, such as color values, ambient lighting conditions, or scene properties. It effectively handles both global and spatially localized transient color variations. It also enables controllable brightness manipulation of the canonical scene, facilitating applications such as simulating low-light and well-lit scene conditions. To the best of our knowledge, Disco-GS is the first method to simultaneously perform 3D scene reconstruction and canonical appearance recovery from inputs captured under artificially varying, disco-style colored light. To enable quantitative and qualitative evaluations, we also introduce the Disco dataset, a collection of 25 videos of real-world scenes exhibiting diverse and random color variations. Extensive experiments demonstrate the robustness and fidelity of Disco-GS. The dataset is available at: https://github.com/akumar005/Disco-GS.

Abstract:
A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval (\Delta t). This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps.Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to (4096x4096).

Abstract:
Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.

Abstract:
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities.To address these challenges, we propose Yume1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation method combining unified context compression and linear attention; (2) a context compression-based bidirectional attention distillation approach with an enhanced text embedding scheme for real-time streaming video generation. Yume1.5 achieves an average generation speed of 12 fps at 540p resolution using only a single A100 GPU; (3) a text-controlled method for generating world events. Github: https://github.com/stdstu12/YUME.

Abstract:
Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

Abstract:
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code is released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

Abstract:
Dense scenes containing numerous tiny objects pose a fundamental challenge for segmentation models, where small localization errors can significantly degrade downstream measurements. We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. SARD builds a structure-importance map by integrating boundary salience, geometric complexity, and local feature variation, and uses it to guide a unified representation loss that combines feature consistency with distribution alignment. This encourages the student to allocate capacity to geometrically informative regions while preserving global context. Experiments on Cityscapes, ADE20K, and a challenging rock fragmentation benchmark (RockFrag) show that SARD consistently improves both mIoU and boundary IoU over strong distillation baselines; on RockFrag, SARD improves a Swin-T student over CWD by +4.3 mIoU and +6.7 bIoU. A ResNet-50 student distilled from a Swin-L teacher achieves up to 7.7 times parameter reduction and 9 times higher throughput than the teacher, with no additional inference overhead beyond the student network, demonstrating that structure-aware representation distillation is effective and efficient for tiny-dense segmentation. Code is available at: https://github.com/liuuuuuuxuesong/SARD.

Abstract:
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark spanning 42 languages, 56 regional dialects and registers, and 189 countries, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. Our cross-lingual evaluations also reveal significant performance degradation when prompts or retrieved context are provided in non-English languages. The code, datasets, and evaluation protocols for M4-RAG are available as open-source at https://github.com/davidanugraha/M4-RAG.

Abstract:
Balancing performance trade-off on long-tail data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" in continual learning (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating the broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: https://gkp-gsa.github.io/.

Abstract:
We introduce RefTon, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTon streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTon leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTon achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design. \thanks Code and dataset is available at: https://github.com/360CVGroup/RefTon.

Abstract:
Diffusion Transformers (DiTs) achieve strong video generation performance but suffer from prohibitive computation cost due to dense spatiotemporal tokenization. Most existing works rely on uniform patchification, tokenizing non-overlapping spatiotemporal with a fixed patch size regardless of the underlying content. This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving the video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. A lightweight router predicts patch sizes directly from the latents encoded by 3D Variational Autoencoder (VAE), and is jointly optimized with the diffusion model through diffusion loss, an attention-guided saliency alignment loss, and a token-budget regularizer. Learnable patchify/unpatchify layers integrate seamlessly with standard DiT backbones, allowing flexible tokenization without architectural changes. Experiments demonstrate that DynaPatch can effectively reduce redundant computations while preserving fine details, achieving 1.3-1.8x acceleration with minimal quality degradation. On VBench, DynaPatch attains a Total Score of 83.42 at 30% token reduction, significantly outperforming prior patchification and token pruning approaches. These results indicate that content-aware patchification offers an effective direction for efficient and scalable video diffusion. Project page: https://shengli99.github.io/DynaPatch/.

Abstract:
We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.

Abstract:
With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation.Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at https://github.com/hongsong-wang/LIDA

Abstract:
Test-time adaptation (TTA) aims to enhance the cross-domain performance of pre-trained models by adapting to unlabeled test data.While most existing TTA methods rely on backpropagation (BP) for finetuning, BP-free methods such as zeroth-order (ZO) methods are more desired in practical on-device scenarios. ZO methods rely only on forward computation, which can largely reduce the complexity and memory overhead of on-device deployment.However, ZO methods suffer from much higher variance compared with first-order methods in estimating the gradient.To address this, we propose an improved ZO method to substantially boost the performance of ZO optimization based TTA.First, we provide an observation to reveal the persistent low-rank Hessian structure of the loss during the adaptation process. Based on this insight, we then propose a loss-landscape curvature-aware zeroth-order (CAZO) method, which leverages a sliding-average estimation of the diagonal Hessian to construct a covariance matrix for anisotropic perturbation sampling. CAZO operates by freezing pretrained weights and optimizing minimal adapter parameters via forward-only passes based gradient estimation, which can substantially reduce the memory overhead compared to BP-based methods. Extensive experiments demonstrate that CAZO significantly outperforms existing TTA methods, achieving state-of-the-art performance while maintaining an excellent balance between accuracy and memory efficiency. Code is available at https://github.com/Hollyming/CAZO.

Abstract:
Compositional video generation aims to synthesize multiple instances with diverse appearance and motion. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning before generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic, which can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Code is available at https://github.com/ZixuanWang0525/MF-CVG.

Abstract:
Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.

Abstract:
Single-image HDR reconstruction aims to recover high dynamic range radiance from a single low dynamic range (LDR) input, but remains highly ill-posed due to detail saturation in over-exposed regions and noise amplification in under-exposed areas. While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. Specifically, a soft exposure mask is first constructed to separate the LDR image into over-, under-, and well-exposed regions. Based on this partition, region-conditioned consistency trajectories are designed to hallucinate saturated details, suppress noise in dark regions, and preserve reliable structures within a single, distillation-free inference step. To further enhance perceptual quality, we introduce an Exposure-guided Luminance-Chromaticity Loss in the CIE \text L ^\text a ^\text b ^ space, which assigns exposure-aware weights to luminance and chromaticity components, effectively mitigating brightness bias and color drift. Extensive experiments on the HDR-REAL, HDR-EYE, and AIM2025 benchmarks demonstrate that ExpoCM achieves state-of-the-art fidelity and perceptual accuracy, while enabling over 400xand 20xfaster inference compared to DDPM (1000 steps) and DDIM (50 steps), respectively. Code is available at https://github.com/AoyuLiu01/ExpoCM.

Abstract:
Diffusion transformers have achieved remarkable progress in high-quality image and video generation, but their computational overhead remains a significant challenge. Existing token reduction-based acceleration techniques, such as caching and merging, attempt to reduce this cost from both temporal and spatial perspectives, but often compromise generation quality by introducing non-updated or non-self denosing directions. In this paper, we propose Residual Caching (ResCa), a novel, training-free framework that introduces a proxy denoising perspective to overcome these limitations. ResCa achieves acceleration while maintaining a denoising trajectory that is both self and updated. The core idea is to perform true denoising on only one proxy token within each trajectory-based cluster, and use its computed multi-order residuals to guide the simulated denoising of all other tokens. ResCa can be seamlessly integrated into various diffusion models, including DiT, FLUX, and HunyuanVideo. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method, achieving up to a 5.5 times acceleration in GFLOPs while maintaining near-lossless generation quality on FLUX. Project page: https://fanghaipeng.github.io/ResCa.

Abstract:
Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. While extensive work has focused on designing effective scoring functions for OOD detection, relatively few studies explore training neural networks with calibration-oriented objectives, which often compromise predictive accuracy and restrict the choice of scoring functions. In this work, we first identify feature collapse in Logit Normalization (LogitNorm), and then propose a novel hyperparameter-free training formulation that significantly improves a wide range of post-hoc detection methods. Specifically, we introduce a feature distance-aware normalization objective, termed ELogitNorm, which enhances both OOD detection performance and in-distribution (ID) confidence calibration. Extensive experiments on standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while preserving strong ID classification performance. Our code is available at: https://github.com/limchaos/ElogitNorm.

Abstract:
Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-framerate stereo vision via temporal optical modulation. The modulation directs light from two views onto a single sensor in a mixed manner, while periodically attenuating one view at 60 Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. The high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling. Consequently, we introduce a two-stage decoding methodology for achieving high-quality stereo vision: An efficient least-squares-based baseline reconstruction followed by a deep learning refinement module. Experimental results demonstrate that our approach achieves 240FPS binocular video reconstruction with superior accuracy compared to monocular systems, while maintaining the hardware compactness and data efficiency. Code is available at https://github.com/yongqiye00/MonoSpikeStereo.

Abstract:
The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.

Abstract:
Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.

Abstract:
Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models. Code is available at https://github.com/Qihuai27/phasewin-search.

Abstract:
Text-guided diffusion models have advanced image editing by enabling intuitive control through language. However, despite their strong capabilities, we surprisingly find that SOTA methods struggle with simple, everyday transformations such as rain or blur. We attribute this limitation to weak and inconsistent textual supervision during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We argue that diffusion-based editing capabilities aren't lost but merely hidden from text. The door to cost-efficient visual editing remains open, and the key lies in a vision-centric paradigm that perceives and reasons about visual change as humans do, beyond words. Inspired by this, we introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns conditioning signals directly from visual examples for precise, language-free image editing. Given a paired example--one image with and one without the target effect--VDC derives a visual condition that captures the transformation and steers generation through a novel condition-steering mechanism. An accompanying inversion-correction step mitigates reconstruction errors during DDIM inversion, preserving fine detail and realism. Across diverse tasks, VDC outperforms both training-free and fully fine-tuned text-based editing methods. The code and models are open-sourced at https://omaralezaby.github.io/vdc/

Abstract:
Conformal prediction (CP) is a powerful framework for uncertainty quantification, generating prediction sets with coverage guarantees. Split conformal prediction relies on labeled data in the calibration procedure. However, the labeled data is often limited in real-world scenarios, leading to unstable coverage performance in different runs. To address this issue, we extend CP to the semi-supervised setting and propose SemiCP, a new paradigm that leverages both labeled and unlabeled data for calibration. To achieve this, we introduce an unlabeled nonconformity score, Nearest Neighbor Matching (NNM) score. Specifically, NNM estimates the nonconformity scores of unlabeled samples using their most similar pseudo-labeled counterparts during calibration, while maintaining the original scores for labeled data. Theoretically, we demonstrate that the average coverage gap (i.e., the absolute difference between the empirical marginal coverage and the target coverage) of SemiCP can decrease significantly at a rate O(N-1t2) and converge to an error term, where N is the number of unlabeled data. Extensive experiments validate the effectiveness of SemiCP under limited labeled data, reducing the average coverage gap by up to 77% on common benchmarks with 4000 unlabeled examples, when there are only 20 labeled examples. Code is available at https://github.com/Shinning-Zhou/SemiCP.

Abstract:
Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at https://github.com/stanfordmlgroup/spf .

Abstract:
Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of weak-to-strong principle. Based on this, we propose a hybrid instantiation called SEG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SEG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SEG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings. Code is available at https://github.com/Westlake-AGI-Lab/SGG

Abstract:
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world. Project page: https://groolegend.github.io/EgoSound/.

Abstract:
We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at https://desope-6d.github.io/.

Abstract:
Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain.Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive counts, better matching objects' true spatial footprint.Experiments confirm that our method excels at structured dynamic modeling, providing both structured geometry and precise motion tracking. Project page: https://zju3dv.github.io/d-prism/.

Abstract:
While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of high sampling complexity. To overcome these challenges, we propose S2VC, a Single Step diffusion-based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. This guidance replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average bitrate saving of 51.62% over prior perceptual method, underscoring the promise of single-step diffusion for efficient, high-quality video compression. Project: https://onedc-codec.github.io/s2vc/

Abstract:
Existing literature typically treats various customized generation tasks (e.g., subject-customized generation, style-customized generation) as distinct and disjoint problems, with each task focusing solely on customizing a specific aspect of the reference image. However, we argue that the objectives of these different customization tasks are inherently complementary and can be mutually enhanced within a unified framework, as they fundamentally involve the disentanglement of multiple feature aspects from the reference image. To this end, we introduce USO, a Unified Simultaneous Optimization framework to simultaneously unify different customized tasks (i.e., subject and style). Specifically, USO introduces a cyclical data-model framework that connects these two tasks by a subject-for-style data curation pipeline and a style-for-subject model training pipeline. The subject-for-style data curation pipeline leverages a state-of-the-art subject-customized model to generate high-quality triplet data comprising content images, style images, and their corresponding stylized content images. Building on this foundation, the style-for-subject model training pipeline introduces an auxiliary style reward to simultaneously align style and content features, thereby reinforcing the model's ability to extract the desired style or content features from the reference image. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models, excelling in both subject consistency and style similarity. Code and model: https://github.com/bytedance/USO.

Abstract:
Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.

Abstract:
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, our work will greatly stimulate future research on VQ-VA. Our project is publicly available at https://github.com/chenhuigou/VQ-VA-World.

Abstract:
Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-\alpha, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: https://github.com/leaves162/SODA.

Abstract:
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we conduct a holistic assessment of the spatial understanding abilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Concretely, we make the following contributions: (i) we propose SpatialScore, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and question-answering formats with approximately 5K manually verified samples across 30 distinct tasks; (ii) we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning Qwen3-VL on spatial understanding; (iii) we develop SpaitalAgent, a multi-agent system incorporating 12 specialized spatial perception tools, supporting both Plan-Execute and ReAct reasoning paradigms, enabling to improve spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.

Abstract:
Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a latent residual refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: https://github.com/Ar0Kim/FiDeSR.

Abstract:
Recent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes. Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve appearance details, we design a region-aware enhancement mechanism that applies targeted multi-view and monocular feature aggregation to resolve ambiguities in both overlapping and non-overlapping areas. We validate iSplat's robustness and generalization on in-domain (RealEstate10K, ACID) and cross-dataset (DTU, ACID) benchmarks. With only 42.6M parameters, iSplat surpasses DepthSplat (354M) on RealEstate10K (PSNR: 27.67 vs. 27.47 dB). Crucially, on the cross-dataset DTU benchmark, it further boosts the PSNR by 2.88 dB (18.26 vs. 15.38 dB), showcasing exceptional generalization. These results highlight the significant potential of iterative refinement to overcome the inherent limitations of one-shot approaches. Our code is available at https://github.com/haifengwu205/iSplat.

Abstract:
Few-shot Font Generation (FFG) aims to create a complete font from a limited number of references, offering significant practical value. However, existing methods neglect glyph spatial information, which leads to two critical limitations. At the pipeline level, distorted rendering introduces spatial bias, impairing vectorization and dataset quality, and this problem is compounded by the lack of unified standards, which undermines a unified benchmark. At the model level, the implicit coupling of shape and position hinders fine-grained optimization and generalization. We address these challenges in the context of Chinese font generation, where glyph complexity demands superior model capability. Consequently, we first propose a Spatial-Preserving Rendering (SPR) scheme, which eliminates spatial bias and enables accurate vectorization. Alongside, we release an OFL-licensed Chinese font dataset to establish a unified benchmark. Then, technically, we propose GlyphSpatialNet, a two-stage framework to explicitly model glyph spatial information in pixel space. In first stage, we design a Shape-Position Decoupling (SPD) architecture and a Gradient Broadcasting Module (GBM) to achieve font style transfer in low resolution. In second stage, we design Style Detail Enhancement (SDE), which refines the style details for high resolution outputs. Extensive experiments demonstrate the effectiveness of our approach. Code and dataset are available at https://github.com/sp777g/GlyphSpatialNet.

Abstract:
Reinforcement fine-tuning (RFT) is increasingly used to strengthen the reasoning abilities of large models, yet its effectiveness is bound by how training data are selected and used. Most data-centric RFT methods rely on static or heuristic sample selection, implicitly assuming a sample's value is fixed over training. This overlooks the non-stationary dynamics of policy learning and can lead to suboptimal updates. We propose Dynamic Important Example Mining (DIEM), a principled and fully automated framework that makes data utilization adaptive throughout RFT. DIEM integrates two components into each optimization step: (i) a gradient-alignment importance estimator that efficiently approximates each sample's marginal contribution to policy improvement; and (ii) a constrained batch reweighting scheme that maximizes aggregate utility while preserving the update's gradient magnitude to stabilize optimization. Across several reasoning benchmarks, DIEM consistently outperforms strong static and dynamic baselines. The code will be released via https://github.com/hrtan/DIEM.

Abstract:
Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing the features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.

Abstract:
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR

Abstract:
Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.

Abstract:
Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled--low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.

Abstract:
Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic--the only source currently feasible to produce at scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames. We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement--non-rigid deformations, articulated motion, and frequent occlusions--AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency. A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, a challenging general-domain benchmark for tracking any point on diverse rigid and non-rigid objects (e.g., humans, animals, robots, and vehicles). Our approach outperforms recent self-training methods trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs. AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking.

Abstract:
Story visualization aims to generate coherent image sequences that faithfully represent a narrative and match given character references. Despite progress in generative models, existing benchmarks remain narrow in scope, often limited to short prompts, lacking character references, or single-image cases, failing to reflect real-world narrative complexity and obscuring true model performance. We introduce ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across varied narrative structures, visual styles, and character settings. It features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans for coherence and fidelity. Character references are carefully curated to maintain consistency across different artistic styles. ViStoryBench proposes a suite of multi-dimensional automated metrics to evaluate character consistency, style similarity, prompt alignment, aesthetic quality, and artifacts like copy-paste behavior. These metrics are validated through human studies and used to assess a broad range of open-source and commercial models, enabling systematic analysis and encouraging advances in visual storytelling.

Abstract:
When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

Abstract:
Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation-model, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA finetuning the down-projection factor (A) in (\Delta W = BA) compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness. Our code is available at https://github.com/wonyoung01/nsc_merging.

Abstract:
We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a K-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse -- we focus on K\!\leq\!10 database images. The code is available at https://github.com/ludvigdillen/ON3R.

Abstract:
Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360^ \circ scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360^ \circ panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.

Abstract:
Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

Abstract:
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into Masked Autoencoder to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code, data, and models are available at https://github.com/MiliLab/SARMAE.

Abstract:
Universal Multimodal Embeddings (UMEs) aim to unify various modalities and tasks into a shared representation space. In recent years, this field has witnessed substantial progress driven by the development of Multimodal Large Language Models (MLLMs). However, a crucial capability, visual identity discrimination, remains underexplored in existing UME methods, despite its critical role in a wide range of tasks, including instance retrieval, re-identification, and identity preservation in AI-generated content. To bridge this gap, we propose a unified formulation for visual identity discrimination (VisID) and introduce MVEB (Multimodal Visual Identity Embedding Benchmark), a large-scale benchmark curated from both real-world and synthetic datasets to support evaluation and training. Furthermore, we present a simple yet effective learning framework that jointly optimizes general multimodal and visual identity representations through a carefully designed identity-aware sampling mechanism. Extensive experiments demonstrate that our approach successfully endows UMEs with strong identity discrimination capability and maintains competitive general multimodal performance. We believe this work not only illuminates a critical yet neglected capability, but also takes a step toward more holistic universal multimodal embeddings. Code and data are available at \href https://chrisclear3.github.io/MVEB MVEB .

Abstract:
Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (three-line injection) and readily adapts existing per-sample loss-based methods (one-line proxy). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune 20%-50% of samples across 14 datasets, 11 tasks and 18 models, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access. Code is available at https://github.com/mrazhou/BLS.

Abstract:
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6 v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/kesai-labs/lead.

Abstract:
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding. Model and code are available at https://anakin-skywalker-joseph.github.io/POINTS-Long-Webpage.

Abstract:
Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.

Abstract:
Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, c_deg. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs c_deg without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.

Abstract:
Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive; conventional approaches typically estimate static depth maps anchored to the nearest surface, and even recent multi-head extensions suffer from a representational bottleneck due to fixed feature representations. This stands in contrast to human vision, which actively shifts focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as condition-aware control. Instead of extracting fixed features, our model dynamically modulates its computation based on a physical reference depth, integrating dual conditional mechanisms to selectively perceive geometry aligned with the desired focus. Leveraging a newly curated large-scale synthetic dataset, DepthFocus achieves state-of-the-art results across all evaluated benchmarks, including both standard single-layer and complex multi-layered scenarios. While maintaining high precision in opaque regions, our approach effectively resolves depth ambiguities in transparent and reflective scenes by selectively reconstructing geometry at a target distance. This capability enables robust, intent-driven perception that significantly outperforms existing multi-layer methods, marking a substantial step toward active 3D perception. \noindent Project page: \href https://junhong-3dv.github.io/depthfocus-project/ this https URL .

Abstract:
Although convolutional neural networks (CNNs) have continued to evolve in recent years, Transformers have become increasingly popular in the field of computer vision. In this work, we introduce a mechanism for CNNs to aggregate information based on content similarity--an ability analogous to the self-attention mechanism. We achieve this by proposing a sorting operation that transforms the feature similarity between tokens into relative positional information: specifically, rearranging tokens such that tokens with higher feature similarity are placed closer together. This approach allows convolution operations to be indirectly transformed into an aggregation mode driven by content similarity. Experiments show that our proposed model, named Ego, achieves promising results across various tasks, underscoring the untapped potential of CNNs. Code and models are publicly available at https://github.com/essenceoftheworld/ego.

Abstract:
Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model's discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

Abstract:
In endoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degenerations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. SurgClean comprises 3,113 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance for 22 representative generic task-specific image restoration approaches, including 12 generic and 10 task-specific image restoration approaches. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms and improve the efficiency of clinical procedures. Sources can be available at: https://github.com/PJLallen/Surgical-Image-Restoration.

Abstract:
Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations available at https://github.com/HarryAnthony/Invisible_Gorilla_Effect.

Abstract:
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment. The code is available at https://github.com/SHWplus/OA-VAT.

Abstract:
Video Super-Resolution (VSR) fundamentally struggles with a critical trade-off: single-step models offer unmatched efficiency but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel "pseudo" single-step VSR framework that transcends this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR lies in its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step, establishing the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to exclusively inject high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. By doing so, PS-SR creates the "illusion" of a single-step model--delivering the similar inference speeds and input-output content consistency--while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our "pseudo-single-step" paradigm achieves state-of-the-art quality with a comparable speed to single-step models, paving the way for real-time, high-fidelity video enhancement. Please refer to our project page for more results: https://waq2001.github.io/PS-SR-page/.

Abstract:
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Extensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks. Project Page: https://livioni.github.io/OmniVGGT-official/

Abstract:
Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi-task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings. The implementation code is available at https://github.com/Tobeginwith/DC-Merge.

Abstract:
Text-driven instruction-based video editing in complex scenes remains challenging: purely textual prompts often fail to capture precise spatial relationships and physical constraints, resulting in target ambiguity and physically implausible outcomes. To address this, we propose a plan-guide-edit framework that explicitly bridges semantic intent and spatial execution. In our framework, a Chain-of-Thought (CoT)-enhanced multimodal large language model (MLLM) serves as a planner, performing structured reasoning over the video and instructions to derive a precise sequence of bounding boxes and attribute-enriched editing directives. These spatial priors then guide a box-conditioned mask generator, transforming ambiguous global retrieval into localized, context-aware refinement and producing masks that more accurately capture object scale, contact relationships, and placement. Building on these spatial and semantic signals, a diffusion-based editor integrates the masks, enriched instructions, and frame features to render high-fidelity edits that remain temporally coherent and spatially well aligned. Trained first in a modular manner and then jointly, our framework achieves superior performance with reduced data requirements, delivering precise localization in scenes with multiple similar objects and physically consistent object additions, and extensive experiments demonstrate state-of-the-art performance over multiple strong baseline methods. More details are available at: https://github.com/flying-sky999/CoT-Edit

Abstract:
On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion. The code is publicly available at https://github.com/ynanwu/TALON.

Abstract:
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agonistic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plucker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at https://raynova-ai.github.io/.

Abstract:
Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose PDD (Manifold-Prior Diverse Distillation), a novel framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a Manifold Matching and Unification (MMU) module, while an Inter-Level Feature Adaption (InA) module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via InA for local consistency, while the other receives skip-projected representations through a Manifold Prior Affine (MPA) module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that PDD significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8%, 2.9%, and 8.5% in terms of AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4% in terms of F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection. The implementation will be released at https://github.com/OxygenLu/PDD.

Abstract:
Visual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query localization. To facilitate comparison for subsequent research, we implement a series of representative 3D multimodal VQL baselines using PC and RGB. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift and attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at our webpage https://github.com/wuhengliangliang/3DVQL.

Abstract:
In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the main challenges in preserving geometric fidelity and photorealism to the scarcity of large-scale, high-quality real-world panoramic data, in contrast to prior methods that emphasize model design. Basically, DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.

Abstract:
Embodied agents face a fundamental limitation: once deployed in real-world environments, they cannot easily acquire new knowledge to improve task performance. In this paper, we propose Dejavu, a general post-deployment learning framework that augments a frozen Vision-Language-Action (VLA) policy with retrieved execution memories through an Experience Feedback Network (EFN). EFN identifies contextually relevant prior action experiences and conditions action prediction on the retrieved guidance. We train EFN with reinforcement learning and semantic similarity rewards, encouraging the predicted actions to align with past behaviors under the current observation. During deployment, EFN continually expands its memory with new trajectories, enabling the agent to exhibit "learning from experience." Experiments across diverse embodied tasks show that EFN improves adaptability, robustness, and success rates over frozen baselines. Our Project Page is https://dejavu2025.github.io/.

Abstract:
Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.

Abstract:
Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present DriveMVS, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. Built upon these principles, DriveMVS embeds the LiDAR prompt in two ways: as a hard geometric prior that anchors the cost volume, and as soft feature-wise guidance fused by a triple-cue combiner. Regarding temporal consistency, DriveMVS employs a spatio-temporal decoder that jointly leverages geometric cues from the MVS cost volume and temporal context from neighboring frames. Experiments show that DriveMVS achieves state-of-the-art performance on multiple benchmarks, excelling in metric accuracy, temporal stability, and zero-shot cross-domain transfer, demonstrating its practical value for scalable, reliable autonomous driving systems. Code: https://github.com/Akina2001/DriveMVS.git.

Abstract:
We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyze the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.

Abstract:
Despite the remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose InvAD, a novel inversion-based anomaly detection approach for detection via noising in latent space, which circumvents explicit reconstruction. Importantly, we contend that the limitations of prior reconstruction-based methods originate from the prevailing "detection via denoising in RGB space" paradigm. To address this, we model AD under a reconstruction-free formulation that directly infers the final latent variable corresponding to the input image via DDIM inversion and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, when approximating the original probability flow ODE using the Euler method, we use only a few inversion steps to noise the clean image and improve inference efficiency. As the added noise is adaptively derived from the learned diffusion model, the original features of the clean test image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analysis across four widely used AD benchmarks under the unsupervised unified setting to demonstrate the effectiveness of our model, achieving state-of-the-art AD performance and about a twofold inference-time speedup without diffusion distillation.

Abstract:
We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.

Abstract:
Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available at https://github.com/caoke-963/ScaleFormer.

Abstract:
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning 1K+ anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://github.com/MIC-DKFZ/VoxTell

Abstract:
Semantic route planning involves generating itineraries that align with user intent while respecting real-world spatial constraints. However, text-only large language models (LLMs) often hallucinate geographically implausible routes due to poor spatial grounding. Inspired by how humans use maps for route planning, we propose the SMAP, which is the first multimodal framework combining user queries, POI metadata, and map tiles to produce spatially coherent, preference-aware routes. To enhance the spatial consistency, the SMAP features a two-stage anti-hallucination mechanism: (1) a map-grounded self-editing pipeline where a multimodal LLM (MLLM) drafts routes and a second MLLM verifies and refines them using geographic evidence; and (2) hallucination-penalized Direct Preference Optimization (HDPO) that steers the route generator toward spatially plausible routes by using verified routes as accepted responses and hallucinated drafts as rejected ones. Additionally, we introduce MM-Route, the first multimodal dataset for semantic route planning, with 3,000 diverse queries annotated with POI metadata and map tiles, covering a broad spectrum of geographic granularities and user intents. Experimental results demonstrate that SMAP significantly reduces geographical hallucinations and outperforms strong baselines in spatial plausibility and user alignment. Our project is at https://amap-mobility-intelligence.github.io/SMAP/.

Abstract:
Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

Abstract:
Current few-shot action recognition methods achieve impressive performance by learning representative prototypes and designing diverse video matching strategies. However, these approaches typically face two critical limitations: i) prototypes learned through implicit sample interactions lack clear semantic correspondence between query-support pairs, limiting their class representativeness; ii) the independent design of prototype learning and matching mechanisms creates a potential incompatibility between prototype representations and matching strategies. To address these limitations, we propose a Match-guided Prototype Learning (MPL) method comprising two key components: enhanced match (E-Match) and key-frame extraction match (K-Match). E-Match explicitly enhances prototype learning in class-specific embeddings by incorporating the matched semantics of query samples, while K-Match further refines the prototype representation through key-frame matching at the fine-grained frame level. Additionally, we propose a Cross-Shot Attention Aggregator (CSA-Aggregator) that dynamically aggregates adjacent frames across support samples, thereby obtaining a prototype representation that captures intra-class shared action patterns. In this way, the proposed MPL effectively mines coarse-to-fine, match-guided semantic information from query-support pairs to generate discriminative class prototypes, and improve the compatibility of prototype representation with the match mechanism. Extensive evaluations on four public datasets confirm that MPL achieves superior performance over leading few-shot action recognition techniques. The source code is available at https://github.com/jayzh-research/MPL-FSAR.

Abstract:
Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available in https://github.com/delta-lab-ai/FluidGaussian.

Abstract:
Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings --- over-shifting or inconsistently retaining colors --- leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot --- a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our codes are available at \href https://github.com/Jinwon-Ko/CanonCGT https://github.com/Jinwon-Ko/CanonCGT .

Abstract:
Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and contextlength O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.

Abstract:
Recovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be safely incorporated into bundle adjustment alongside background features. To mitigate noise in global human estimates, HumanBA applies motion refinements and motion-aware reliability weighting. Across EMDB and SLOPER4D benchmarks, we show consistent improvements on camera pose estimation and reduce global human reconstruction error, demonstrating the benefits of treating humans as dynamic yet informative landmarks. Our code is available at https://github.com/MartaYang/HumanBA.

Abstract:
To achieve compactness or cost reduction for optical lens systems, designers typically rely on commercial software to design lens systems independently of post-processing algorithms, leading to excessive dependence on designers' expertise and often requiring significant time. Recently, joint optimization approaches utilizing differentiable ray tracing have emerged, demonstrating significant potential in lens design tasks. However, existing pipelines lack accurate and efficient diffraction modeling capabilities for miniature optical systems. In this work, we propose a novel lens component deletion pipeline for miniature optical systems, which automatically deletes the suitable lens component, and then optimizes both the lens system and the post-processing network to achieve joint aberration correction. Additionally, we introduce a novel metric for evaluating the contribution of each lens component within an optical system, aimed at identifying the lens component that has the least impact on the system. We also develop an efficient differentiable point spread function estimation method based on the Rayleigh-Sommerfeld diffraction model, significantly reducing GPU memory consumption. Our proposed pipeline achieves lens component deletion while maintaining imaging quality comparable to the original lens system, thereby enabling the compactness or cost-effective optimization of miniature optical systems. Project page: https://github.com/WenguanZhang/HappyLens

Abstract:
In the deep-learning-based computer vision community, Neural Architecture Search (NAS) has become the de-facto tool for acquiring task-optimal network structures. Nevertheless, NAS methods are trapped in a fundamental accuracy-efficiency dilemma: training-based approaches deliver reliable performance but incur prohibitive search costs, whereas training-free strategies are ultra-fast but often yield relatively unreliable rankings. To reconcile this conflict, we propose a vision-oriented lightweight training-based NAS framework. We first design six micro vision tasks whose training time is negligible, yet together they probe a broad spectrum of representational capacities. Built upon these tasks, we introduce a budget-adaptive performance evaluator to produce the most accurate ranking attainable within the limit. Experiments on popular NAS benchmarks show that our method achieves a ranking correlation higher than existing methods. Furthermore, we construct a search space from prevalent neural blocks and run our method at a cost close to training-free methods; the discovered architecture surpasses the current state-of-the-art under identical training recipes. Our codes are available at https://github.com/fanyi-plus/tf-nas.

Abstract:
Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black-box attacks focus on vector-based or discrete-action RL, and their effectiveness on image-based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real-world queries. Through a two-stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black-box and white-box methods. The code is available at https://github.com/tairanhuang/seba online.

Abstract:
Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.

Abstract:
Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments show that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P images, while maintaining competitive decoding speed compared to existing methods. Code is available at https://github.com/LuizScarlet/AEIC.

Abstract:
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.

Abstract:
Vision Foundation Models (VFMs) provide rich and transferable representations through large-scale pretraining, yet their high-capacity representations remain underutilized when adapted to downstream tasks. In Domain Generalization Semantic Segmentation (DGSS), parameter-efficient fine-tuning (PEFT) often overfits adapters to source domain statistics and seen class boundaries, leading to representation degradation manifested as domain bias and semantic rigidity. Existing regularization strategies alleviate this through random perturbations, but such operations disrupt the pretrained geometric structure, causing semantic drift and unstable generalization. We propose Geometry-Consistent Regularization (GeCo), which extrapolates the pretrained representation space toward the target task under structure-respected constraints, thereby preserving the inherent generalization of VFMs while enhancing their task-specific adaptation. GeCo introduces curvature-guided perturbation to modulate feature variation according to local manifold complexity of the pretrained embedding space, enabling structure-aligned representation expansion. Complementarily, a geodesic-based regularization constrains prediction shifts along smooth, manifold-aligned trajectories, ensuring semantic continuity and stable decision behavior. Extensive experiments demonstrate that GeCo achieves superior generalization across both closed-set and open-set DGSS benchmarks. The code is available at https://github.com/DZhaoXd/GeCo.

Abstract:
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.

Abstract:
Vision-Language Models (VLM) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), as VLM's widely used reasoning strategy, is facing critical challenges. Existing textual CoT has a large gap between text semantic space and trajectory physical space. Although the recent approach utilizes future image to replace text as CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these, we innovatively propose MindDriver, a progressive multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving.MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuningmethod to optimize the alignment through progressive high-level reward-based learning.MindDriver demonstrates superior performance in both nuScences open-loop and Bench2Drive closed-loop evaluation. Codes: https://github.com/hotdogcheesewhite/MindDriver.

Abstract:
Parameter-efficient fine-tuning (PEFT) has become a compelling approach for adapting large language models (LLMs) into multimodal large language models (MLLMs), enabling them to handle diverse modalities with substantially lower memory and computational costs. However, most existing PEFT methods neglect the issue of modality-imbalanced learning, which is characterized by the excessive dominance of text modality in updating parameters, thus incurring insufficient learning of non-text modalities and leading to performance degradation. To address this issue, we propose a novel parameter-efficient adaptation method for MLLMs, namely Implicit Modality Decomposition (IMoD), based on LoRA. It firstly decomposes the learnable parameters into the non-overlapped text-specific, non-text-specific and modality-sharing components, thereby alleviating modality imbalance. To further guide the optimization of these components toward specific modalities, we propose Modality-Specific Decoupling Constraint that suppresses cross-modal interference among modality-specific parameters, and Modality-Agnostic Alignment Constraint that encourages modality-sharing component to capture well-aligned, modality-invariant semantics. Extensive experiments across diverse multimodal settings and LLM architectures demonstrate that our method consistently delivers significant performance gains, particularly achieving an averaged 3.3% improvements on the audio-visual-text tasks without sacrificing the parameter and inference efficiency. Code is available at https://github.com/mmffzzz/IMoD.git.

Abstract:
Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal models compared to diffusion methods.However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality.To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoregRessive models.SCAR introduces Compressed Semantic Prefilling and Semantic Alignment Guidance that jointly enhance contextual understanding and generation coherence. Unlike prior methods that rely on sparse visual tokens or decoding stage injection, SCAR enables strong semantic guidance from the input stage, while remaining model-agnostic and applicable to both next-token and next-scale AR paradigms.Extensive experiments on instruction editing and controllable generation demonstrate that our method significantly improves visual fidelity and semantic alignment, outperforming existing AR-based methods while maintaining controllability. Code will be released at https://github.com/AMAP-ML/SCAR.

Abstract:
Underwater instance segmentation is essential for fine-grained scene understanding. However, underwater imagery exhibits a strong domain gap from in-air vision due to severe degradation (e.g., turbidity). Consequently, despite its general segmentation ability, SAM degrades sharply underwater. In this work, we propose BiPA, which effectively adapts SAM to the underwater domain. To be concrete, we construct an underwater SAM with dual prompts and introduce a foreground-attentive injection block to enhance local foreground representation. We formulate dense prompt learning as a bilevel optimization, explicitly capturing the mutual dependency between prompt and model. To make this tractable, we design a two-stage learning strategy. The first stage adapts the dense prompt itself, updating it with Bayesian optimization to learn efficiently. The second stage fine-tunes the model parameters under the frozen optimized prompt, which finally enables effective cross-domain adaptation. Extensive experiments and analyses verify the superiority and efficiency of BiPA. The code is publicly available at https://github.com/ZeAstra/BiPA.

Abstract:
Despite recent advances in multimodal reasoning, Multimodal Large Language Models (MLLMs) still struggle on complex tasks where initial visual perceptions can be misleading. This performance gap stems from a critical reasoning flaw we term Visual Inertia: while MLLMs excel at iterative reflection in textual contexts, they tend to uncritically commit to their initial visual interpretations and rarely revise them. To overcome this limitation, we introduce GThinker, an MLLM equipped with a novel adaptive visual rethinking capability. GThinker leverages Cue-Rethinking, a flexible reasoning pattern that not only grounds reasoning in visual cues but also strategically triggers a re-examination of these cues to resolve inconsistencies. To instill this capability, we introduce a novel two-stage training framework. It begins with a pattern-guided cold start, enhanced by a judge-guided selective mechanism to learn from failure cases, followed by incentive reinforcement learning. We further curate the GThinker-11K dataset to power the training with an iterative multimodal annotation pipeline. Extensive experiments demonstrate that GThinker significantly mitigates visual inertia during reasoning, achieving a leading 81.5% on the M3CoT benchmark, which is rich in such challenges, surpassing the powerful o4-mini model. Furthermore, GThinker shows consistent improvements across a range of multimodal reasoning benchmarks with an average gain of 2.1%, showcasing the broad benefits of equipping MLLMs with the ability to rethink both what they see and how they think. Data, model, and code are available at https://github.com/jefferyZhan/GThinker.

Abstract:
Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning. Code is available at https://github.com/phkhub/Chart-FR1.

Abstract:
Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2xfaster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit transformers for vision tasks. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.

Abstract:
Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles--substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.

Abstract:
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net)--a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

Abstract:
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion.To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions.Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.Through extensive validation experiments, we demonstrate SpatialVID's effectiveness across tasks such as controllable video generation, world simulation and geometric reconstruction, providing a strong foundation for spatial intelligence research.

Abstract:
This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry [??] and Oxford Spires [??] datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code will be available at https://zju3dv.github.io/scal3r.

Abstract:
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to pre-defined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-BENCH benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-BENCH is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced andfrontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We make the benchmark publicly available at https://github.com/apple/ml-sobench.

Abstract:
We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings. See more details in our project page: https://zijian-wu.github.io/uika-page/.

Abstract:
Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Our code is publicly available at https://github.com/7HHHHH/VisualAD.

Abstract:
In Vision-Language-Action (VLA) models, action chunking (i.e., executing a sequence of actions without intermediate replanning) is a key technique to improve robotic manipulation abilities. However, a large chunk size reduces the model's responsiveness to new information, while a small one increases the likelihood of mode-jumping, jerky behavior resulting from discontinuities between chunks. Therefore, selecting the optimal chunk size is an urgent demand to balance the model's reactivity and consistency. Unfortunately, a dominant trend in current VLA models is an empirical fixed chunk length at inference-time, hindering their superiority and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. Extensive experiments on a wide range of simulated and real-world robotic manipulation tasks have demonstrated that our approach substantially improves performance over the state-of-the-art alternatives. The videos and source code are publicly available at https://lance-lot.github.io/adaptive-chunking.github.io.

Abstract:
We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks. The code is available at https://github.com/cvlab-stonybrook/TCO.

Abstract:
For an unlabeled dataset containing both known and unknown categories, Generalized Category Discovery (GCD) aims to classify known categories while discovering unknown ones. Although existing methods achieve promising results on coarse-grained datasets, they struggle in fine-grained scenarios. We observe that attention artifacts where attention maps exhibit abnormally high responses on a few tokens--significantly interfere with fine-grained GCD by causing the model to overemphasize global semantics and overlook discriminative local cues. To address this issue, we propose the Token-Aware Refinement (TAR) framework, a plug-and-play module that mitigates attention artifacts and enhances local information. Unlike conventional approaches that rely solely on the first token for classification, TAR leverages the entire token sequence to better capture fine-grained details. Extensive experiments demonstrate that TAR consistently improves performance across multiple benchmarks. Our code is available at https://github.com/VectorYangYiStar/TAR.

Abstract:
Universal Multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite its empirical success, they are primarily built on a "single-turn" formulation where each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple, related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments exhibit MuCo with a newly curated 5M multimodal multi-turn dataset (M3T), which yields state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities. Code and M3T are available at https://github.com/naver-ai/muco

Abstract:
Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher-student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments are conducted on multimodal VQA benchmarks by distilling a student Vision-Language Model from a large teacher VLM. The results demonstrate that Beta-KD consistently outperforms existing knowledge distillation methods. Code is available at https://github.com/Jingchensun/beta-kd.

Abstract:
Next-scale prediction paradigm visual autoregressive (VAR) models have demonstrated significant potential for image super-resolution. However, their practical application is constrained by a rigid, size-specific design. This limitation stems from their reliance on memorizing fixed, absolute scaling schedules, which necessitates a distinct model for each target resolution. We introduce DVAR, a Dynamic Visual AutoRegressive framework that overcomes this fundamental bottleneck. Instead of memorizing these rigid schedules, DVAR learns a canonical scaling dynamic. This dynamic effectively decouples the logic of relative scaling from the absolute target size, thereby preserving a single set of proportions between generative steps that can be applied uniformly to any size. Furthermore, we introduce a dynamic sampling scheduler to mitigate the teacher-forcing problem with negligible computational overhead. By leveraging the geometric proximity of visual tokens in the codebook, it efficiently simulates the model's predictive error distribution to bridge the training-inference gap. To our knowledge, DVAR is the first framework to grant VAR models size-flexibility, breaking their one-to-one dependency on a fixed resolution. Extensive evaluations demonstrate that DVAR achieves superior visual quality over existing Real-ISR methods, proving that a flexible, purely autoregressive approach is a viable path to state-of-the-art image super-resolution. Code is available at https://github.com/YuZheng9/DVAR.

Abstract:
Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity, particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze, the first large-scale off-axis gaze estimation dataset for VR, comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84deg mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15deg person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift

Abstract:
We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based representation that encodes point with a temporally consistent trajectory ID, a segmentation ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.

Abstract:
Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often rely on large models and multi-stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts.To address these, we propose Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at https://albrateanu.github.io/multinex.

Abstract:
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, Internal Benchmark) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

Abstract:
Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released.

Abstract:
Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process.However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: Whether MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can effectively localize divergent thinking steps and intervene in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.

Abstract:
The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce \change FreeScale , a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy \change identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate \change FreeScale 's effectiveness by scaling up the training of feedforward NVS models, achieving a \change notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. \change Project page: https://mvp-ai-lab.github.io/FreeScale .

Abstract:
Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project page: https://shawlyu.github.io/DyFN

Abstract:
Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat diffusion and Poisson equations on the sphere) and on diverse shapes and neural surface representations, our method achieves accuracy comparable to classical solvers while enabling a unified, end-to-end pipeline across neural and traditional surface representations. Our source code and project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/.

Abstract:
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to understand textual instructions, perceive visual observations, and reason over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack visual perception and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods. Code is available at https://github.com/Fantasy-AMAP/fantasy-vln.

Abstract:
We revisit video hallucination in multimodal large language models (Video-MLLMs) from a semantic aggregation perspective. While prior work attributes hallucinations to language priors, missing frames, or visual encoder biases, these explanations overlook errors arising during the aggregation of correct frame-level semantics into event-level interpretations. We term this phenomenon Semantic Aggregation Hallucination (SAH), which becomes increasingly prevalent in complex, multi-event video understanding tasks with rich temporal dependencies. To systematically study SAH, we introduce ELV-Halluc, the first benchmark designed for fine-grained evaluation of semantic aggregation errors. Our experiments reveal that SAH correlates with both semantic complexity and rapid semantic transitions. We further propose mitigation strategies: improved positional encoding preserves temporal structure, and reinforcement learning such as DPO enhances the model's ability to distinguish semantics within and across events. Using a curated 8K adversarial video-text pair dataset, our approach achieves consistent gains across benchmarks, including a 27.7% reduction in SAH rate on ELV-Halluc and Video-MME. Data and code are available at https://github.com/hlsv02/ELV-Halluc.

Abstract:
Recent years have seen impressive advances in text-to-image generation, with image generative or unified models, generating high-quality images from text. Yet these models still struggle with fine-grained color control, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess the color precision. Color is fundamental to human visual perception and communication, and critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities like interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for T2I color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular T2I models on GenColorBench reveal significant performance variation, indicating which color conventions models understand the best and exposing their failure modes. Furthermore, GenColorBench provides insights to guide future improvements in precise color generation. The benchmark is available at https://moatifbutt.github.io/gencolorbench/.

Abstract:
Multimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization.Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance.However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies.Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines.To address these issues, we propose an efficient and effective Complementary Prototype Mapping (CPMAD) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary priors, thereby guiding and disambiguating cross-modal mappings.The framework comprises three key components:(1) Consensus Extraction Module (CEM) learns a dynamic anchor, transforming multimodal features into anomaly-free consensus prototypes to improve cross-modal consistency and suppress latent anomalies;(2) Supplementary Query Module (SQM) employs a Complementary Residual Attention mechanism to capture the discrepancy between the consensus and modality-specific spaces, thereby exploring the most representative and discriminative cues as supplementary prototypes; and(3) Complementary Mapping Module adaptively integrates both prototypes to perform feature mapping.Extensive experiments demonstrate that CPMAD not only achieves superior performance in both full-data and few-shot settings across diverse industrial and medical scenarios but also maintains faster inference speeds and lower memory consumption compared to existing methods. Code: https://github.com/yuanzhao-CVLAB/CPMAD.

Abstract:
The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the "Benefit then Conflict" dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL) , a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL

Abstract:
Incremental Object Detection (IOD) aims to equip detectors with the ability to handle dynamic environments and emerging object categories, and the rise of vision-language models has substantially advanced this goal. However, existing studies often oversimplify real-world scenarios by assuming the incremental tasks come from a single general domain. To better investigate vision-language models under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continuously adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance between adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and implements a dynamic training pipeline to maintain a proper stability-adaptivity balance. DGS enables vision-language models to effectively handle task streams of various distribution shifts. Extensive experiments across three benchmarks demonstrate that DGS achieves SOTA performance, highlighting its robustness in diverse incremental learning scenarios. Code is available at https://github.com/Never-wx/dgs.

Abstract:
Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval on both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are available at: https://github.com/intelligolabs/StructXLIP.

Abstract:
Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/.

Abstract:
Accurate focus quality assessment (FQA) in fluorescence microscopy is challenging due to stain-dependent optical variations that induce heterogeneous focus behavior across images. Existing methods, however, treat focus quality as a stain-agnostic problem, assuming a shared global ordering. We formulate stain-aware FQA for fluorescence microscopy, showing that focus-rank relationships vary substantially across stains due to stain-dependent imaging characteristics and invalidate this assumption. To support this formulation, we introduce FluoMix, the first dataset for stain-aware FQA spanning multiple tissues, fluorescent stains, and focus levels. We further propose FluoCLIP, a two-stage vision-language framework that grounds stain semantics and enables stain-conditioned ordinal reasoning for focus prediction, effectively decoupling stain representation from ordinal structure. By explicitly modeling stain-dependent focus behavior, FluoCLIP consistently outperforms both conventional FQA methods and recent vision-language baselines, demonstrating strong generalization across diverse fluorescence microscopy conditions. Code and dataset are publicly available at https://fluoclip.github.io/.

Abstract:
Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text distributions, which implicitly learn similarity scores proportional to log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering. Our code is available at https://github.com/fumiyauchiyama/CLIP_Density_Ratio .

Abstract:
Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies tend to overfit small simulator datasets, achieving high success on training scenes but failing to generalize and often exhibiting unsafe behavior (frequent collisions). In our work, we are the first to show that a high degree of generalization to unseen categories in the open-vocabulary object goal navigation task can be achieved with a lightweight transformer model (130M parameters) using only RGB input. We introduce the OVSegDT approach, which has three key features. First, we add a goal binary mask encoder that grounds the textual goal and provides precise spatial cues. The second component is a proposed Entropy-Adaptive Loss Modulation (EALM) -- a per-sample scheduler that continuously balances imitation and reinforcement signals according to policy entropy, eliminating brittle manual phase switches. EALM reduces the sample complexity of training by 33% and cuts the collision count by 10% compared to the baseline. The final component improves the agent's navigation quality even under noisy predicted segmentation by combining an auxiliary segmentation loss with a reward function based on the area of the true goal mask during fine-tuning on predicted segmentation. On HM3D-OVON, our model achieves performance on unseen categories comparable to that on seen ones and establishes state-of-the-art results (44.7% SR, 20.6% SPL on val unseen) without using depth, odometry, or large vision-language models. Code is available at https://github.com/CognitiveAISystems/OVSegDT.

Abstract:
Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/ FEAST.

Abstract:
Infrared imaging is essential for perception in harsh environments. However, dynamically coupled degradation factors severely impair visual quality and downstream semantic accuracy. Although generative diffusion models provide strong image restoration priors, high computational cost and physical inconsistency limit their application in infrared sensing. To bridge these gaps, we reformulate infrared imaging as a single-step diffusion process, aligning degraded observations with trajectory latent states via dynamic timestep estimation to leverage timestep-specific diffusion priors for high-fidelity reconstruction. Meanwhile, we introduce a spectral regularization term to enforce thermal radiation constraints and ensure physical consistency. Subsequently, a task-aware low-rank adaptation mechanism is devised through dynamic prompting to enable efficient transfer across downstream infrared tasks. Experiments demonstrate our method surpasses existing approaches in restoration quality, semantic structure preservation, and task generalization. The code is available at https://github.com/csmty/InfraredIR.

Abstract:
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose b-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities--from full captions to sentences and phrases--and their corresponding visual regions. For each level of granularity, b-CLIP uses cross-attention to dynamically pool image patches into contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the b-Contextualized Contrastive Alignment Loss (b-CAL), which parameterizes the trade-off between query-specific matching and intra-image contextualization under both Cross-Entropy and Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that b-CLIP significantly improves dense alignment, achieving 91.8% T2I 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting state-of-the-art among methods trained without hard negatives. b-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at \href https://github.com/fzohra/B-CLIP https://github.com/fzohra/B-CLIP .

Abstract:
Existing Multi-Camera Multi-Target (MCMT) tracking models typically adopt a two-stage framework, involving single-camera tracking followed by inter-camera tracking. However, in this paradigm, the use of multiple views is confined to recovering missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose a novel global MCMT tracking framework termed GMT, which effectively leverages the advantage of multi-view by performing global-level trajectory-target matching. Specifically, instead of assigning trajectories independently for each view, we propose a Cross-View Feature Consistency Enhancement(CFCE) module to reduce the feature discrepancies across different views, and encode the same historical targets across different views as global trajectories. The Global Trajectory Associate (GTA) module is then introduced to associate new targets to global trajectories, allowing the model to jointly exploit both intra-view and inter-view cues during tracking. Compared with the two-stage framework, the GMT achieves significant improvements on existing datasets, with gains of up to 13.1% in CVMA in and 19.2% in CVIDF1. Moreover, we present VisionTrack, a high-quality, large-scale MCMT dataset encompassing diverse scenes with varying illumination and target distributions, providing significantly greater diversity than existing datasets. Our code and dataset will be released at https://github.com/FoxCanned/GMT.

Abstract:
Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a compact semantic extraction scheme encoding each 3D point as a low-dimensional semantic probability distribution, offering effective guidance with minimal storage. A bidirectionally-aligned fusion block merges geometric features with semantic context for more unified and consistent representations. Additionally, semantic priors guide the 2D-3D information exchange within the interaction framework from a high-level semantic perspective. Extensive indoor and outdoor experiments validate that SAG-GNN achieves state-of-the-art results in descriptor-free 2D-3D matching and visual localization, with low storage and strong generalization. Code is available at https://github.com/tinxu0203/SAG-GNN.

Abstract:
Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.

Abstract:
Model merging aims to combine multiple task-specific experts into a single model, but inter-task interference often causes severe degradation, especially when the experts are trained on heterogeneous objectives. Existing data-free methods are practical, yet largely rely on parameter-space heuristics without explicitly modeling task statistics. In this paper, we show that under a local linear approximation, the input covariance of each task can be estimated directly from its fine-tuning update, providing a principled bridge between parameter changes and data geometry in the data-free setting. Based on this insight, we propose ACE-Merging, a closed-form model merging framework with adaptive covariance normalization, a collective structural prior, and spectral refinement. Across both vision and language benchmarks, ACE-Merging achieves strong overall performance among data-free baselines. In particular, it improves the average score on GPT-2 by more than 4 points over prior methods, while also delivering strong scalability and competitive efficiency. Our code is available at https://github.com/unravel-xu/ACE-Merging/tree/main.

Abstract:
We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180deg misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project

Abstract:
Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.

Abstract:
Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at https://github.com/wenqi-cai297/earlyfusion-for-dd .

Abstract:
Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene. Codes are released at https://github.com/IRMVLab/WeatherCity.

Abstract:
Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding. Code at: https://github.com/xuzhang1199/HeROD.

Abstract:
Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: https://ustc3dv.github.io/DualReg/.

Abstract:
LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on https://github.com/JiansW/LEADER.

Abstract:
A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.

Abstract:
Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at: https://omerbenhayun.github.io/stall-video.

Abstract:
The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs by enriching each image with multiple complementary captions, while keeping the number of training images fixed. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million images achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models can be found at https://github.com/huangfu170/Role-SynthCLIP.

Abstract:
Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance. Our code is available at https://github.com/kobeshegu/DiverseDiT.

Abstract:
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performances while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception. Project page: https://simondemoreau.github.io/LiDAS/

Abstract:
Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning. Code: https://github.com/martinmamql/act2see.

Abstract:
In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exist a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, thus leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample efficiency, and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient, and generalizable VLA adaptation. The source code is publicly available at: https://github.com/Haochen-Niu/FAN-VLA.

Abstract:
Boundary representation (B-Rep) generation is a fundamental task in computer-aided design, yet the direct synthesis of high-fidelity and structurally valid B-Reps remains a major challenge. Existing deep generative methods suffer from two forms of brittleness: representation brittleness, caused by padding noise and feature contamination in the latent space, and generation brittleness, stemming from sequential error propagation and a train-inference mismatch due to non-differentiable validity enforcement. We propose HiFi-BRep, a novel framework that addresses these limitations through two synergistic contributions. First, a topology-aware encoder constructs a high-fidelity latent representation by eliminating padding via learnable queries and preventing feature contamination with topology-guided attention. Second, a single-stage decoder jointly predicts geometry and topology in parallel, embedding core manifold constraints as a differentiable learning objective. This design ensures mutual guidance between geometry and topology while avoiding cascaded errors. Extensive experiments show that HiFi-BRep significantly outperforms state-of-the-art methods in both structural validity and geometric fidelity, providing a robust solution for high-quality B-Rep synthesis. Code and models are publicly available at https://github.com/1nnoh/HiFi-BRep.

Abstract:
Existing methods for learning with noisy labels (LNL) predominantly rely on the small-loss trick, assuming that low-loss samples are more likely to be correctly labeled. While effective, this strategy suffers from two overlooked confirmation biases: (1) Class-level confirmation bias: samples from easy-to-learn classes tend to have lower losses, leading to over-selection of easy samples while ignore hard ones; (2) Instance-level confirmation bias: mislabeled samples with spuriously low loss are mistakenly treated as clean, forcing the model to memorize wrong labels. Both biases accumulate over training and degrade performance. To mitigate these issues, we propose Marginal Distribution Adjustment (MDA) and Candidate Class Selection (CCS). MDA dynamically reshapes the model's predicted class distribution toward uniformity, ensuring more fair sample selection across classes. CCS leverages training dynamics to identify likely true labels and removes them from the classification task, preventing memorization of incorrect annotations while converting weakly related labels into useful supervision. Both MDA and CCS are plug-and-play modules. Extensive experiments show that integrating MDA and CCS into either existing sample selectors or advanced LNL pipeline consistently enhances performance on both CIFAR-10/100 with synthetic noise and real-world datasets (CIFAR-N, Clothing1M, WebVision), demonstrating their broad applicability in LNL methods. The code is available at https://github.com/Aliinton/DSS.

Abstract:
Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.

Abstract:
Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., cats or dogs) to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics cause the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines.

Abstract:
ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin Transformers.

Abstract:
Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios. Code: https://github.com/eVI-group-SCU/FOZO

Abstract:
Reinforcement learning (RL) has emerged as an effective approach for improving video reasoning in multimodal large language models (MLLMs). However, existing methods remain inefficient for two reasons. First, training data are typically organized by task formats rather than underlying reasoning abilities, creating a mismatch that encourages models to learn task-specific patterns instead of transferable capabilities. As a result, improving ability generalization requires broad coverage over many ability-task combinations, making RL training costly. Second, prior work often compensates for this inefficiency with increasingly complex algorithmic designs, such as specialized temporal architectures or multi-objective reward frameworks, which further complicate training. To address these issues, we present VAST, a cognitive taxonomy and data organization framework that structures video understanding into three layers: Perception, Reasoning, and Cognition. Based on this taxonomy, we construct VAST-15K for training and VAST-Bench for evaluation. We further introduce VideoVAST, a reinforcement learning framework that uses consistency rewards to encourage alignment between reasoning traces and final answers, without architectural modifications. Experiments show that VideoVAST achieves 66.3% accuracy on MVBench and 57.4% on VAST-Bench, compared with 62.7% and 54.3% for Video-R1, while using approximately 72% fewer GPU hours and 96% fewer training samples under the same training settings. Code and resources are available at https://zhongan-wang.github.io/VAST.

Abstract:
Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate--then--select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout and spatial arrangement--that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.

Abstract:
Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity in conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structural Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments have shown that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at https://github.com/QUEAHREN/EventGait.

Abstract:
Supervised Fine-Tuning (SFT) of the language backbone plays a pivotal role in adapting Vision-Language Models (VLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered textual datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance in complex clinical scenarios. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex textual cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the "high-difficulty-high-influence" quadrant to balance complex clinical reasoning with substantial gradient influence. This enables efficient medical reasoning for VLMs with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables VLM backbones fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling. The code is available at https://github.com/mihara-bot/DIQ.

Abstract:
World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages dEformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performances across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM .

Abstract:
Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings. Code : \href https://github.com/Xi-Mu-Yu/FB-CLIP https://github.com/Xi-Mu-Yu/FB-CLIP .

Abstract:
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control -- specifying object dynamics and camera trajectories -- is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications. Video and code are available at the project page at https://cyberiada.github.io/LAMP/ .

Abstract:
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose DeDelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate DeDelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, DeDelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference---an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at https://github.com/InterDigitalInc/dedelayed .

Abstract:
Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA, as it demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset available at: https://drags99.github.io/bridge-eqa/

Abstract:
Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard.Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data.At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples.By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos.COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

Abstract:
We introduce Mistake Attribution (MATT), a new task for fine-grained understanding of human mistakes in egocentric videos. While prior work detects whether a mistake occurs, MATT attributes the mistake to what part of the instruction is violated (semantic role), when in the video the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs mistake samples from existing datasets with attribution-rich annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M---two datasets up to two orders of magnitude larger than prior mistake datasets.We then present MisFormer, a unified attention-based model for mistake attribution across semantic, temporal, and spatial dimensions, trained with MisEngine supervision. A human study demonstrates the ecological validity of our MisEngine-constructed mistake samples, confirming that EPIC-KITCHENS-M and Ego4D-M can serve as reliable benchmarks for mistake understanding. Experiments on both our datasets and prior benchmarks show that MisFormer, as a single unified model, outperforms task-specific SOTA methods by at least 6.66%, 21.81%, 18.7%, and 3.00% in video-language understanding, temporal localization, hand-object interaction, and mistake detection, respectively. Project page: https://yayuanli.github.io/MATT/

Abstract:
In hematoxylin-eosin (H&E) to virtual immunohistochemistry (IHC) staining, paired images enable supervised learning but suffer from inherent spatial dislocation, limiting pixel-level constraints. Thus, auxiliary tasks have been increasingly employed with paired data to provide complementary supervision. However, existing methods largely overlook the rich semantic information embedded in auxiliary task models. This paper proposes a novel framework for virtual IHC staining guided by dual-aligned multi-task features, which fully explores semantic cues from auxiliary tasks. To realize effective guidance, we address two obstacles: (1) the spatial mismatch between paired H&E and IHC feature representations; (2) the task gap between auxiliary task features and virtual staining features. To resolve the spatial mismatch, we generate an alignment matrix that aligns H&E and IHC features. Specifically, we first introduce structure-enhanced learning to restore semantic consistency in regions affected by inaccurate staining in virtual IHC images. Then, we separately cluster features from virtual IHC and real IHC images, and establish semantic correspondences using an active-passive matching mechanism. This ensures that only semantically aligned regions are matched, reducing the impact of staining variability on the alignment matrix. To bridge the task gap, we introduce a task-gap alignment module trained under the principle that auxiliary features are considered aligned if they improve the performance of the virtual IHC staining model. Extensive experiments on two public datasets with four biomarkers demonstrate the effectiveness of our framework. Code is available at https://github.com/U-RBook/VSMT.

Abstract:
Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE.

Abstract:
Recent advances in remote sensing have led to an increase in the number of available foundation models; each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation.We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural networks weights on-demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities.Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities.By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at \href https://hsg-aiml.github.io/GeoSANE/ hsg-aiml.github.io/GeoSANE/ .

Abstract:
Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stack-able layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8% on 7Scenes (2D-3D) and 10% on 3DLoMatch (3D-3D) in registration recall. Code is provided at \href https://github.com/neu-vi/UniCorrn github.com/neu-vi/UniCorrn .

Abstract:
Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.

Abstract:
Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.

Abstract:
Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.

Abstract:
We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely "paint" user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15--20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.

Abstract:
While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

Abstract:
Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

Abstract:
Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher F_beta^omega measure and 43% lower MAE (M) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS.

Abstract:
Artistic design, particularly poster design, often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional workflows. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor test-time fine-tuning: users can simply provide cropped glyph patches corresponding to their desired typography--even if the font is not included in any standard library. Extensive experiments on multiple benchmarks demonstrate that SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design. Code and models are publicly available at https://github.com/SkyworkAI/SkyReels-Text.

Abstract:
Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/

Abstract:
Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at: https://github.com/showlab/Demo2Tutorial

Abstract:
Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios--strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at \href https://github.com/NorahGreen/InteractTrack.git this URL .

Abstract:
Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions--including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.

Abstract:
Vision-language models (VLMs) have achieved remarkable success across numerous domains, yet they lag significantly in animal behavior understanding due to severe data scarcity. Annotated animal behavior videos are prohibitively expensive and time-consuming to collect, requiring domain expertise and controlled observation conditions. To address this challenge, we leverage structured domain knowledge as an inductive bias from the Neuro Behavior Ontology (NBO), which provides professional annotations, hierarchical behavior structures, and comprehensive semantic coverage. We construct AnimalBand, an NBO-consistent dataset integrating 74,671 videos across multiple species and behaviors with semantic standardization and extended knowledge. Based on this resource, we present EthoCLIP, an ontology-enhanced vision-language contrastive learning framework that embeds ontology semantics through an ontology-aware graph module to capture hierarchical relationships among behaviors and learn structured semantic dependencies. Incorporating ontological information reduces reliance on purely data-driven learning, thereby alleviating needs for large-scale datasets. Extensive experiments validate both our dataset and method. Results demonstrate that EthoCLIP pretrained on AnimalBand substantially improves behavior recognition accuracy and transfer learning performance across diverse benchmarks, confirming that ontology-driven semantic enrichment effectively mitigates data scarcity in animal behavior understanding. Our data and code will be released at https://github.com/PRIS-CV/AnimalBand.

Abstract:
Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., \pi^3), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at:https://github.com/zjutcvg/GeoMotion.

Abstract:
Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility. All data and code will be released at https://earth-insights.github.io/ZoomEarth.

Abstract:
Diffusion models have recently advanced video restoration, but applying them to real-world and AIGC-generated video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and near real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework for efficient video super-resolution. FlashVSR runs at 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that reduces redundant computation while bridging the train--test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset containing 120 K videos and 180 K images. Extensive experiments demonstrate that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speed-up over prior one-step diffusion-based VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based video super-resolution.

Abstract:
Most existing evaluations of generated videos adopt a no-reference paradigm. Although recent benchmarks cover multiple dimensions and show moderate correlation with human preferences, relying solely on textual prompts weakens real-world constraints and makes it difficult to produce accountable and interpretable judgments on instance-level issues such as target behavior deviation, temporal inconsistency, and commonsense violations. In scenarios with explicit expectations, such as controlled generation, reference videos naturally provide rich, unambiguous spatio-temporal evidence, enabling stricter and more trustworthy assessment. Motivated by this, we propose Ref4D-VideoBench, a reference-based, fine-grained, multi-dimensional benchmark for generated video evaluation. Ref4D-VideoBench contains 600 high-quality reference videos with tightly evidence-bounded prompts, and introduces a 12-metric structured evaluation suite along four key dimensions: basic semantic alignment, motion consistency, event temporal consistency, and world knowledge consistency. Experiments on eight text-to-video models show that our method achieves stronger agreement with human judgments than representative no-reference frameworks. Our code is available at https://github.com/TAILab-W/Ref4D-VideoBench.

Abstract:
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation. Project page: https://zhangdaxia22.github.io/MindPower/.

Abstract:
We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

Abstract:
Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. https://github.com/Mr-Talon/MedKCO.

Abstract:
Precise and controllable image editing remains a significant challenge. Current methods often rely on text prompts, but achieving accurate spatial localization solely through descriptions is inherently difficult. Mask-based approaches, though offering better control, typically require overly precise user annotations, thus increasing user burden and leading to unnatural results. To bridge this gap, we introduce the Interactive Instruction-based Image Editing (I^3E) task, which generates high-quality edits from a more intuitive combination: concise text instructions and imprecise spatial guidance. To address the critical lack of suitable data, we propose an efficient pipeline to generate Inter-Edit, a new million-scale training dataset that simulates realistic user masks---not strictly segment-aligned. We also present a comprehensive benchmark, featuring a meticulously human-annotated test set that captures diverse, localization-dependent editing scenarios and realistic user interaction patterns. To evaluate this task, we introduce a new suite of position-aware metrics that strongly correlate with human perceptual judgments. Finally, we develop three baseline models trained on Inter-Edit. Extensive experiments demonstrate that our methods significantly enhance I_3E performance, achieving substantial improvements in localization and edit quality, and outperforming existing state-of-the-art models. The Inter-Edit dataset and all related code are publicly available at https://github.com/Delong-liu-bupt/Inter-Edit.

Abstract:
Although EDM aims to unify the design space of diffusion models, its reliance on fixed Gaussian noise prevents it from explaining emerging flow-based methods that diffuse arbitrary noise. Moreover, our study reveals that EDM's forcible injection of Gaussian noise has adverse effects on image restoration task, as it corrupts the degraded images, overextends the restoration distance, and increases the task's complexity. To interpret diverse methods for handling distinct noise patterns within a unified theoretical framework and to minimize the restoration distance, we propose EDA, which Elucidates the Design space of Arbitrary-noise diffusion models. Theoretically, EDA expands noise pattern flexibility while preserving EDM's modularity, with rigorous proof that increased noise complexity introduces no additional computational overhead during restoration. EDA is validated on three representative medical image denoising and natural image restoration tasks: MRI bias field correction (global smooth noise), CT metal artifact removal (global sharp noise) and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, competitive results against specialized methods across medical and natural tasks demonstrate EDA's strong generalization capability for image restoration. Code is available at: https://github.com/PerceptionComputingLab/EDA.

Abstract:
As a challenge vision-language task, Composed Image Retrieval (CIR) aims to integrate information from a bi-modal query (image + text) to retrieve target images. While supervised CIR has achieved notable success in domain-specific scenarios, its reliance on manually annotated triplets restricts its scalability and application. Zero-shot CIR alleviates this by leveraging unlabeled data or automatically collected triplets, yet it often suffers from an intractable domain gap. To this end, we shift the focus to developing robust CIR models under limited labeled data and propose Domain-Adaptive In-context Generation (DAIG), which adapts the in-context capability of a pretrained Text-to-Image (T2I) model to the target domain and the CIR task using few-shot samples and then transforms the LLM-generated textual triplets into unbiased CIR triplets as additional training data. After that, we present a two-stage framework applicable to any supervised CIR approach. The first stage, Distributionally Robust Synthetic Pretraining (DRSP), perturbs visual features to expand the distribution of synthetic data and improve training robustness on it. The second stage, Fine-grained Real-world Adaptation (FRA), fine-tunes on manually annotated triplets by imposing an angular margin on matching pairs to facilitate fine-grained learning. Experiments on two benchmarks validate the effectiveness of our method, i.e., under both few-shot and fully supervised CIR settings, DAIG yields substantial performance gains over CLIP4CIR, BLIP4CIR, and SPRC. The code is available at \href https://github.com/JThuge/DAIG \color blue!37https://github.com/JThuge/DAIG .

Abstract:
Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion.However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates.To address it, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs.It offers several advantages: High compression efficiency, replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); Low-cost and reproducible training, 300x faster training than Stable Diffusion ( 20 vs. 6,250 A100 GPU days) on entirely open image-only datasets; Providing new insights, e.g., We find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters.We hope CoD lays the foundation for future diffusion codec research. Codes are released at https://github.com/microsoft/GenCodec/tree/main/CoD.

Abstract:
Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While these geometric rewards are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by contrasting the caption's semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at https://github.com/mrazhou/DPAD.

Abstract:
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods. Code is available at: https://github.com/zhenghao2519/SpaceDrive.

Abstract:
Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.

Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using a composed query of a reference image and a textual modification, without relying on triplet-based supervision. As the two inputs describe related but semantically unaligned information, the key challenge lies in interpreting their cross-modal discrepancy to infer the user's intended semantic modification. Existing ZS-CIR methods mainly adopt a consistency-driven paradigm, training on semantically aligned image-text pairs with alignment or reconstruction objectives. This paradigm enforces cross-modal agreement but overlooks the semantic discrepancies between modalities that naturally arise during inference. To address this issue, we propose DiffComp (Differentiate-then-Compose), a difference-driven self-supervised framework that actively induces and exploits cross-modal discrepancies during training. It stimulates the model to perceive and reconcile semantic differences across visual and textual modalities, thereby improving consistency between training and inference. The framework consists of three components: Contextual Semantic Super-patches that provide localized and coherent visual representations for downstream perception and composition; Phrase-guided Masking that selectively removes text-aligned visual cues to induce controlled cross-modal discrepancies; and Difference-aware Composition that adaptively integrates visual and textual features according to their degree of semantic difference. Extensive experiments on four ZS-CIR benchmarks show that DiffComp achieves state-of-the-art performance and strong generalization. Our code will be released at https://github.com/Orange1999/DiffComp.

Abstract:
Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite the rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model's prediction at t-1. To leverage videos' temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this \underline R esidual \underline F low \underline D iffusion \underline M odel (\methodname), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods by faithfulness for video editing tasks. Trained on paired video data for global/local style transfer and object removal, \methodname surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in \href https://smsd75.github.io/RFDM_page/ RFDM page .

Abstract:
Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inconsistency, where semantically equivalent expressions yield disparate spatial attention patterns. This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications. To address this issue, we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP. SynCLIP introduces a Semantic-consistent Spatial Attention alignment (SSA) module to enhance spatial attention consistency by minimizing discrepancies between attention maps of original and synonymous expressions. Furthermore, a Spatial Attention Refinement (SAR) module selectively strengthens the most semantically relevant spatial regions within aligned maps for more precise and stable grounding. To support synonym-coherent pretraining, we also construct a Synonym-Enriched Visual Corpus (SEViC), which augments each category with multiple synonyms and textual definitions. Extensive experiments on multiple benchmarks demonstrate that SynCLIP substantially improves grounding consistency under diverse linguistic variants and achieves state-of-the-art performance among CLIP-based OVDP methods. Code is available at https://github.com/Justlovesmile/SynCLIP.

Abstract:
Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified 360deg vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

Abstract:
Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code are available at https://github.com/PJD-WJ/ETCTrack.

Abstract:
Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.

Abstract:
Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs-the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA. The project is at https://github.com/jzhws/VITAL-Series.

Abstract:
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods.The codes is available at https://github.com/xiangyx2020/GuardTrace-VL.

Abstract:
Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep-real/GenCircuit.

Abstract:
Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose a Cross-Image Reasoning Model (CIRM), integrating a Dual-Stage Bridge Module and Relevance-Guided Fusion Module to model inter-image dependencies and cross-modal correspondences. Complementarily, we establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0, and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios. Dataset and code are publicly available at https://github.com/ZHCMOONWIND/MMSD3.0.

Abstract:
Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs. Code is available at https://github.com/winston52/HistoSelect.

Abstract:
Existing single object trackers typically treat temporal modeling superficially by passing limited inter-frame information, such as propagated tokens or template updates, without intrinsic temporal supervision learning. To address this limitation, we propose TGTrack, a new unified tracking framework that incorporates a temporally generative supervision task to guide the model in learning temporal dynamics. The core of TGTrack is a temporally generative learning paradigm equipped with a transformer-based generative decoder, which consists of a gated fusion module and an autoregressive prediction mechanism. This joint design enables the model to infer future scenarios from preceding information, thereby improving its ability to model both visual appearance and temporal dynamics. Furthermore, we introduce a time token embedding to explicitly encode the temporal position of each frame. Experiments on 11 benchmarks spanning five modalities show that TGTrack achieves state-of-the-art performance in robust unified tracking. For instance, TGTrack-B384 achieves an AUC of 75.3% on LaSOT. Code is available at https://github.com/wtg1/TGTrack.

Abstract:
Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available \href https://github.com/Stanford-AIMI/MASS here .

Abstract:
Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. Our project page is at: https://wangmiaowei.github.io/BiMotion.github.io/.

Abstract:
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, leveraging diffusion language models' bidirectional attention to capture spatial relationships in visual data more effectively than causal, sequential processing. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and shows promising data scaling behavior on several benchmarks. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation. To facilitate future research, we open-source LLaDA-V together with its training and evaluation code at https://github.com/ML-GSAI/LLaDA-V.

Abstract:
Crack segmentation (CS) is crucial for structural inspection and maintenance in production scenarios. To achieve both high accuracy and efficiency, recent methods have adopted Mamba-based architectures built upon state space models (SSMs), which enable linear-complexity modeling of long-range dependencies. However, existing approaches typically rely on static multi-directional scanning to flatten visual features into sequences. This fixed flattening order disrupts spatial continuity and weakens the SSM's ability to model irregular crack patterns effectively. To address this limitation, we propose CrackSSM, a novel crack-aware segmentation framework featuring a dynamic scanning strategy that adapts the token sequence to the underlying structure of each image. Specifically, we compute directional response strength along four orientations from high-level semantic features, and use these values to reorder tokens so that crack-relevant regions remain adjacent in sequence. This alignment improves the causal modeling ability of SSMs while preserving their efficiency and better suits the irregular, fine-grained nature of cracks. Additionally, we design a wavelet-guided decoding mechanism to recover detailed features. It incorporates high-frequency components extracted from the input image and applies them to guide feature refinement and edge-aware fusion, further enhancing segmentation precision. Experiments on three benchmark datasets demonstrate that our method achieves superior segmentation accuracy with fewer parameters and faster inference compared to existing state-of-the-art models. The source code for this work is available at https://github.com/hby123123/CrackSSM.

Abstract:
The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment in resource-constrained unmanned aerial vehicles (UAVs). More importantly, this raises ambiguity whether the performance gains come from the temporal learning itself, or come from the increased inputs and parameters. In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. To achieve this goal, we introduce a new evaluation metric, i.e., precision per FLOP (PPF). The PPF is introduced to quantify the tracking precision gains achieved by temporal learning components per unit of FLOP, thus enabling fair and efficiency-aware comparisons among these components and driving them toward more efficient designs. Based on this metric, we propose a low-cost yet effective temporal learning (LETL) approach to model contextual relationships. This approach continuously propagates and merges representative appearance tokens in video streams, allowing the tracker to efficiently capture the changing patterns of targets with relatively low costs. We integrate the LETL approach into existing one-stream frameworks, thereby building a simple yet effective tracker, namely LETrack, for robust UAV tracking. Extensive experimental demonstrate the superiority of our LETrack, and show that the proposed LETL approach achieves higher PPF scores. The Code is available at https://github.com/GXNU-ZhongLab/LETrack.

Abstract:
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show ALG significantly improves the temporal dynamics of generated videos, while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33% average improvement across models in dynamic degree while maintaining the original video quality. For additional visualizations and source code, see the project page.

Abstract:
The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G^2RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G^2RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.

Abstract:
Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance with target-domain classes during the target-domain finetuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our codes are available at https://github.com/shuaiyi308/TIR.

Abstract:
Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/

Abstract:
Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that performs end-to-end color correction and jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Code, models and dataset available at: https://lucacogo.github.io/Mobile-Spectral-CC/.

Abstract:
Pose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution by enabling keyframe interpolation, which can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose a framework ExPose that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Preference Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal interpolation, but also ensures spatial coherence across views, significantly improving pose estimation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines, highlighting the potential of preference-optimized video generation as a powerful tool for pose estimation in extreme-view scenarios. Code is available at https://github.com/yh-yoon/ExPose.

Abstract:
Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs. Code is available at https://github.com/weeknan/CoVFT.

Abstract:
Compared to untargeted attacks, targeted transfer-based attack still suffers from much lower Attack Success Rates (ASRs), although significant improvements have been achieved by kinds of methods, such as diversifying input, stabilizing the gradient, and re-training surrogate models. In this paper, we find that adversarial examples generated by existing methods rely heavily on a small subset of surrogate model parameters, which limits their transferability to unseen target models. Inspired by this finding, we propose Random Parameter Pruning Attack (RaPA), which introduces parameter-level randomization during the attack process. At each optimization step, RaPA randomly prunes model parameters to generate diverse yet semantically consistent surrogate variants. We show that this parameter-level randomization is equivalent to adding an importance-equalization regularizer, thereby alleviating the over-reliance issue. Extensive experiments across both CNN and Transformer architectures demonstrate that RaPA substantially enhances transferability. In the challenging case of transferring from CNN-based to Transformer-based models, RaPA achieves up to 11.7% higher average ASRs than state-of-the-art baselines (with 33.3% ASRs), while being training-free, cross-architecture efficient, and easily integrated into existing attack frameworks. Code is available on https://github.com/molarsu/RaPA.

Abstract:
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. Meanwhile, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. Building on these advances, we propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict an action chunk and a future 3D latent, which decodes into a dense point map. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.

Abstract:
Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune the fusion network through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics. Code is available at https://github.com/ALKA-Wind/EVAFusion.

Abstract:
Restoring degraded document image is essential for both improving visual quality and optimizing performance in downstream document analysis tasks. Although existing methods have demonstrated substantial improvements in restoration outcomes, they primarily address single-type degradation scenarios. Current approaches typically necessitate training multiple specialized models for specific degradation types or rely on explicit prior knowledge of degradation patterns to guide the training process. To overcome these limitations, we propose MMDIR, a multimodal instruction-driven framework designed for document image restoration under mixed and uncertain degradation conditions. By leveraging semantically structured instructions, MMDIR dynamically identifies present degradation types (blur, shadow, text watermark, and seal), while enhancing degradation-aware representation learning. Furthermore, we introduce a novel benchmark named MixedDoc comprising complex mixed degradations, where each image contains randomized combinations of the aforementioned types. This benchmark addresses a critical gap in existing datasets, which lack realistic multi-degradation samples and often overlook common obstructions such as seals and text watermarks. The effectiveness of our approach is thoroughly validated across both released public benchmarks and our newly proposed dataset. The dataset is available at https://github.com/xiaomore/MMDIR.

Abstract:
Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.

Abstract:
Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision--language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark. Project code and benchmark will be released at https://github.com/Zh1fen/SDDF.

Abstract:
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with rational, (2) all agents collaboratively scores one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real world long videos while preserving high quality video reasoning. Evaluation Results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.

Abstract:
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.

Abstract:
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner via, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. The code is accessible at \href https://luling06.github.io/I-Scene-project/ the project page .

Abstract:
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of scanners.We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model.Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with the realistic detail priors while preserving the structural consistency with the 3D-native geometry. While our scheme is general to different 3D-native generators, we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance. Project Page: https://liangsanzhu.github.io/photo3d-page/

Abstract:
Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert-subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS)--and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, l2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding. Code is available at https://github.com/QunjieHuang/SATTC-CVPR2026

Abstract:
Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this way leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution (t \to 0). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint.Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach. Code is available at https://github.com/gyr02/NADB.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence. While promoting token-level diversity has shown promise in mitigating entropy collapse, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a framework that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX consistently incentivizes reasoning capability and outperforms existing token-level methods. Our project is available at https://github.com/ZhangShimin1/ReLaX.

Abstract:
Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based Optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation. Our code is available at https://github.com/yl4467/CogniEdit.

Abstract:
Achieving spatial alignment between whole-slide images (WSIs) across stains is essential for integrating morphology and molecular information, yet it remains challenging due to extreme resolution, tissue fragmentation, and large nonlinear deformations. Conventional registration pipelines depend on global pre-alignment and spatial consistency, which often collapse under such distortions. We present SAR2Net, a framework that learns spatially anchored representations and reformulates cross-stain alignment as a region-level feature retrieval problem. Instead of estimating explicit transformations, SAR2Net learns pointwise representations encoding the relative spatial relationships to tissue landmarks. Given landmarks and arbitrary coordinates, it predicts spatially anchored features that serve as deformation-invariant descriptors of tissue topology. A two-stage retrieval framework then establishes correspondences between slides, even when global alignment is infeasible. Experiments on biopsy-oriented HE-IHC datasets show that SAR2Net achieves robust region-level alignment under severe tissue distortions, outperforming previous methods. Code is available at https://github.com/Shentl/SAR2Net

Abstract:
Class-Incremental Learning (CIL) aims to develop models to continuously learn new classes without forgetting learned old ones. Recent advances combine pre-trained models with parameter-efficient fine-tuning, achieving promising results. However, these approaches typically allocate new trainable parameters for each task, causing the model size to grow linearly with task number. Moreover, they lack explicit mechanisms to structure a coherent and discriminative representation space across tasks. To address these limitations, we propose Representation-Steered Incremental Adapter Tuning (RSIAT). RSIAT maintains a single shared adapter for all tasks, eliminating parameter growth during incremental learning. In the base task, we introduce a representation-steering loss that enhances discriminative feature learning while facilitating future task adaptation. During incremental tasks, a residual autoencoder-based projector aligns feature distributions between old and new tasks, preserving representation consistency without over-constraining the shared adapter. Extensive experiments on six CIL benchmarks demonstrate that RSIAT significantly outperforms state-of-the-art methods in both performance and parameter efficiency, achieving superior stability-plasticity trade-offs with minimal trainable parameters. Code is available at https://github.com/zjrzjrz/RSIAT.

Abstract:
Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: https://github.com/taozh2017/SemiGDA.

Abstract:
Dexterous grasp generation is a predominant task that enables robots to perform human-level manipulation. However, a dexterous hand always maintains high-dimensional DoF and actuation space, making existing approaches that rely on holistic latent representations difficult to produce high-quality and semantically aligned grasps. In this paper, we propose MaskDexGrasp to address these challenges. We first present a part-aware grasp tokenizer that decomposes dexterous grasps into discrete tokens, facilitating compositional modeling of anatomical dependencies. Building upon this representation, a bidirectional masked grasp transformer is then developed to predict grasp tokens conditioned on object geometry and task description, ensuring coherent grasp generation while allowing fine-grained part-level editing. To facilitate evaluation, we construct a dexterous grasp dataset that comprises 65K grasping instances and 260K richly annotated descriptions covering 11 tasks. Comprehensive experiments demonstrate that our method achieves the state-of-the-art performance. Project page is available at https://binghui-z.github.io/MaskDexGrasp/.

Abstract:
Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLM's navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://makeabilitylab.github.io/CapNav/

Abstract:
Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation. The code is available at https://github.com/XLearning-SCU/2026-CVPR-NSP.

Abstract:
Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at \href https://github.com/MCG-NKU/nku-3d-vision repository .

Abstract:
We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in self-supervised learning clustering methods. Existing approaches assign image features to large codebooks via clustering algorithms such as Sinkhorn-Knopp, but they often overlook the inherent ambiguity in cluster semantics. To address this, we introduce a multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, producing higher-quality dense representations. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations.This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community.

Abstract:
Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/

Abstract:
Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.

Abstract:
Compared to images, videos better reflect real-world acquisition and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos due to the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion and the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. To this end, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible videos comprising 153,797 frames, bridging the data gap. Secondly, we propose VideoFusion, a multi-modal video fusion model that exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequences, effectively mitigating temporal inconsistency and interference. Project and M3SVD: https://github.com/Linfeng-Tang/VideoFusion.

Abstract:
Maritime multimodal vision faces significant challenges due to the complexity and variability of oceanic weather and environmental conditions. While modern vessels are commonly equipped with visible and infrared imaging systems, the complementary nature of these modalities fundamentally depends on accurate cross-modal registration. However, the absence of paired visible-infrared datasets that realistically capture diverse maritime scenarios has severely hindered progress in this field. To overcome this limitation, we present MMVIP, the first large-scale visible-infrared maritime vision dataset covering a wide spectrum of weather conditions and sea states. The dataset contains 128,100 images and 50 video sequences with precise spatial-temporal alignment. Comprehensive evaluations across image registration, fusion, maritime object detection, and cross-modal image translation tasks demonstrate the dataset's effectiveness and challenge. Furthermore, MMVIP establishes a new benchmark for advancing multimodal maritime perception. The dataset and corresponding benchmarks are publicly available at: https://github.com/yyppptjr/MMVIP.

Abstract:
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) Cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.

Abstract:
Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: https://kolakivy.github.io/AFRO/

Abstract:
We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set \Gamma^ (t) _\delta(x), yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget--expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently. Code is available at \href https://github.com/AryaFayyazi/PoP https://github.com/AryaFayyazi/PoP .

Abstract:
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Project page: https://sarthakm320.github.io/CLIPoint3D.

Abstract:
Recent subject-driven image customization excels in fidelity, yet fine-grained instance-level spatial control remains an elusive challenge, hindering real-world applications. This limitation stems from two factors: a scarcity of scalable, position-annotated datasets, and the entanglement of identity and layout by global attention mechanisms. To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. First, we present BMPDS, the first automatic data-synthesis pipeline for position-annotated multi-subject datasets, effectively providing crucial spatial supervision. Second, we design a lightweight, layout-aware diffusion framework that integrates a novel visibility-aware attention mechanism. This mechanism explicitly models spatial relationships via an NeRF-inspired volumetric weight regulation to effectively decouple instance-level spatial embeddings from semantic identity features, enabling precise, occlusion-aware placement of multiple subjects. Extensive experiments demonstrate PositionIC achieves state-of-the-art performance on public benchmarks, setting new records for spatial precision and identity consistency. Our work represents a significant step towards truly controllable, high-fidelity image customization in multi-entity scenarios.Code and data: https://github.com/MeiGen-AI/PositionIC.

Abstract:
As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance.When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo and code are available at: https://github.com/Dixin-Lab/SSMP.

Abstract:
We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code is available in: https://github.com/HKUST-SAIL/UniTEX

Abstract:
Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained stable-diffusion (SD) model for one-step SR with a fixed timestep. However, since SD performs different generative priors at different timesteps, a fixed timestep is difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this issue, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD's generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, though utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step. The source code is released at https://github.com/zty557/TADSR

Abstract:
Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology.While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/

Abstract:
High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.

Abstract:
Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve the random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction space by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms the state-of-the-art video SSL baselines, therefore learning more discriminative and generalizable representations Code available at https://github.com/rvandeghen/TrackMAE

Abstract:
The scarcity of labeled fine-grained data presents a significant challenge for deep clustering. Vision-Language Models (VLMs) on existing coarse-grained datasets (characterized by high inter-class and low intra-class variance) struggle to capture the subtle distinctions essential for fine-grained categorization, leading to suboptimal clustering performance. To address this, we propose a novel framework that adapts VLMs for fine-grained clustering without requiring fine-grained labels. Our method steers the model to focus on discriminative fine-grained features by integrating a Bayesian nonparametric process with a tailored representation learning objective, which includes low-rank guidance and orthogonal guidance. This allows our model to dynamically discover clusters that reflect fine-grained categories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple fine-grained benchmarks. Code is available at https://github.com/HenryWells02/VLM-Fine-Clustering.

Abstract:
Object removal requires eliminating not only the target object but also its associated visual effects such as shadows and reflections. However, diffusion-based inpainting and removal methods often introduce artifacts, hallucinate contents, alter background, and struggle to remove object effects accurately. To address these challenges, we propose ObjectClear, a novel framework that decouples foreground removal from background reconstruction via an adaptive target-aware attention mechanism. This design empowers the model to precisely localize and remove both objects and their effects while maintaining high background fidelity. Moreover, the learned attention maps are leveraged for an attention-guided fusion strategy during inference, further enhancing visual consistency. To facilitate the training and evaluation, we construct OBER, a large-scale dataset for OBject-Effect Removal, which provides paired images with and without object-effects, along with precise masks for both objects and their effects. The dataset comprises high-quality captured and simulated data, covering diverse objects, effects, and complex multi-object scenes. Extensive experiments demonstrate that ObjectClear outperforms prior methods, achieving superior object-effect removal quality and background fidelity, especially in challenging scenarios.

Abstract:
The F-measure is a widely used metric in multi-label classification, where multiple labels are predicted simultaneously for a single instance. The optimal prediction rule for F-measure requires estimating q^2+1 probabilities, where q is the number of labels. Existing approaches train q multinomial estimators (multi-class classifiers) to directly estimate these probabilities, followed by a matrix multiplication for making predictions. However, this method has two major drawbacks. First, the matrix multiplication incurs a time complexity of \mathcal O (q^3), which becomes computationally expensive for large q. Second, training multinomial estimators is challenging due to the sparsity of the underlying distributions, which results from the inherent imbalance in multi-label datasets and is further exacerbated by the label transformation required by the method itself. In this paper, we first demonstrate that matrix multiplication can be reformulated as a series of convolutions by exploiting a special structure in the matrix. These convolutions can then be efficiently computed using the Fast Fourier Transform (FFT), reducing the time complexity to \mathcal O (q^2\log q). To avoid multinomial label transformation, we propose an indirect sampling-then-estimation approach to estimate the required probabilities. This method trains only q binary estimators instead of multinomial ones, thereby alleviating the sparsity issue, simplifying the training process, and improving performance. We provide theoretical guarantees for the consistency of the proposed sampling-based method and demonstrate its effectiveness through extensive experiments on diverse datasets. The code is available in: https://github.com/ZixunWang/MLC-F1-Sampling.

Abstract:
Bamboo slips are essential media for recording ancient East Asian civilizations, but excavated slips often suffer severe deformation due to dehydration and stress effects, creating substantial challenges for restoration. Traditional manual restoration is time-consuming and risks damage, while existing generative models struggle with the complex non-linear deformations in bamboo materials.We propose a novel framework for inverse restoration of deformed bamboo slips that provides a progressive physical deformation modeling with stepwise inverse displacement prediction. Our approach establishes a computable mathematical model of deformation based on wood fiber microstructure and stress-diffusion coupling effects, enabling the forward process to simulate physically plausible deformation trajectories as a deterministic, physics-driven progressive evolution. The inverse process transforms from predicting abstract noise to learning physically meaningful inverse displacement fields that progressively restore deformations.Experimental results show substantial gains in restoration fidelity while preserving delicate textual features, enabling the reliable correction of complex non-linear deformations that defeat traditional techniques. By integrating physical insights into bamboo material behavior with progressive restoration modeling, this work establishes a new paradigm for digital archaeological restoration--one that holds significant potential to transform how deformed cultural relics are reconstructed and studied. Code is available at https://github.com/VillanelleQQ/PGDR-BambooSlips

Abstract:
Federated learning (FL) enables collaborative training across clients while preserving privacy. While most existing FL methods assume homogeneous model architectures, client heterogeneity in both data and resources makes this assumption impractical, thus motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. Specifically, each client aggregates its local representations into a single entangled representation using normalized random weights, and then applies the same weights to integrate the corresponding one-hot label encodings into an entangled-label encoding. Both are subsequently uploaded to the server to train a global classifier. During training, each entangled representation is supervised across categories via its entangled-label encoding, while random weights are re-sampled at each round to introduce diversity, alleviating overconfidence in the global classifier and yielding smoother decision boundaries. Moreover, each client uploads a single entangled representation along with its entangled-label encoding, mitigating the risk of representation inversion attacks and reducing communication overhead. Extensive experiments demonstrate that FedRE achieves an effective trade-off among model performance, privacy protection, and communication overhead. The codes are available at https://github.com/AIResearch-Group/FedRE.

Abstract:
Vision-language models (VLMs) have demonstrated strong potential for adapting to downstream biomedical tasks with limited training samples. However, their generalization to unseen classes within the same dataset remains limited, as the image-text alignment semantics often rely on spurious cues present in seen classes that do not transfer. To tackle this, we propose BiomedCCPL (Causal Conditional Prompt Learning), a framework that uses VGAP (Visual Grounder with Adaptive Prototype) to generate image-conditional prompts from multi-scale adaptive prototypes and employs SCD (Synergistic Causal Disentanglement) to regularize the generation of image-conditional prompts. Guided by insights from a causal analysis of generalization to unseen classes, SCD leverages multiple synergistic learning objectives to perform front-door adjustment, ensuring that the dynamically generated image-conditional prompts focus on underlying diagnostic image features shared across seen and unseen classes. Experiments on 11 datasets across 9 modalities demonstrate that BiomedCCPL effectively enhances the model's data efficiency and generalization ability. In particular, on the Base-to-Novel task, BiomedCCPL achieves an average HM of 79.98%, surpassing the previous state-of-the-art by 6.45%. Code is available at https://github.com/burgers0708/BiomedCCPL.

Abstract:
In this work, we exploit class dissimilarities to provide complementary learning information beyond correct classification, that is not fully utilized in existing learning paradigms. To model these dissimilarities, we introduce the concept of an opposite-class, which consists of everything that is not part of a corresponding class, i.e., all samples from non-target classes or samples from unknown classes. By setting appropriately encoded target distributions over the non-target classes, we explicitly optimize the model's activation distributions across all non-target classes, which enhances class dissimilarity information and enables better control over the geometry of the representations. We analyze the convergence dynamics of our proposed approach, both theoretically and empirically, showing that it naturally pushes the representations towards neural collapse, leading to more discriminative and robust features. Our extensive evaluation across multiple classification settings demonstrates consistent improvements of our method on closed-set, open-set, few-shot classification, and domain generalization. Code: https://github.com/katdimitris/Complementary-Dissimilarity-Loss

Abstract:
Monocular human motion capture in occlusion scenarios presents significant challenges. Although a few works have explicitly considered the occlusion problem, image-based methods are unreliable due to the lack of temporal constraints while video-based approaches cannot gain sufficient knowledge from time domain motion priors to address long-term occlusions. However, occluded human motion typically exhibits periodic patterns and consistent momentum. Inspired by this observation, we exploit reliable image observations in frequency domain and formulate the motion capture task as a wavelet coefficients selection process. Specifically, we first construct probabilistic distributions for the occluded 2D keypoints, and then introduce a frequency domain diffusion model to refine the distributions by learning long-term periodic information and physical momentum with Discrete Wavelet Transform (DWT). Consequently, the learned denoising prior can select valid wavelet components to facilitate the 3D motion capture with a 3D decoder. By employing a joint reprojection strategy, we can also use the same diffusion process to train the 3D decoder. To further promote human occlusion-related tasks, we also present the first 3D occluded motion dataset, OcMotion, which serves as a new benchmark for both training and evaluation. Experimental results demonstrate that our method can produce accurate and coherent human motions from occluded videos. More information is available at https://github.com/boycehbz/FreqMotion.

Abstract:
Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset is available at https://github.com/EIT-NLP/MCMR

Abstract:
Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits headwise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget-assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.

Abstract:
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360deg panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/

Abstract:
With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliably generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at https://github.com/Yanmou-Hui/DoU.

Abstract:
Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decomposes the scene into intrinsic components and explicitly models light transport to maintain global lighting effects consistency. Furthermore, we introduce an intrinsic-space inpainting module that operates directly in the material and lighting domains, effectively addressing the challenges posed by non-Lambertian surfaces. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework substantially improves the physical consistency and visual coherence of object removal, outperforming existing methods by 13% in perceptual similarity (LPIPS) and 2dB in peak signal-to-noise ratio (PSNR). Code is publicly available at https://applezyh.github.io/GOR-IS-project-page/

Abstract:
Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the "needle-in-a-haystack" problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/

Abstract:
We introduce a novel method for video Scene Text Segmentation (STS), a task critical for understanding dynamic visual content. Despite the success of foundation models like Segment Anything Model 2 (SAM2) in generic segmentation, their application to video STS is hindered by the reliance on external prompts, limited output resolution, and instability in video sequences. To address these, we present a comprehensive framework based on SAM2. First, we fine-tune the image encoder using LoRA and integrate a self-prompting module, enabling the model to autonomously generate text-specific prompts. Second, we augment the decoder with additional upsampling branches at 512x512 and 1024x1024 resolutions, complementing the original 256x256 output to produce high-fidelity, multi-resolution masks. Third, we enhance the memory mechanism by combining short-term memory with a top-k selection strategy, ensuring temporally consistent and stable segmentation across video frames. A significant obstacle in video STS is data scarcity. To this end, we contribute two datasets: STS-SynthV, containing 1,410 synthetic video clips generated via FlowText, and STS-RealV, comprising 660 meticulously annotated real-world video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple video and image scene text benchmarks. The data and code:https://github.com/insuper-zhang/SAM2Text/.

Abstract:
High-resolution remote sensing imagery exhibits complex spatial regularities where topology, continuity, and region adjacency govern semantic organization. However, existing remote sensing image semantic segmentation (RSISS) networks, being predominantly discriminative, estimate strong posteriors from data while lacking generative priors that encode such structural dependencies. This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-dataset generalization. We address this challenge by reformulating RSISS as posterior inference grounded in generative structural priors, introducing HySeg , a hybrid generative-discriminative segmentation paradigm that learns structure-consistent priors through generative modeling and guides posterior inference for remote sensing segmentation. At its core, the MeanStruct module, a MeanFlow-based generative prior learner, models semantic topology as a continuous stochastic field, while the Prior-to-Affinity Projection (P2A) dynamically transforms this field into topology-aware, class-conditional affinities that guide posterior inference in the Dynamic Affinity-driven Segmentation (DAS) head. Our approach is model-agnostic and seamlessly integrates with diverse backbones, consistently improving structural coherence and generalization. Across four challenging RSISS benchmarks, HySeg achieves state-of-the-art performance and advances remote sensing segmentation from appearance-based perception to structural reasoning. Our code is available at https://github.com/HeryJie/HySeg.

Abstract:
Dataset Distillation (DD) aims to compress large-scale datasets into a small number of condensed Images Per Class (IPC), enabling efficient network training. Previous coreset selection and synthetic-based DD methods achieve reasonable performance. However, our in-depth investigation reveals that existing methods share a common issue: pattern imbalance. Specifically, they either overemphasize class-general patterns representing the majority of each class or focus on fewer marginal patterns critical for model generalization. To address this issue, we propose a novel framework, Balanced Patterns Selection (BPS). Unlike prior methods that assume each class forms a single cluster, BPS models the multiple visual pattern distribution within each class via a hierarchical semantic structure inherent to the dataset. It then selects two complementary subsets in a balanced manner from the center (class-general patterns) and the margins (marginal patterns) of each pattern, producing a pattern-balanced coreset. Theoretically, we prove that the BPS-selected coreset aligns with the original dataset in both information coverage and performance. Moreover, its model-agnostic selection nature ensures cross-architecture generalization, while the Optimize-Once-for-All-IPCs property guarantees efficiency. Extensive experiments on four benchmarks demonstrate that BPS significantly outperforms existing state-of-the-art methods. The source code is available at: https://github.com/BeCarefulOfYournaoke/BPS.

Abstract:
We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug in HyperGaussians into two state-of-the-art methods for face avatars: FlashAvatar and GaussianHeadAvatar. Our evaluation on 29 subjects from 6 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyes, teeth, wrinkles, and specular reflections.

Abstract:
Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a "visual processing bottleneck": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.0% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: https://github.com/YU-deep/VisMem.git.

Abstract:
Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box networks are difficult to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are biological neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at https://github.com/MorvanLi/ND-CNPFuse.

Abstract:
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations.

Abstract:
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-p0, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which attention is directed to key visual cues rather than to all sensory inputs simultaneously. Beyond perception, we introduce an Action MoE that augments the framework by training a router to activate specialized expert modules tailored to distinct driving behaviors. Within the Action MoE, we implement two distinct styles(Token-level Router and Trajectory-level Router) and extensively explore their applicability in autonomous driving. In Bench2Drive closed-loop evaluations, DriveMoE demonstrates robust performance across diverse driving scenarios, alleviates the mode-averaging effect that limits existing models, and achieves state-of-the-art results with significant improvements over Drive-p0. We will release our code and models of DriveMoE and Drive-p0.

Abstract:
Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of 1634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. In addition to the dataset itself, we present a pipeline to generate the dataset, which can be used to generate additional reasoning-intensive retrieval data at scale. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only 46.53% recall@20 averaged across reasoning categories. Our evaluation also shows that generative embedding models with explicit reasoning substantially outperform discriminative approaches, with reasoning-aware training proving more impactful than model scale. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models. Our dataset and code are available at https://github.com/amazon-science/rmir

Abstract:
Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel, allowing clean signal estimation without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise from the camera's image signal processing (ISP) pipeline. While several methods employ downsampling to decorrelate noise, they alter noise statistics and limit the network's ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from demosaicing, where each pixel is reconstructed from neighboring samples with spatially decaying weights, resulting in a diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original resolution. This design excludes correlated pixels while fully leveraging uncorrelated context, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches. Our code is available at https://github.com/parkjun210/TM-BSN.

Abstract:
Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.

Abstract:
Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions---implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to objectively assess model stability under varying views, providing an objective perspective for understanding and advancing robustness in future CVGL research. Codes are available at: https://github.com/Yangchen-nudt/SinGeo.

Abstract:
Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods are proposed to address this issue, the existing registration-based fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors is proposed for infrared and visible image fusion task, termed FusionRegister. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by serving the backbone fusion method as a natural visual prior provider, which guides the registration process to focus only on mismatch regions, thereby avoiding redundant operations. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment and robustness, making it highly suitable for infrared and visible image fusion method. The code will be available at https://github.com/bociic/FusionRegister.

Abstract:
Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by adversaries. Recent research on red-teaming and adversarial attacks against T2I models faces a critical limitation: existing methods struggle to balance prompt stealthiness with image toxicity. Some studies successfully generate highly toxic images but use adversarial prompts that are easily detected by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical safety weaknesses. Code is available at https://github.com/wangdandan567/RT-diffuser.

Abstract:
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches. Project website is available in https://sharelab-sii.github.io/catok-web.

Abstract:
Recent text-to-video generation models have made remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they still struggle to produce socially coherent behavior. Unlike humans, who readily infer intentions, beliefs, emotions, and social norms from brief visual cues, current models often generate literal scenes without capturing the underlying causal and psychological dynamics. To systematically assess this limitation, we introduce the first benchmark for social reasoning in video generation. Grounded in developmental and social psychology, the benchmark covers thirty classic social cognition paradigms spanning seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we build a fully training-free agent-based pipeline that distills the reasoning structure of each paradigm, synthesizes diverse video-ready scenarios, enforces conceptual neutrality and difficulty control through cue-based critique, and evaluates generated videos with a high-capacity VLM judge along five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale evaluation of seven state-of-the-art video generation systems. Results show a clear gap between surface-level plausibility and deeper social reasoning, suggesting that current models remain limited in their ability to generate socially grounded behavior. https://github.com/Gloria2tt/SVBench-Evaluation

Abstract:
Neural radiance fields (NeRF)-based methods with multi-resolution hash encoding enable efficient sparse-view CBCT reconstruction, but real-world projections violate ideal assumptions due to scatter/noise and related inconsistencies. Uniformly fusing hash-grid levels entangles heterogeneous frequency components, yielding spurious high-frequency textures in homogeneous tissues, blurred boundaries, and propagation of projection-induced bias. We propose GH-NAF (Grid-Adaptive Hash-Level-Attended Neural Attenuation Field), which decouples hash levels and adaptively weights them via uncertainty-guided, grid-adaptive hash-level attention. This stabilizes low-frequency modeling in homogeneous regions while selectively preserving high-frequency details near structural boundaries. Experiments on synthetic and real CBCT data show that GH-NAF improves intra-material contrast and reconstruction quality over state-of-the-art methods. The code is available at https://github.com/seongje-oh/GH-NAF

Abstract:
In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively.

Abstract:
Few-shot fine-grained image classification (FSFG) aims to recognize novel fine-grained categories from only a few labeled samples. Existing FSFG methods primarily focus on fine-grained feature extraction and modeling query-support interactions within training episodes containing a small number of classes. Relying on the episodic training strategy, these methods typically assume that the capabilities learned on training samples can directly transfer to evaluation episodes with a few novel classes (few-way). However, in more practical and challenging scenarios involving many novel classes (many-way), existing approaches lack a reliable and global characterization of the feature space, making it difficult for episodic adaptation alone to generalize effectively. In this paper, we pioneer a theoretical analysis of novel class behavior in FSFG and derive a class discriminative index bound. Guided by this analysis, we propose a novel SCEG method that incorporates Self and Collaborative feature extraction as well as Episodic and Global feature space optimization. Extensive experiments demonstrate that our method consistently and significantly outperforms existing methods under both conventional few-way and the new many-way settings. Codes are available at: https://github.com/Legenddddd/SCEG.

Abstract:
This paper introduces UniMERNet, a high-accuracy, computation-efficient algorithm for Mathematical Expression Recognition (MER) across diverse real-world scenarios. To facilitate UniMERNet's training, we constructed UniMER-1M, a million-scale dataset whose unprecedented diversity endows the model with robust generalization ability. Through in-depth analysis, we discover a distinctive raster-scan pattern (left-to-right, top-to-bottom) in the attention distribution of Transformer models for MER tasks, which closely aligns with human reading habits. Based on this key finding, we design an innovative Raster-Scan Attention mechanism that employs a "horizontal-first, vertical-second" sequential attention computation strategy. This approach not only successfully reduces computational complexity from \mathcal O (NH^2W^2D) to \mathcal O (NHWD(H + W)) , but also enables the model to capture long-range dependencies more efficiently, achieving recognition performance comparable to global attention. Leveraging both UniMER-1M and our innovative attention mechanism, UniMERNet achieves state-of-the-art performance across four real-world scenarios while significantly reducing computational resources compared to global attention: over 1.2x memory savings during training, approximately 10x memory reduction during inference, and 5x speed with slightly improved accuracy. All resources will be publicly released to advance MER research further.

Abstract:
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an epsilon-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4xlarger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models. The project homepage is available at https://haozheliu-st.github.io/mos-homepage/

Abstract:
Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. While adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction, we identify that this strategy overlooks a fundamental issue: compressing a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation--the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline: First, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code is available at https://github.com/RemRico/Recall.

Abstract:
This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image. The code is publicly available at https://github.com/zhayanping/MagicFuse.

Abstract:
Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations. \href https://github.com/cplusx/MEMO_Edge_Detection https://github.com/cplusx/MEMO_Edge_Detection

Abstract:
We present MoVieS, a Motion-aware View Synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

Abstract:
We study monocular metric depth estimation (MMDE) without camera intrinsics at training or inference. When focal length and scene depth vary together, depth changes are difficult to perceive in the image, yet the edge-frequency statistics exhibit systematic, scale-correlated shifts. Building on this observation, we introduce a spectral quantile estimator (SQE) that analyzes the Fourier spectrum of a predicted edge map and outputs a single score used as a proxy for metric scale. Consequently, we propose MD2E, a method that models depth-to-edge cues by deriving edge targets from depth annotations, calibrating metric scale using the spectral score, and using edge predictions to regularize depth boundaries while producing metric depth. Across diverse cameras and datasets, MD2E achieves state-of-the-art performance in MMDE in both zero-shot and fine-tuning settings. The project page is available at https://2j472no.github.io/MD2E/.

Abstract:
VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models are released open-source on on https://chenyizi086.github.io/VRSketch2Shape_website/.

Abstract:
High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/

Abstract:
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.5% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms \pi_ 0.5 , a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. We also introduce the adaptive temporal ensemble (ATE) strategy to balance computational efficiency and motion stability during inference, yielding the Mantis-ATE variant, which reduces inference counts by 45% while maintaining performance. We have released the code and model weights to the open-source community, which are available at https://github.com/SJTU-DENG-Lab/Mantis.

Abstract:
Medical image segmentation demands both high accuracy and computational efficiency, yet existing methods face a critical trade-off: CNNs lack global context while transformers incur prohibitive costs for deployment on resource-constrained devices. To address this challenge, we propose a Physics-informed Multi-scale Refinement Network (PMRNet), integrating symplectic geometry, renormalization group theory, and entropy diffusion to guide feature learning. PMRNet features three innovations: (1) a physics-informed encoder with Enhanced Symplectic Convolution for boundary detection and Renormalization Group-inspired Downsampling for information preservation; (2) a Pseudo-Global Receptive Field module achieving near-global context with linear complexity through entropy-driven diffusion; and (3) a boundary-aware decoder for precise delineation. With only 0.87M parameters and 3.43 GFLOPs, PMRNet achieves 87.25% IoU and 92.56% Dice on the challenging Clinic dataset, outperforming state-of-the-art (SOTA) models with even 100x more parameters across 12 medical imaging datasets while maintaining computational efficiency. Code is available at https://github.com/KangBoce/PMRNet.

Abstract:
Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies denoise SAR image procession and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL,Optical to SAR, and SAR to Optical settings, respectively. The source codes are available at https://github.com/yjzhao1019/MOS.

Abstract:
Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all seen classes. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation gating mechanism to model the relational dependencies among task embedding, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Guided by the quantum gating outputs, we perform task-interaction knowledge distillation guided by these task-embedding-level correlation weights from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces. Extensive experiments demonstrate that QKD effectively mitigates forgetting and achieves state-of-the-art performance. Code is available at: https://github.com/Frank-lilinjie/CVPR26-QKD

Abstract:
We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian splats in the fixed UV layout of a template mesh. To this end, we introduce a novel registration-guided attention block, in which each UV map token attends exclusively to image tokens depicting its corresponding mesh region. MATCH outperforms existing methods for novel-view synthesis, geometry registration, and head avatar generation, the latter being 10xfaster than the qualitatively closest baseline. Code and model weights are available under https://malteprinzler.github.io/projects/match

Abstract:
Virtual try-on (VTON) aims to render a target garment onto a person while preserving pose, identity, and fine-grained appearance. Most existing methods rely on supervised paired data, limiting cross-domain generalization, while recent training-free approaches, though more robust, require multiple diffusion calls and complex compositing, making deployment impractical. We propose PG-VTON, a single-pass, training-free framework based on Patch-Guided Reference Alignment. Our key insight is that modern inpainting diffusion models already possess strong in-context completion: given a masked person and a small garment patch, they can synthesize plausible, pose-consistent clothing without task-specific training. PG-VTON exploits this capability with two lightweight components: Patch-Anchored Identity Priming (PIP) injects a localized garment patch only in early denoising steps to anchor garment identity, and Reference-Aware Attention (RAA) strengthens attention from masked-region tokens to garment tokens to enhance detail transfer, all without modifying model weights. With a single diffusion pass, PG-VTON achieves state-of-the-art performance among training-free methods on DressCode and VITON-HD and generalizes effectively to subject insertion. Code is available at \href https://github.com/PKU-ICST-MIPL/PG-VTON_CVPR2026 https://github.com/PKU-ICST-MIPL/PG-VTON_CVPR2026 .

Abstract:
Diffusion models achieve remarkable performance across diverse generative tasks in computer vision, but their high computational cost remains a major barrier to deployment. Model pruning offers a promising way to reduce inference cost and enable lightweight models. However, pruning leads to quality drop due to reduced capacity. A key limitation of existing pruning approaches is that pruned models are finetuned using the same objective as the dense model (denoising score matching). Since the dense model is accessible during finetuning, it warrants a more effective approach for knowledge transfer from the dense to the pruned model. Motivated by this, we propose 2ndMatch(2ndM), a general-purpose finetuning framework that introduces a 2nd-order Jacobian (J^ \top J) Matching loss inspired by Finite-Time Lyapunov Exponents. 2ndM teaches the pruned model to mimic the sensitivity of the dense teacher, i.e., how to respond to small perturbations over time, through scalable random projections. The framework is architecture-agnostic and applies to both U-Net- and Transformer-based diffusion models. Experiments on CIFAR-10, CelebA, LSUN, ImageNet, and MSCOCO demonstrate that 2ndM reduces the performance gap between pruned and dense models, substantially improving output quality.

Abstract:
Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components, selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Our code is available at https://github.com/XunCHN/CPSC.

Abstract:
Vision-Language-Action (VLA) models are promising for embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. To overcome these limitations, we introduce CoWVLA, a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. Project website: https://fx-hit.github.io/cowvla-io.

Abstract:
Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To our best knowledge, our EmoDiffTalk is one of the first few 3D Gaussian Splatting talking-head generation framework, especially supporting continuous, multimodal emotional editing within the AU-based expression space.Please visit our website for more details and information:https://liuchang883.github.io/EmoDiffTalk/

Abstract:
Brain visual decoding aims to recognize and reconstruct perceptual visual content from brain activity, providing a promising potential for the development of brain-computer interfaces and brain-inspired intelligence. However, this task faces a fundamental challenge of information asymmetry: while natural images contain complex visual scenes with objects and backgrounds, the corresponding brain signals reflect focused attention on central objects while being contaminated by various neural noise. Previous methods that directly align visual and brain representations often overlook this inherent asymmetry, resulting in suboptimal decoding performance. To address this, we propose linguistic-prior-guided visual decoupling method, which introduces object-oriented textual descriptions as semantic guidance to explicitly decouple foreground objects from complex backgrounds in natural images. This design enables the model to automatically focus on task-relevant visual concepts while effectively filtering out irrelevant neural noise in brain signals, achieving a transition from asymmetric vision-brain alignment to semantic symmetric alignment. Extensive experiments on the THINGS-EEG and THINGS-MEG datasets demonstrate that our method achieves new state-of-the-art performance in the zero-shot brain-to-image retrieval task. The source code is available at https://github.com/TKQXX/BVSA.

Abstract:
Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: \href https://vita-epfl.github.io/MAD-World-Model/ https://vita-epfl.github.io/MAD-World-Model/

Abstract:
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To fill this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for inter-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code and dataset will be publicly available.

Abstract:
We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method allows to achieve high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at https://github.com/GrumpySloths/SGS_Intrinsic.github.io.

Abstract:
Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We further strengthen physical validity by incorporating a physics-aware regularization term into the velocity-based update step. To support such physics-guided updates, we build a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space. The experiments on three benchmarks show that this synergistic formulation improves shape realism beyond mere visual plausibility. Our code and model weights are available at https://github.com/kasvii/PhysGen.

Abstract:
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V2Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V2Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.

Abstract:
The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, delivering limits on the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.

Abstract:
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational cost issues due to the input of large number of visual tokens, motivating token pruning to improve inference efficiency.The key challenge lies in identifying which tokens are truly important.Most existing approaches rely on attention- or similarity-based criteria to estimate token importance.However, they inherently suffer from certain limitations, such as being task-agnostic and exhibiting positional bias.In this work, we explore a new perspective on token importance assignment based on token transitions in LVLMs, where token transitions are defined as the changes in token representations occurring as they propagate through the model's modules.We observe that the transition of token representations provides a meaningful signal of semantic information.Based on this insight, we propose TransPrune, a training-free and efficient token pruning method.Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations; as well as Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to visual tokens via attention.Extensive experiments on various LVLM architectures, such as LLaVA-v1.5, LLaVA-Next and Qwen2.5-VL, demonstrate that TransPrune maintains comparable multimodal performance while reducing inference TFLOPs by more than half. The code is available at https://github.com/liaolea/TransPrune.

Abstract:
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the widely adopted multi-modal dataset, MM-Fi, demonstrating the superiority of our approach in enhancing 3D pose estimation under complex conditions. Our source code is available at https://github.com/MICLAB-BUPT/AWC.

Abstract:
Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini-2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

Abstract:
Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we propose BIDPO, a framework to enhance T2I model's capability of compositional text-to-image generation. We begin by introducing an carefully designed pipeline to construct a large-scale preference dataset, BICOMP, with strictly quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to greatly effective in improving the models to follow complex text prompt in generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BIDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques. Code is available at https://github.com/anzeameol/BiDPO.

Abstract:
Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.

Abstract:
Although foundation models (e.g. SAM) perform remarkably in medical image segmentation, its high computational complexity limits deployment. Knowledge distillation (KD) allows lightweight models to inherit the representational capabilities of large models, thereby mitigating this issue. Existing KD methods enhance student performance, but due to teacher-student different feature advantages, they neglect to internalize and integrate student's semantic information adaptively after knowledge transfer, causing poor knowledge assimilation and limiting gains and generalization. To address this limitation, we propose a novel medical image segmentation framework, which is infusion to assimilation distillation (IAD). In Knowledge Infusion Stage (KIS), to semantically align teacher-student prediction distributions, soft-label distillation is combined with class-weighted prototype alignment strategy. In Knowledge Assimilation Stage (KAS), to promote adaptively semantic assimilation, a contrastive semantic self-optimization strategy refines student predictions through positive and negative sample pairs and imposes reverse constraints on encoder features to enhance semantic consistency. IAD achieves DICE gains of 4.32% on Synapse, 1.85% on ACDC, and 2.42% on Polyp datasets, and delivers an average 4.16% generalization gain on ISIC2018, PH2, BUSI, and STU datasets, outperforming mainstream KD methods. Code is available at https://github.com/hjklearn/IAD.

Abstract:
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConvSeg-Net trained on our data engine achieves significant gains on ConvSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg

Abstract:
This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly.

Abstract:
Diffusion models have shown impressive performance in generative tasks, demonstrating their ability to capture detailed structural and semantic information. Recently, these capabilities have been extended to visual understanding, with studies employing diffusion models as the backbone for various perception tasks. However, existing diffusion-based perception models are generally restricted to a single task or a fixed set of predefined tasks, lacking an efficient mechanism to generalize to novel tasks. To overcome this limitation, we propose a unified DiT-based perception framework called UniPercept, which introduces a novel foundation-adapter paradigm for general visual perception. In this framework, a shared diffusion-based foundation model is trained to capture common and generalizable visual knowledge across diverse perception tasks, with task-specific adapters integrated for each individual task. Leveraging its superior generalization ability, the foundation model can be efficiently adapted to novel domains through lightweight adapters, requiring as few as 1,000 training samples and less than 1% of trainable parameters. Furthermore, UniPercept demonstrates strong performance across various perception tasks, outperforming state-of-the-art generalist models in most cases and achieving accuracy comparable to specialist models. Project page: https://VIPL-GENUN.github.io/Project-UniPercept.

Abstract:
Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders' effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components.Bridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. The project page is available at: https://mingbohong.github.io/Bridge/.

Abstract:
Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, containing six subtasks with different characteristics (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. Code will be publicly available at https://github.com/Wangzt1121/DiffDecompose.

Abstract:
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. RehearseVLA consists of two key components: (1) a physically-consistent world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that RehearseVLA effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/iSEE-Laboratory/RehearseVLA.

Abstract:
While open-sourcing instruction-guided image editing models accelerates research, it surrenders control over their capabilities to anyone who downloads the weights. Existing protection methods are reactive: they verify ownership after generation, but the underlying model remains fully functional for unauthorized users. We introduce Visilock, where access control is baked into model weights, rendering the model unusable without a visual trigger in the input. The challenge is training a model that retains editing capability for authorized input and remains unusable for unauthorized input, without destabilizing training. Naive multi-task objectives create gradient conflicts that collapse training, while contrastive approaches like FMLock destroy the denoising manifold. We develop Dual Score Distillation, a dual-teacher framework where a degraded teacher defines locked behavior and an original teacher guides editing quality, eliminating gradient interference through separate frozen targets. A key risk is that released models could be unlocked through post-hoc fine-tuning. To prevent this, we initialize the student model from the degraded teacher so that it begins in a locked state, and only regains editing ability for authorized inputs via distillation. This impedes adversarial fine-tuning from recovering full editing capability. Evaluation on InstructPix2Pix shows authorized edits maintain baseline quality (CLIP-I: 0.821, DINO: 0.726) while unauthorized attempts degrade substantially (CLIP-I: 0.481, DINO: 0.072) with 41% and 90% drops in image and semantic similarity. The lock remains robust to key corruptions, spatial perturbations, and adversarial unlock fine-tuning. Code will be available at https://github.com/Luvata/VisiLock.

Abstract:
Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance.To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces image-level adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference.Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP and 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Github: https://github.com/Tencent/YOLO-Master/

Abstract:
Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in terms of precision and granularity, particularly when dealing with complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption. Our approach, named Cap-Workflow, explores object low-level and fine-grained attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object-interaction (HOI)), and combine the attributes into the descriptive caption. By systematically integrating these rich attributes into the generated captions, Cap-Workflow significantly improves the descriptive quality of the captions, providing a deeper and more nuanced understanding of the visual content. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding. The source code and datasets are available at https://github.com/syp2ysy/Cap-Workflow.

Abstract:
Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.

Abstract:
Change detection (CD) is a fundamental task for monitoring and analysing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalisation and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialised CD functionalities. We introduce three special tokens: [T1], [T2], and [CHANGE], utilising their embeddings as the key to query variations. This approach successfully accommodates both BCD and SCD tasks. Furthermore, UniChange utilises text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/NKU-HLT/UniChange.

Abstract:
We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

Abstract:
As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility. Our code is available at https://github.com/Polly-LYP/RemedyGS.

Abstract:
Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework which controls the steering intensity according to the hallucination relevance of each layer. We first construct a dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that LTS-FS effectively mitigates hallucination while preserving strong performance. Codes are available at https://github.com/huttersadan/LTS-FS.

Abstract:
Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65 times faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.

Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code will be available on GitHub.

Abstract:
Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naive concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that directly infers both Gaussian and physical parameters. We further refine the model with Direct Preference Optimization (DPO), aligning simulations with the physically plausible reference videos and avoiding the high-cost SDS optimization. To address the absence of a supporting dataset for this task, we propose PhysAssets, a dataset of 50K+ 3D assets annotated with physical properties and corresponding reference videos. Experiments show that PhysGM produces high-fidelity 4D simulations from a single image in one minute, achieving a significant speedup over prior work while delivering realistic renderings.

Abstract:
Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR

Abstract:
Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset of helpers that maximally reduce ego-centric uncertainty in occluded regions, addressing computation and redundancy challenges. Cross-Vehicle Attention (CVA) and Semantic-aware Noise Filter (SNF) then perform pose-tolerant alignment and artifact suppression before BEV-level fusion, addressing the noise challenge. This targeted pipeline yields more complete and topologically faithful maps with substantially fewer views than indiscriminate aggregation. On nuScenes and Argoverse2, OptiMVMap improves MapTRv2 by +10.5 mAP and +9.3 mAP, respectively, and surpasses memory-augmented baselines MVMap and HRMapNet by +6.2 mAP and +3.8 mAP on nuScenes. These results demonstrate that uncertainty-guided selection of helper vehicles is essential for efficient and accurate multi-vehicle vectorized mapping. The code is released at https://github.com/DanZeDong/OptiMVMap.

Abstract:
Since Transformers are introduced into vision architectures, their quadratic complexity has always been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens, performing self-attention calculations within each group. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? Therefore, we propose the random grouping strategy, which involves a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping almost outperforms all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.

Abstract:
Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone's parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.

Abstract:
Fully decentralized deep learning removes global servers and ensures local data privacy. However, Euclidean consensus, averaging weights, gradients or momentum, may degrade under the situations with non-i.i.d. (non-independent and identically distributed) data and client size imbalance. We propose a geometry-aware approach based on natural gradient variational inference. Clients communicate in the expectation parameter space of an exponential family, where simple linear mixing yields a forward Kullback--Leibler (KL) barycenter consensus. The aggregate is the model closest to all client distributions, aligning updates across heterogeneous sites and mitigating distribution shift. We further provide a lightweight decentralized Adam implementation where each client maintains a diagonal Gaussian posterior and operates entirely in the expectation space for both optimization and consensus. We prove convergence for convex losses on connected graphs. On CIFAR-100 and a medical image segmentation benchmark, our method substantially outperforms Euclidean-space consensus baselines under severe non-i.i.d. and client-imbalance cases, achieving around 20% accuracy gain on CIFAR-100, while matching the communication budget and improving training stability. The code is available at https://github.com/x-lu/Beyond-euclidean-gossip.

Abstract:
Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. AD-GBC aggregates local features into region-level semantics and redistributes the refined representation back to pixels in a fully differentiable manner, enabling geometry-aware refinement within modern UNet-style architectures. Two geometric regularizers, a Wasserstein-based diversity loss and a scale consistency loss, mitigate center collapse and encourage stable, well-formed region geometry.AD-GBC yields consistent improvements across four widely used medical segmentation benchmarks (BUSI, GlaS, CVC-ClinicDB, ISIC17) when integrated into two strong backbones (Rolling-UNet and U-KAN), demonstrating that the proposed geometric region formulation generalizes well across different imaging conditions. The code is available at https://github.com/SiaShen-dot/AD-GBC.

Abstract:
Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: https://github.com/arturxe2/AdaSpot

Abstract:
Understanding 3D human-object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic-geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human-object contact regions. To further enhance realism and physical plausibility when generating the outputs of reconstruction and generation, we modify and adapt the Gravity-Field Based Diffusion Bridge to refine fine-grained contact geometry and ensure smooth, physically consistent human-object engagement. Extensive experiments demonstrate that our unified framework significantly improves both reconstruction accuracy and generative interaction quality, establishing a cohesive and interpretable paradigm for 3D HOI understanding. The code will be released at https://github.com/xumiao66/ReGenHOI.

Abstract:
The integration of vision and language in Vision-Language Models (VLMs), while enabling multimodal capabilities, inherently expands their attack surface. Among existing white-box jailbreak methods, suffix-optimization-based approaches often rely on gradient approximations over discrete token spaces, yielding insufficient guidance and causing optimization to stagnate in local optima, while image-perturbation-based ones frequently exhibit poor cross-model transferability. In this work, we introduce DGSIP, a Dissonance-Guided Suffix Optimization and Image-Phrase Injection framework. DGSIP leverages predictive dissonance between the target model and an unaligned model to identify tokens suppressed by safety alignment, using them as a more effective signal than gradient-based cues for suffix optimization. It further reinforces the attack by jointly optimizing the content and presentation of phrase embedded in images to leverage VLMs' cross-modal sensitivity. Our extensive experiments demonstrate that DGSIP outperforms prior baselines across multiple safety benchmarks and a range of open-source VLMs (e.g., MiniGPT-4, InstructBlip and LLaVA). Notably, compared to baselines, our method exhibits much stronger transferability to commercial black-box VLMs, such as GPT-4o-Mini, Gemini 2.0 Flash and Qwen 2.5-VL. Based upon DGSIP, we empirically reveal critical vulnerabilities in the safeguard mechanisms of current VLMs, highlighting the need for more robust defense strategies. The implementation is available on https://github.com/Trusted-LLM/DGSIP.

Abstract:
Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query-model selection in FAL and uncover a central insight: the model that achieves more class-balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. FairFAL (1) infers global imbalance and local-global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) applies a two-stage uncertainty-diversity balanced sampling strategy with k-center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings. The code is available at https://github.com/chenchenzong/FairFAL.

Abstract:
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation.

Abstract:
Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. The codes are available at https://github.com/dqj5182/FECO_RELEASE.

Abstract:
Existing methods following the integrated hard-regression or decoupling optimization paradigms exhibit limited fusion performance under complex degradations. To address these paradigm-level shortcomings, we propose ReCoFuse, an ultra-robust image fusion framework based on restorative multi-modal diffusion reciprocal coupling. ReCoFuse redefines the relationship between information restoration and integration, deriving a novel reciprocal coupling optimization paradigm through their mutual reinforcement. It first constructs two restoration branches using diffusion modules (DiM) to capture modality-specific restoration priors. Then, time-aware cross-modal integration modules (TIM) are introduced as a bridge to couple restoration and integration, embedded at each DiM sampling timestep to aggregate multi-modal information. The aggregated variable not only feeds back to each restoration branch to enhance degradation removal via cross-modal complementarity, but also generates high-quality fused images that comprehensively represent the scene. Moreover, an alternating regularization mechanism is designed to iteratively optimize DiM and TIM along the gradient path, ensuring effective collaboration between restoration and integration. Extensive experiments show that ReCoFuse achieves state-of-the-art performance under challenging degradations such as low light, haze, noise, low contrast, and stripes. The code is publicly available at https://github.com/HaoZhang1018/ReCoFuse.

Abstract:
The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based methods. The information exchanges between the 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: https://github.com/Tao-11-chen/ForeHOI.

Abstract:
Intraoperative bleeding in laparoscopic surgery causes rapid obscuration of the operative field to hinder the surgical process and increases the risk of postoperative complications. Intelligent detection of bleeding areas can quantify the blood loss to assist decision-making, while locating bleeding points helps surgeons quickly identify the source of bleeding and achieve hemostasis in time to improve surgical success rates. To fill the benchmark gap, we first construct a real-world laparoscopic surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, enabling simultaneous detection of bleeding regions and points in laparoscopic surgery. The baseline embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2. The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures point motion direction via inter-frame optical flow. By coupled bidirectional guidance, our framework explores spatial-temporal correlations while exploiting memory modeling to infer current bleeding status. Extensive experiments indicate that our method outperforms 13 counterparts in bleeding detection. Code and data are available at https://github.com/PJLallen/SurgBlood.

Abstract:
Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by significant 3.3 percentage points. Code: \href https://maticfuc.github.io/anomaly_vfm/ Project Page

Abstract:
The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models. The code and data are available at: https://github.com/ChunXiaostudy/TreeTeaming. Warning: This paper contains examples of harmful texts and images, and reader discretion is recommended.

Abstract:
Reconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing inter-penetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction, and ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.

Abstract:
Talking face generation has gained significant attention as a core application of generative models.To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role.However, existing approaches often limit expressive flexibility and struggle to generate extended emotions.Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions.Audio-based methods can leverage emotionally rich speech signals--and even benefit from expressive text-to-speech (TTS) synthesis--but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speeches.Images-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm).To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces.C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities.Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos--even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/.

Abstract:
We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving task where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning. Our code and data will be publicly released at https://github.com/RUCAIBox/Revisiting-Visual-CoT.

Abstract:
Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations.Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset.Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

Abstract:
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/.

Abstract:
Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: https://brian90709.github.io/diff-reflection-separation/

Abstract:
Interactive Text-to-Image Retrieval (I-TIR) aims to refine image retrieval results through natural language dialogues, which allows users to progressively supplement or correct their search intention across multiple rounds, enabling a more precise and user-aligned visual search experience.However, existing methods perform cross-modal retrieval within a fixed multimodal feature space, mapping all dialogue text and images onto the same static embedding manifold.Such static formulation easily causes semantic vagueness, making it difficult to capture subtle embedding shifts in the user's updated intention for fine-grained retrieval.To address this limitation, we propose Context-Aware Latent Space Transformation (CAST), a lightweight framework that dynamically transforms the common latent space of both textual and visual representations according to the specific evolving user's search intention, enabling fine-grained and adaptive semantic alignment.The core of CAST is the Context-Aware Space Regulator (CASR), a crucial space transformation module composed of two key components: (1) the Context-Aware Low-Rank Projector (CLP), which learns to predict the projection direction of embedding space based on the intent's context;and (2) the Context-Guided Modulator (CGM), which adaptively determines appropriate projection strength. CASR is highly lightweight, adding negligible parameters and computational overhead, and can be seamlessly integrated into diverse I-TIR frameworks.Extensive experiments demonstrate the effectiveness of our proposed framework, indicating that it can serve as a general, plug-and-play solution for efficient and scalable interactive text-to-image retrieval. Our source code is available at https://github.com/HuiGuanLab/CAST.

Abstract:
3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging Tensor Cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques. The source code is available at https://brent-zoomers.github.io/nvgs/

Abstract:
Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph that enables refining both cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code is released on https://github.com/anpei96/hg-i2p-demo.

Abstract:
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,128 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project page: https://x1nyuzhou.github.io/vifailback.github.io/

Abstract:
We present PAMotion, a physics-aware diffusion framework for generating realistic full-body human interactions with multiple objects.Existing diffusion-based methods that jointly synthesize human and object motions often struggle to capture the intricate physical interactions--especially those involving complex hand-object contacts. To address this issue, in this paper, we begin with our key observation: in everyday, slow-motion scenarios, object accelerations inherently reveal the underlying physical interactions.If an object's acceleration aligns with gravity, it is likely in free motion with no physical contact from human or other objects; otherwise, it must be in contact--directly or indirectly--with the human body. Building on this intuition, PAMotion jointly models full-body human motion, object motion, and their corresponding accelerations, enforcing physical plausibility through a physics-aware interaction loss.In this loss, we softly penalizes violations of consistency between object acceleration and human-object contact states. PAMotion follows a coarse-to-fine pipline: we first synthesize global torso and object translations, then conditionally refine hand motions and object rotations, achieving both high-level motion-text consistency and low-level physical fidelity. Experiments on two challenging datasets HIMO and ParaHome demonstrate that PAMotion achieves state-of-the-art performance in generating realistic, physically consistent full-body manipulation sequences involving multiple objects. Our code is released at https://github.com/liyuheng520/PAMotion.

Abstract:
Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS 2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M Params, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.

Abstract:
We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from core localization thread, ensuring both low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion.Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running over 25 FPS on NVIDIA Jetson Orin platform. Our code and dataset are available at: https://github.com/Choyaa/PiLoT.

Abstract:
A fundamental role in decoding human motor intent and enabling intuitive human-computer interaction is played by electromyography (EMG). However, its generalization capability across subjects, devices, and tasks remains substantially limited by data heterogeneity, label scarcity, and the lack of a unified representational framework. To bridge this gap, we propose Any Electromyography (AEMG), the first large-scale, self-supervised representation learning framework for EMG. AEMG reconceptualizes neuromuscular dynamics linguistically, utilizing a novel Neuromuscular Contraction Tokenizer (NCT) to translate discrete muscle contractions into structural words and temporal activation patterns into coherent sentences. Furthermore, we compile the largest cross-device EMG signal vocabulary to date, enabling seamless transfer across arbitrary channel topologies and sampling rates. Experiments demonstrate that AEMG improves the zero-shot leave-one-subject-out (LOSO) accuracy by 5.79--9.25% compared to six state-of-the-art baselines, and achieves more than 90% few-shot adaptation performance with only 5% of target user data. Our work has proposed the concept of EMG signals as a cross-device physiological language, learned their grammar from massive amounts of data, and laid the groundwork for a single-training, universally applicable EMG foundation model. Our code, demos, and more resources are available at https://github.com/AEMG-series/AEMG.

Abstract:
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.

Abstract:
Multimodal image synthesis has made significant progress, yet most editing methods still rely on textual instructions, which are less direct than visual guidance. Recently, a new paradigm edits one image using another as reference, enabling more intuitive manipulation through visual exemplars. We formalize this setting as cross-image editing, where a source image is modified under one or more visual references. We propose OrionEdit, a unified framework that regulates editing via symmetric orthogonal subspace disentanglement and reverse-causal attention with information-flow masks enforcing unidirectional latent dependencies. Built on standard diffusion backbones, OrionEdit enables zero-shot multi-reference editing and outperforms open-source baselines, approaching proprietary models in fidelity and disentanglement. The model is available at https://github.com/cityuhkai/OrionEdit.

Abstract:
Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.

Abstract:
Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP)--extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models--to guide the restoration process toward perceptually optimal outputs explicitly. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms:(1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality. This design provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; and(2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available at https://github.com/fengyang1399-pixel/IQPIR.

Abstract:
To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model's ability to capture temporal patterns. Beyond global and regional forecasting, STCast is evaluated on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over other methods across all four tasks. Code: https://github.com/chenhao-zju/STCast

Abstract:
Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Secondly, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be released in https://github.com/adept-thu/GuideFlow.

Abstract:
Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose a fine-grained deformation perception model that focuses on Dual Dimensions of document horizontal-vertical-lines to improve document Dewarping called D2Dewarp. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinate is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and automatic rendering engine to build a new large-scale distortion training dataset named DocDewarpHV. The code and dataset will be publicly released. On three public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods. The code and dataset are available at https://github.com/xiaomore/D2Dewarp.

Abstract:
Federated Graph Learning (FGL) has emerged as a principled framework for decentralized training of Graph Neural Networks (GNNs) while preserving data privacy. In subgraph-FL scenarios, however, structural noise arising from data collection and storage can damage the GNN message-passing scheme of clients, leading to conflicts in collaboration. Existing approaches exhibit two critical limitations: 1) Globally, they fail to identify corrupted clients, causing destructive knowledge inconsistencies. 2) Locally, the global GNN performs poorly on these clients due to structural noise, limiting their ability to benefit from federated collaboration. To address these challenges, we propose FedSDR, a spectra-based FGL framework against high-structural-noise scenarios. Specifically, Structural Noise-Aware Aggregation (SNAA) introduces a structural fidelity evaluation metric to detect corrupted clients and reduce their contributions, thereby mitigating the impact of noise on the global GNN. Furthermore, Robust Local Structure Reconstruction (RLSR) leverages the knowledge from the healthy global model to repair locally corrupted graph structures. Extensive experiments demonstrate that FedSDR outperforms state-of-the-art methods across various scenarios under structural noise. The code is available at https://github.com/Subtleazure/FedSDR.

Abstract:
We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD.

Abstract:
Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at https://github.com/yingkai-zhang/UAFL.

Abstract:
We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360deg scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360deg gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.

Abstract:
Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system to enable further research on procedural stereo datasets.

Abstract:
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation--any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T << hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.

Abstract:
The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel regularizers or architectural optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+-0.5 dB PSNR, +-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations are publicly available at https://seemandhar.github.io/NERFIFY/.

Abstract:
Neural video compression (NVC) has achieved significant progress in recent years. The state-of-the-art (SOTA) NVC schemes, exemplified by the Deep Conditional Video Coding series, have focused on pursuing higher fidelity (e.g., PSNR), but lack sufficient exploitation of deep networks' advantages for better perceptual quality. We fill in this gap with two new techniques. First, we propose a color-separation-based framework, termed PNVC-C, which decouples luminance and chrominance processing to better align with human visual perception. This framework enables explicit and adaptive allocation of computation and bitrate budgets between luminance and chrominance components. Second, within this framework, we introduce the perceptual optimization scheme Rc-GAN, which leverages a bitrate-based rank chain loss to link variable-rate coding with perceptual quality ranking, enforcing consistent quality ordering and improving perceptual fidelity. Built upon these designs, we establish the PNVC-C framework with two variants: PNVC-C-Base, optimized for objective fidelity, and PNVC-CR, a perceptual variant that applies the Rc-GAN. Experimental results demonstrate that PNVC-C-Base achieves SOTA objective performance in YUV PSNR, while PNVC-CR attains SOTA perceptual quality on LPIPS, DISTS, KID, and FID metrics. The source code is released at https://github.com/lxz-nan/PNVC-CR.

Abstract:
Zero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-shot 3D Visual Grounding with generated language conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. This is a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the introduced Open-Vocabulary Multi-Source Spatial Annotation and Reasoning Chain Generator processes RGB-D images or 3D-projected 2D images from open-world scenes to generate spatial pseudo-labels and reasoning chains for training. Then, we propose Reasoning Chain Distillation, which transfers reasoning knowledge extracted by a large teacher network to a lightweight student network. To represent both global and local geometric relationships, the Geometry-Aware Spatial Modeling (GeoSM) module is introduced to align textual reasoning with 3D spatial structures. Experiments show that UZ3DVG achieves SOTA zero-shot performance on ScanRefer and NR3D, with inference speeds up to 7.69 \mathrm FPS , approximately 38 times faster than SOTA methods. The code is available at https://github.com/tanwb/UZ3DVG.

Abstract:
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores.

Abstract:
Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called A2GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness. Our code is available at https://github.com/CV4RA/A2GC.

Abstract:
Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.

Abstract:
In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedbackdriven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on longterm video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be available at https://github.com/FujitsuResearch/QViC-MF/.

Abstract:
Coded Aperture Snapshot Spectral Imaging (CASSI) is an emerging hyperspectral image (HSI) acquisition technique for downstream semantic segmentation. Due to the ill-posedness nature of CASSI systems, typical solutions are compelled to conduct a two-stage reconstruction-then-segmentation pipeline, namely viewing them as two separate tasks. However, we observe that such two tasks are interrelated and mutually reinforcing for representation learning, and thus separating them limits the overall accuracy and efficiency. To this end, we propose the first Cooperative Reconstruction-Segmentation Deep Unfolding Network (CRSDUN) to solve the reconstruction and segmentation tasks in parallel. To make the two mutually reinforcing, we introduce the Cross-Aggregated Super-Token Attention (CASTA) mechanism to enhance the representation interactions between HSI reconstruction and semantic segmentation. Extensive experiments on both synthetic and real-world HSI reconstruction-segmentation datasets demonstrate that our method achieves state-of-the-art in both spectral reconstruction and semantic segmentation. The code is available at \href https://github.com/zjhe02/CRSDUN https://github.com/zjhe02/CRSDUN .

Abstract:
Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain's energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the L_infinity constraint--Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlighting the need for more dependable adversarial training methods. The code is released at https://github.com/craree/ASSG-SNNs-Robustness-Evaluation

Abstract:
Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability.

Abstract:
Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds reconstructed from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves better performance than previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning. Our source code is available at https://github.com/ryosuke-yamada/lam3c.

Abstract:
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. The codes are available at https://github.com/ihuixiang/UIIC.

Abstract:
Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. While reducing the tunable parameters hinders their intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide a theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.

Abstract:
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available at https://github.com/Jimmyxichen/MM-OVSeg.

Abstract:
Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model's output to match a temperature-scaled version of the pre-trained model's predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness. Code is available at https://github.com/YuboCui/AGFT.

Abstract:
Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in R3, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at https://github.com/changehome717/ReManNet.

Abstract:
Deep neural networks are highly vulnerable to adversarial examples, i.e.,small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose \METHODNAME , a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that \METHODNAME consistently enhances adversarial robustness and class-balance on long-tailed datasets. The code is available at \href https://github.com/zhang-lilin/RobustLT https://github.com/zhang-lilin/RobustLT .

Abstract:
We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods. Project page: https://murphylmf.github.io/UniSH/.

Abstract:
As generative models rapidly evolve, the realism of AI-generated videos has reached new levels, posing significant challenges for detecting the authenticity of videos. Existing deepfake detection techniques generally rely on training datasets with limited generation methods and content diversity, which limits their generalization ability on more realistic content, particularly that produced by the latest generative models. Recently, large multimodal models (LMMs) have demonstrated remarkable zero-shot performance across a variety of vision tasks. Yet, their ability to discern deepfake videos remains largely untested. To this end, we propose FVBench, a comprehensive deep\underline f ake \underline v ideo \underline bench mark designed to advance video deepfake detection. It includes: (i) extensive content diversity, with over 120K videos covering real, AI-edited, and fully AI-generated categories, (ii) comprehensive model coverage, with fake videos generated and edited by 42 of the state-of-the-art video synthesis and editing models, and (iii) deepfake video detection benchmark for LMMs, which is a comprehensive benchmark for exploring the deepfake video detection capabilities of LMMs. The dataset and code are released at https://github.com/IntMeGroup/FVBench.

Abstract:
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts--clicks or boxes--to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object, aggregated across views.Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both slow optimization-based pipelines and fast but coarse feedforward approaches. Our results highlight a new paradigm: aligning the paradigm of 3D segmentation with SAM2, leveraging interactive 2D inputs to unlock controllability and precision in object-level part understanding.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success across diverse applications, from autonomous driving to document understanding. As these models are deployed in safety-critical contexts, understanding their adversarial robustness becomes crucial. However, current evaluations focus primarily on simple tasks like coarse-grained classification, and employ inconsistent evaluation protocols, hindering rigorous comparison of attack methods. We introduce AdvRobustBench, a comprehensive adversarial robustness benchmark for MLLMs comprising 1,000 examples across visual question answering (VQA) and optical character recognition (OCR) tasks, drawn from widely-used MLLM benchmarks (MMBench, MMStar, OCRBench-v2). We further propose Omni-Attack, a novel transfer-based black-box attack method that addresses key challenges in attacking open-ended question-answering systems. Our approach introduces (i) a target-construction pipeline that generates question-conditioned textual and visual targets to provide stronger optimization signals, and (ii) a location-aware attack strategy for OCR that enables spatially-precise perturbations. Extensive experiments demonstrate that Omni-Attack achieves strong targeted attack success rates (up to 71.8% on GPT-4.1 at \varepsilon=8/255) across both proprietary models (GPT-4.1, Claude 3.7, Gemini 2.0) and open-source MLLMs, revealing significant vulnerabilities in current multimodal systems. Our benchmark and findings establish a foundation for developing more robust MLLMs. Codes are available at https://github.com/hukkai/transferable_mllm_attack

Abstract:
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only about 1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at: https://github.com/LINs-lab/IOMM

Abstract:
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/

Abstract:
Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.

Abstract:
With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos, establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generality. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.

Abstract:
State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen

Abstract:
Underwater images commonly suffer from foreground-background ambiguity, loss of structural details, and severely reduced contrast, which collectively make underwater object detection (UOD) an inherently challenging task. To handle this issue, we present a residual-guided hierarchical calibration network (RHCNet) designed to achieve more efficient and robust UOD, which comprises a residual-guided feature enhancement module (RGFE) and a hierarchical feature calibration pyramid module (HFCP). Concretely, RHCNet extends the standard ResNet-50 backbone by embedding the RGFE, which effectively strengthens the representation of edge and texture features in blurry regions by jointly leveraging convolutional operations and attention mechanisms to achieve more discriminative feature extraction for UOD. Subsequently, the HFCP integrates a bottom-up semantic enhancement path and a top-down fine-grained feature compensation path, while a K-means clustering-guided feature calibration module is jointly employed to ensure multi-level cross-scale semantic consistency and accurate alignment of salient region features. Extensive experiments on the DUO and UTDAC benchmark datasets demonstrated that our RHCNet attains the highest AP scores of 70.53% and 53.35%, respectively. Besides, our RHCNet also maintains excellent detection accuracy and strong generalization capability on the COCO dataset for terrestrial scenarios. The code is available at https://github.com/YitengGuo/RHCNet.

Abstract:
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE, a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. We will release CURVE to foster the development of more equitable and capable multimodal foundation models.

Abstract:
Infrared-visible (IR-VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode-transfer-fuse-reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary-coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference-fusion to tackle missing-IR fusion. The source code is publicly available at https://github.com/harukiv/DCMIF.

Abstract:
With the great advancement of video generation models, a growing number of content creators and researchers are leveraging these technologies to produce large volumes of human-centric videos for content creation and customized data generation for specific tasks. Although existing video generation models are capable of producing videos with high visual quality, their inadequate understanding of video realism results in generating unrealistic videos. While various evaluators have emerged to assess the quality of generated videos, they are trained from low-quality generated videos and data annotations, leading to misaligned ratings with human preferences. They also lack interpretability due to the absence of chain-of-thought reasoning. To address these issues, we propose VideoRealBench, a comprehensive benchmark for evaluating the realism of generated human-centric videos. We leverage a rating scale designed from human preferences to score videos and provide three-step rationales, thereby creating a finely-annotated dataset VideoRealDataset and proposing an evaluator VideoRealEval capable of providing reliable scores along with detailed rationales. VideoRealEval achieves a Pearson's linear correlation coefficient (PLCC) of 57.07% and a Spearman's rank correlation coefficient (SROCC) of 56.78% on VideoRealDataset, demonstrating closer alignment with human preferences than existing evaluators. Our code and dataset are available at https://github.com/MCG-NJU/VideoRealBench.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.

Abstract:
Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5X while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.

Abstract:
With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, a unified and efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to pervious methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.

Abstract:
Recent cross-modal hashing methods have introduced sample generation strategies to enrich training signals. Despite these advances, sample generation-driven hashing still faces two major challenges: (1) Interpolation-based methods adopt deterministic and class-independent generation that restricts synthetic samples to a small region around the original data. Consequently, intra-class diversity is limited, which weakens the model's ability to learn discriminative binary codes. (2) Generation network-based methods, which leverage a complex generative model to produce synthetic samples, leading to extra model complexity. To address these issues, we propose a novel Intra-class Distribution-guided Generative Hashing (IDGH) that adaptively generates synthetic samples directly from estimated intra-class distributions. Specifically, we suggest an Intra-class Distribution Estimation (IDE) scheme to model the characteristic distribution of each class, providing essential support for adaptive sample generation. Meanwhile, by utilizing the distribution information from neighboring classes, we design a Neighbor-guided Distribution Refinement (NDR) mechanism to correct flawed estimations for classes. With refined intra-class distributions, we propose a Distribution-aware Adaptive Generation (DAG) strategy that synthesizes informative training samples by shifting features along diverse directions guided by intra-class distribution patterns. The proposed approach is plug-and-play and can be seamlessly integrated into various objective functions, providing semantically diverse training samples, thus enhancing similarity learning. Extensive experiments on benchmark datasets demonstrate that IDGH outperforms existing methods. Code is available at https://github.com/QinLab-WFU/IDGH.

Abstract:
Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6--8% higher correlation and 12--15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints. Our code is available at https://github.com/ZhouKanglei/BriMA.

Abstract:
Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains. The code is available at https://github.com/ywh1093/Fairness-Optimization.

Abstract:
Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates hierarchical semantic representation with evolutionary reasoning based on Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens capturing localized semantic cues, which are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations are selected using JS divergence scores to highlight salient dependencies across concepts. These hierarchical representations are then injected into MLLM via CoT-driven prompting, enabling step-wise reasoning. Besides, HIER utilizes a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER consistently outperforms state-of-the-art methods and MLLMs with 1-3% gains across all metrics. Code and more results are available at https://github.com/thuiar/HIER.

Abstract:
Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality "in-the-wild" videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, "in-the-wild" videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.

Abstract:
This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at this https://github.com/Ferry-Li/BlackMirror.

Abstract:
Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, layout generation, remains underexplored. Distinct from traditional graphic layout design and room layout planning, document layout generation typically involves a larger number of elements per page and exhibits greater structural diversity and complexity. Currently, a major obstacle lies in the scarcity of diverse document layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniDocLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniDocLayout-LLM, a 0.5B model with designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from our dataset with coarse category definitions, and 2) transferring the knowledge to a specific domain with few fine-grained annotated samples. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M^6Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. More information can be found at https://github.com/opendatalab/OmniDocLayout.

Abstract:
Generating RGB-A videos, which include alpha channels for transparency, has wide applications. However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB-A distributions. We adjust both the latent space and noise space, shifting the alpha distribution outward while preserving the RGB distribution, thereby enabling stable transparency generation without compromising RGB quality. Specifically, for the latent space, we propose a transparency-aware bidirectional diffusion loss during VAE training, which shifts the RGB-A distribution according to likelihood. For the noise space, we propose shifting the mean of diffusion noise sampling and applying a Gaussian ellipse mask to provide transparency guidance and controllability. Additionally, we construct a high-quality RGB-A video dataset. Compared to state-of-the-art methods, our model excels in visual quality, naturalness, transparency rendering, inference convenience, and controllability. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.

Abstract:
Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code is available at https://github.com/zhouhao94/PRF.

Abstract:
Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U-Net architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). These modules are designed to enhance local-global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12% and average Dice score of 93.09% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state-of-the-art models. Compared to U-Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6% and GFLOPs by 97.6%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61% IoU and 87.23% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation. Our code is publicly available at: https://github.com/maklachur/MambaLiteUNet.

Abstract:
Collaborative fairness in federated learning ensures that clients are rewarded according to their contributions, thereby fostering long-term participation among clients. However, existing methods often under-reward low-contributing clients in the early training stage and neglect critical issues (consistency across local models or unequal neuron training frequencies in the global model), leading to degraded performance. To address these issues, we propose FedRAC, a novel Federated learning framework employing Rolling submodel Allocation for Collaborative fairness, without compromising the global model performance. First, we design a dynamic reputation calculation module with a theoretical fairness guarantee to generate reputations matching clients' contributions. It adjusts their reputations dynamically during training, ensuring low-contribution clients access better models in the early stages for adequate training. Second, we propose a rolling submodel allocation module that assigns high-performance submodels to clients with high reputations. This module prioritizes low-frequency neurons during allocation and is supported by theoretical convergence guarantees, ensuring that all neurons in the global model are fully trained. Extensive experiments are conducted on four public datasets to confirm the advantages of our method in terms of fairness and model accuracy. The source code is available at https://github.com/ZiHuiWangpcl1/FedRAC.

Abstract:
Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD.

Abstract:
Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: https://github.com/wang-shulei.

Abstract:
Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing.Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video without additional training and directly optimize pre-trained source 4DGS.Extensive experiments on multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency prior approaches. Project page for results and code: https://di-lee.github.io/dynamic-eDiTor

Abstract:
The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.

Abstract:
Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a VLFM that operates with RGB inputs at inference while retaining spectral cues learned during training and producing rich cross-modal representations. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across satellite benchmarks, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, linear probing by 2.7%, and open-vocabulary segmentation by 2.8% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/

Abstract:
Bird's Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges on accurate segmentation of foreground object and efficient estimating of layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. By optimising each ray and incorporating depth features, BEV-CAR effectively addresses the challenges posed by object occlusions and varying environmental conditions. It ensures robust performance across diverse scenarios, particularly improving the accuracy of foreground object segmentation and layout estimation in occluded areas. And extensive experiments on the nuScenes and Argoverse datasets demonstrate that BEV-CAR achieves state-of-the-art (SOTA) performance. More importantly, the rasterization technique in this paper does not introduce additional computational overhead during the inference process, making it suitable for practical deployment in real-world scenarios. Code is available in https://github.com/BEV-DAR/BEV-DAR.

Abstract:
Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With multi-to-multi matching paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally. To facilitate the training of UMO, we develop a customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving. Code and model: https://github.com/bytedance/UMO.

Abstract:
Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to noise, blur and other adverse conditions. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, consequently, perform poorly even after finetuning. In this paper, we introduce ELVIS (Enhance Low-Light for Video Instance Segmentation), a framework that enables domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS is comprised of an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile estimation network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performances by up to +3.7AP on the synthetic low-light YouTube-VIS 2019 dataset and beats two-stage baselines by at least +2.8AP on real low-light videos. Code and dataset available at: https://joannelin168.github.io/research/ELVIS

Abstract:
The remarkable success of modern Deep Neural Networks (DNNs) can be primarily attributed to having access to compute resources and high-quality labeled data, which is often costly and challenging to acquire. Recently, text-to-image Diffusion Models (DMs) have emerged as powerful data generators to augment training datasets. Machine learning practitioners often utilize off-the-shelf third-party DMs for generating synthetic data without domain-specific expertise or adaptation. Such a practice leads to a novel and insidious threat: a diffusion model infected with a backdoor can effectively spread into a large number of downstream models, causing a backdoor pandemic. To achieve this for the first time, we propose Eidolon, designed and optimized to stealthily transfer the backdoor injected into a single diffusion model into virtually an unlimited number of downstream models without any active attacker role in the downstream training tasks. Proposed Eidolon not only makes the attack stealthier and effective, but it also enforces a strict threat model for injecting a backdoor into the downstream model compared to conventional backdoor attacks. We propose four necessary tests that a successful backdoor attack on the diffusion model should pass to cause a backdoor pandemic. Our evaluation across a wide range of benchmark datasets and model architectures exhibits that only our attack successfully passes these tests, causing widespread pandemic across many downstream models. Code is available at https://github.com/ML-Security-Research-LAB/Eidolon

Abstract:
Semi-Supervised Regression (SSR) is essential in domains like sentiment analysis and healthcare where labeled data is limited but unlabeled data is plentiful. Despite its practical importance, SSR remains underexplored due to the lack of effective pseudo-labeling strategies for continuous outputs. Unlike classification, regression lacks inherent confidence measures, making it harder to filter and trust pseudo-labels. This limitation permits low-quality pseudo-labels to propagate during training without proper validation, significantly amplifying prediction errors in SSR frameworks. In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. It has introduces two key innovations: 1) Gaussian Consistency Filter (GCF) that quantifies prediction consistency across weakly augmented views through Gaussian similarity scoring, retaining pseudo-labels only when all predictions fall within a confidence interval; 2) Adaptive Gaussian Standard Deviation Smoothing (AGDS) that enhances GCF's robustness through a Bayesian-regularized curriculum that phases confidence intervals from warm-up conservative bounds to progressively tightened thresholds. The use of AGDS ensures stable and reliable pseudo-label filtering throughout training. Extensive experiments demonstrate that GaussianMatch performs strongly across varying data conditions, showing notable robustness under extreme label scarcity. For instance, it outperforms the state of the art on UTKFace (30 labels), reducing error by 15.36% and improving the Coefficient of Determination by 50.21%. Our code is available at https://github.com/pywin/GaussianMatch.

Abstract:
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 7 state-of-the-art methods on Long-RVOS to show that current approaches struggle severely with the long-video challenges. We further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. We hope that Long-RVOS and our baseline can drive future RVOS research towards more realistic and long-form videos. Our dataset and code is available at https://isee-laboratory.github.io/Long-RVOS.

Abstract:
3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.

Abstract:
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.

Abstract:
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Project page: \tt \small \href https://binlee26.github.io/InterAgent-Page https://binlee26.github.io/InterAgent-Page .

Abstract:
Diffusion Transformers (DiTs) have achieved state-of-the-art image and video generation performance, but sampling remains expensive due to repeated transformer forward passes over many timesteps. Feature caching offers a training-free way to accelerate inference by reusing or forecasting hidden representations, yet recent forecasting-based methods derive their coefficients from hand-crafted formulas (e.g., Taylor expansion), which ultimately reduce to fixed linear combinations of a few historical features. Such fixed coefficients are suboptimal and fragile under aggressive skipping. In this paper, we first show that existing forecasting-based caching methods can be unified in a common linear form, and then analyze DiT feature trajectories, finding that for most denoising steps the current feature can be reconstructed from past features with projection fidelity above 0.95, indicating that accurate linear prediction is feasible. Motivated by this, we propose L^2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces hand-designed coefficients with learnable per-timestep weights trained on a small set of cached trajectories using a mean-squared error loss, converging in about 20 seconds on a single GPU. Extensive experiments on state-of-the-art DiTs demonstrate that L^2P consistently outperforms existing caching baselines: on FLUX.1-dev, L^2P achieves a 4.55xFLOPs reduction and 4.15xlatency speedup with a PSNR of 31.459, and on Qwen-Image and Qwen-Image-Lightning, it maintains high visual fidelity even under up to 7.18xacceleration, where prior methods suffer from noticeable quality degradation. These results show that learning linear predictors is a practical and effective alternative to designing increasingly complex forecasting formulas for efficient diffusion model inference. The implementation code is available at https://github.com/Aredstone/L2P-Cache.

Abstract:
Referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing due to its high efficiency in multi-tasking. However, RECS has reached a performance bottleneck rooted in representational insufficiency, primarily due to cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework to upgrade RECS performance. At the representation level, we introduce a language-guided unified contour decoding paradigm (LUCDP), using language-conditioned contours as intermediate carriers to synchronously decode VG and RIS, structurally preserving geometric and semantic consistency while enabling lightweight, efficient decoding. At refinement level, we introduce residual coarse-to-fine encoding (RCE), shifting fine stage from learning-from-scratch to error correction. At reaggregation level, we design channel isolated multi-scale fusion (CIMF) to achieve lossless feature fusion between channels. At regularization level, we employ gradient consistency loss (GCL) to enhance LUCDP and improve boundary adherence. Moreover, we validate RECS4R on remote-sensing and natural datasets, including RefDIOR, RRSIS-D, OPT-RSVG, RefCOCO, RefCOCO+, and RefCOCOg, and verify the image encoder under CNN, Transformer, and Mamba backbones, achieving advanced performance. The code will be publicly accessible at https://github.com/IPIU-XDU/RSFM.

Abstract:
With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, the adversarial examples can typically be designed to exhibit transferable power, attacking not only different models but also across diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. This is accomplished by SADCA establishing a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.

Abstract:
Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over talking face, such as speaking style and emotional expression, resulting in uniform facial motion. In this paper, we focus on improving two key factors: lip-audio alignment control(LAC) and emotion control(EMC), to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control ensures accurate lip-sync across Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over talking face, such as speaking style and emotional expression, resulting in uniform facial motion. In this paper, we focus on improving two key factors: lip-audio alignment control(LAC) and emotion control(EMC), to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control ensures accurate lip-sync across varied speaking styles to simulate different talking habits, whereas emotion control aims to generate realistic emotional expressions with varying intensities and mixed emotional states. To achieve precise facial animation control, we propose a novel and efficient framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our LAC module generates lip-synced talking faces with a specific speaking style, derived from either a video reference or preset options. It also supports lip movement scale adjustment and fine-grained editing of speaking styles for specific articulations. Second, our EMC module produces vivid emotional facial expressions through pure emotional deformation. It further enables precise control over emotion intensity and the compound emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves SOTA performance on HDTF and MEAD datasets in experiments. Project page: https://bq-wang0511.github.io/PC-Talk/

Abstract:
Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weight is released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.

Abstract:
Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions (e.g., noise), restricting its applicability. Benefiting from the human-friendly linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems (e.g., GPT-4), limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's (HVS) reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs. The code and dataset will be available at https://github.com/KeiChiTse/S2I.

Abstract:
Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (i) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (ii) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding-decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain-computer interfaces. Code and project page are publicly available at https://michaelmaiii.github.io/NeuroFlow-S.

Abstract:
Efficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. However, many current semi-dense feature matching methods scale quadratically with spatial resolution, either due to transformer-based long-range context modeling or from redundant full correlation computations. To overcome these limitations, we present SLiM (Salient Lightweight Matching), a novel scalable feature matching method that delivers reliable matches with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpasses expectation-based regression in subpixel accuracy. Experimental results show that SLiM consistently achieves a strong accuracy-efficiency trade-off across indoor and outdoor benchmarks, with clear advantages in memory usage, inference speed, and high-resolution scalability. These results demonstrate the practical efficiency and strong scalability of SLiM. Project page: https://github.com/Band-127/SLiM

Abstract:
World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation.However, prior benchmarks, however, each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability.To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical Realism, and cross-modal coherence.Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments.We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

Abstract:
Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme. Project page: https://hanchaoliu.github.io/RetrievalGuidedDNO/

Abstract:
Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement.We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation. Project page: https://bshfang.github.io/worldreel/

Abstract:
Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDPERTURB, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDPERTURB perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results demonstrate that training FR on datasets generated using IDPERTURB yields improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches. Code and generated datasets are publicly available https://github.com/fdbtrs/IDperturb.

Abstract:
Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed x_0-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2xaccording to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision.

Abstract:
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations. Code is available at https://github.com/Programmergg/FAPE-IR.

Abstract:
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates/.

Abstract:
Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

Abstract:
RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code is available at https://github.com/Chuancy-TJU/SpiralDiff.

Abstract:
Zero-shot image restoration provides a flexible way to handle diverse degradations without task-specific training. However, existing methods typically rely on stacked layers or pre-trained features to enhance degradation expression, while overlooking physically consistent priors. The insufficient degradation prompts impose the heavy training burden and high sampling costs during zero-shot diffusion. Moreover, the fixed inference trajectory often collapses to suboptimal solutions under complex corruptions. We observe that heterogeneous degradations can be reparameterized into a minimal set of physically coherent parameters for compact representation. Based on this insight, we first propose a unified physical zero-shot image restoration (UP-ZeroIR) framework that explicitly models heterogeneous degradations into a homogeneous all-in-one distribution. The distribution can be optimized directly in the latent space, enabling principled solution exploration and effective prompt adaptation. Besides, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. Extensive experiments demonstrate that our method achieves state-of-the-art performance across both single and mixed degradations. Our code is available at https://github.com/yangjinglyy/UP-ZeroIR.

Abstract:
Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL

Abstract:
Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.

Abstract:
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives (state tracking, retrodictive inference), constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1's 165K real-world samples. We attribute this to fundamental temporal skills--such as tracking frame-by-frame changes and comparing velocity--that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path. Code is released on https://github.com/jiangsongtao/Synthetic-Video.

Abstract:
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.

Abstract:
Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at https://github.com/hren20/STRNet.

Abstract:
Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at https://github.com/mlvlab/RegFormer.

Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at https://github.com/JunboKe0619/CAFE.

Abstract:
Tensor-based methods have attracted growing interest due to their ability to reduce trainable parameters and offer advantages over matrix-based approaches in parameter-efficient fine-tuning (e.g., LoRA and PiSSA), particularly in capturing inter-layer correlations. However, directly applying tensor methods requires repeated reconstruction of model weights during training, introducing substantial overhead and becoming a key bottleneck for deployment. To address this limitation, we propose Reconstruction-Free Tensor-Based Adaptation (ReFTA), which offers four advantages: (1) eliminating weight tensor reconstruction via tensor algebraic properties; (2) reducing initialization quantization error by fine-tuning only the principal tensor components; (3) providing a rigorous generalization guarantee rooted in tensor product algebra; and (4) enabling a unified design controlled by a single tensor rank configuration. Extensive experiments on both image classification (IC) and natural language understanding (NLU) tasks demonstrate that ReFTA achieves the best accuracy-efficiency trade-off among all evaluated methods. Across most cases, ReFTA attains the highest average accuracy with the fewest trainable parameters. On IC tasks with the ViT-Large model, ReFTA surpasses LoRA by 5.6% in average accuracy, while reducing the number of trainable parameters by up to 96% compared to LoRA. On NLU tasks with RoBERTa-Large, ReFTA improves the average accuracy by approximately 5% over most existing methods while using only 86.4% fewer parameters than LoRA (r=1) and 97.5% fewer than PiSSA. The code is available at https://github.com/jzheng20/ReFTA.

Abstract:
Multimodal Large Language Models (MLLMs) like LLaVA have demonstrated remarkable capabilities in multi-modal understanding and generation. This success motivates us to investigate whether the inherent prior knowledge embedded within such MLLMs contains sufficient spatial awareness for dense prediction tasks, without requiring any task-specific fine-tuning. Thus, in this paper, we explore the utilization of LLaVA for training-free open-vocabulary semantic segmentation. We discover that certain layers within the LLM part of LLaVA can generate localized features corresponding to given object classes. Building on this intrinsic capability, we design three modules: A question-answer pipeline to identify target classes in the image, a text-visual response module to extract initial reliable pixel-level activations for the target class, and a visual generation module to produce reliable refined prompts, which further serve as guidance for SAM to generate the predictions. Our LLaVA-based approach achieves new state-of-the-art performance on "Thing" category datasets, e.g., PASCAL VOC 2012 and COCO-object. Moreover, our method does not require explicit background class names, demonstrating its exceptional potential for handling open-world scenarios. Code is available at https://github.com/zbf1991/FSeg-LLaVA.

Abstract:
Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames(many) to Latent Frame(one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.

Abstract:
As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta-path strategies. To further ensure task reliability, a multi-stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi-step web interaction tasks. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval-Bench effectively distinguishes agent performance, offering a new perspective on automated agent evaluation.

Abstract:
Rotational symmetry is an important prior in 6D pose estimation, improving pose accuracy and supporting symmetry-aware evaluation. However, current symmetry annotations for 3D objects remain largely manual or semi-automatic, often requiring predefined types or orders, which limits scalability. This work introduces a fully automatic, reference-free framework for symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. The method localizes a dominant high-order axis, infers its rotational order through self-consistency analysis, and reconstructs the complete symmetry structure under a hierarchy-guided formulation. A texture-aware extension further models appearance-induced reductions in rotational order while preserving axis orientations. Experiments on idealized and real-world datasets demonstrate strong accuracy and generalization, achieving 94.75% accuracy on 438 symmetric objects in GSO. Training FoundationPose with these priors improves accuracy by up to 0.9% across five BOP datasets, showing that automatically estimated rotational priors improve downstream 6D pose estimation. Code is available at https://github.com/WangYuLin-SEU/KASAL.

Abstract:
Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge-Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms Naive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer. The code and pre-trained model are available at https://github.com/jjh6297/KNOW

Abstract:
Multimodal Large Language Models (MLLMs) have made great progress in video understanding tasks. However, when it comes to understanding complex or lengthy videos, MLLMs tend to overlook details or produce hallucinations. To alleviate these issues, recent work has attempted to leverage reinforcement learning (RL) to boost models' deep linguistic reasoning of complex videos. But these methods have two main problems: First, the RL framework they used has unstable training, high training costs, and is difficult to train satisfactory video reasoning models; Second, the linguistic reasoning process is difficult to guarantee the reliability of visual information. To alleviate these problems, we propose to use multimodal elements for reasoning, and we design a novel framework to build and enhance versatile video reasoning capabilities on MLLMs. We carefully design a multi-task cold start and multi-task reinforcement learning to improve the model's visual perception and proficiency in multiple capabilities. In the inference phase, we leverage multimodal reasoning and dynamic sampling to further improve the performance. We verified the efficiency of the framework on a base MLLM (Qwen2-VL-7B-Base). Through cold-start with 3k data and reinforcement learning training with 5k data, combined with inference design, our final model significantly outperforms the base model on seven public video benchmarks, even surpassing and approaching the state-of-the-art Instruct Models such as Qwen2.5-VL-7B-Instruct trained with large-scale data. Our code will be available at https://github.com/Wang-Xiaodong1899/VideoReasoner.

Abstract:
Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings. Project page: https://johnneywang.github.io/AnyID-webpage.

Abstract:
Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). Although AdamW accelerates training and fine-tuning in large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of the second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplifies AdamW's sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores the effectiveness of AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates with the global descent direction to curb client drift. Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter (\varepsilon,\delta)-DP guarantees. Our empirical results demonstrate the strong performance of DP-FedAdamW across language and vision Transformers, as well as ResNet-18. On Tiny-ImageNet (Swin-Base, \varepsilon=1), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83%. The code is available at https://github.com/junkangLiu0/DP-FedAdamW.

Abstract:
High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.

Abstract:
Millimeter-wave (mmWave) point clouds have attracted increasing interest in human sensing due to their robustness, privacy preservation, and low cost. However, their practical adoption is hindered by the inherent sparsity of data and the lack of large-scale annotated dataset. We revisit generative modeling and propose a unified flow-matching framework mmWaveFlow that unifies enhancement and generation of mmWave point clouds by learning an invertible transport between dense and sparse point clouds. We leverage paired data and a Cross-modal Latent Alignment module to enforce semantic alignment and bridge the modality gap. We find that condition-free flow matching is more vulnerable to latent path crossings, which impair transport. Therefore, we propose Origin-Aware Flow Matching (OA-Flow) by conditioning transport on the path origin to mitigate ambiguity in bidirectional transport. Results of experiments across multiple datasets demonstrate the effectiveness of mmWaveFlow for mmWave human point clouds generation and enhancement. We also observe consistent gains in downstream tasks, highlighting the promise of our framework for human sensing. Codes are available at https://github.com/suchang-99/mmWaveFlow.

Abstract:
Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chains-of-thoughts (CoT) about the content of input images and videos. For video inputs, prior works use complex multi-step pipelines that extract and include relevant frames from videos in the CoT, or produce simpler single-stage reasoning traces at the expense of poor temporal grounding. Here, we propose the first video LLMs with single-stage reasoning that includes explicit references to relevant frames, thereby reducing temporal inconsistencies in the reasoning process. Our approach is simple, unified, and self-contained, employing a single-stage inference to handle complex video understanding tasks without relying on auxiliary modules for frame selection or caption generation. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces from both natural and synthetic videos, spanning various topics and tasks. Our models, obtained by fine-tuning video LLMs on this chain-of-frames (CoF) data, generate reasoning traces that accurately identify key frames to answer given questions. In turn, this consistently improves performance across multiple video understanding benchmarks. Surprisingly, we find that synthetic data alone, despite being out-of-distribution with respect to these real-world benchmarks, provides a significant boost in model accuracy.

Abstract:
Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at https://funfact-scenegraph.github.io/.

Abstract:
Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.

Abstract:
Image denoising is a fundamental task in computer vision aimed at recovering clean images from noise-corrupted observations. While supervised deep learning methods achieve remarkable performance when trained on paired data with known noise levels, their real-world applicability is limited as noise characteristics are often unknown. Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. Building upon theoretical insights from Noisier2Noise, we rigorously derive a relationship between the noise level and the variance of the denoised image, enabling robust estimation via a deep learning model and a ternary search strategy. The estimated noise level is then used to synthesize training pairs for supervised denoising. Experiments demonstrate that our method outperforms existing unsupervised approaches and traditional noise estimation techniques, achieving performance competitive with--and in some cases surpassing--supervised methods trained with known noise levels. The proposed framework effectively overcomes the training data pair limitations of supervised approaches for unknown additive white Gaussian noise. Our code is available at https://github.com/zhanzhanblue/CANC.

Abstract:
Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and includes major crop types, including corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.

Abstract:
Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals.In the paper, we present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames.Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet speeds up 150 times. The code is open-sourced at https://github.com/Kin-Zhang/TeFlow along with trained model weights.

Abstract:
Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a \underline s ingle \underline r andom \underline d istortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. To address this issue, we propose a novel training strategy that enhances robustness and generalization via \underline meta -learning with \underline f eature \underline c onsistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated "unknown" distortion for the meta-testing phase.Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch.To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent.Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59%, 4.71%, and 2.38% under high-intensity, combined, and unknown distortions. Code has been made \href https://github.com/007-li/Meta-FC open-source .

Abstract:
Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62% against state-of-the-art defenses while achieving 47% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems. Code will be released at https://github.com/yihangtao/MVIG.git.

Abstract:
Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.

Abstract:
Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: https://augusthoeg.github.io/VoDaSuRe/

Abstract:
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning.

Abstract:
Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ--OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios. Code and dataset are publicly available at https://github.com/kaist-cvml/FAKER-Air.

Abstract:
Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models struggle to preserve both structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance. Project Page: https://xtryer-s.github.io/projects_pages/GAR_Font.

Abstract:
Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such decoupled structure not only harm the generation efficiency but also hinder the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training.To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256x256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments. Project page: https://ai4imaginglab.github.io/MARVAL-RL-Page/

Abstract:
Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet), for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models. Project page: https://kdhrick2222.github.io/projects/UCMNet.

Abstract:
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research. All datasets, models, and code are available at https://github.com/XIEFOX/PixDLM.

Abstract:
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos--by first skimming globally and then examining relevant clips for details--we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 652 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model weights are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.

Abstract:
A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific method for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. The code and dataset will be released. Project page: https://chartist-ai.github.io/.

Abstract:
Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Abstract:
In open-world semi-supervised learning (OWSSL), a model learns from labeled data and unlabeled data containing both known and novel classes. In practical OWSSL applications, models are expected to perform rigorous classification by directly selecting the most semantically relevant label from a candidate set for each sample. Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. SECOS leverages external knowledge to extract and align semantic representations across modalities for both known and novel classes, providing explicit supervisory signals for training novel classes. Extensive experiments demonstrate that even when existing OWSSL methods are evaluated under the more lenient post-hoc matching setting, SECOS still surpasses them by up to 5.4% without such assistance, highlighting its superior effectiveness. Code is available at https://github.com/ganchi-huanggua/OSSL-Classification.

Abstract:
Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions. Our project is available at:https://github.com/NingWang2049/skeletoncontext.

Abstract:
Diffusion models have achieved remarkable performance on a wide range of generative tasks, yet training them from scratch is notoriously resource-intensive, typically requiring millions of training images and many GPU days. Motivated by a data-centric view of this bottleneck, we adopt a condensation-based perspective: given a large training set, the goal is to construct a much smaller condensed dataset that still supports training strong diffusion models under minimal data and compute budgets. To operationalize this perspective, we introduce Diffusion Dataset Condensation (D2C), a two-phase framework comprising Select and Attach. In the Select phase, a diffusion difficulty score combined with interval sampling is used to identify a compact, informative training subset from the original data. Building on this subset, the Attach phase further strengthens the conditional signals by augmenting each selected image with rich semantic and visual representations. To our knowledge, D2C is the first framework that systematically investigates dataset condensation for diffusion models, whereas prior condensation methods have mainly targeted discriminative architectures. Extensive experiments across data budgets from 0.8% to 8% of ImageNet, model architectures, and image resolutions demonstrate that D2C dramatically accelerates diffusion model training while preserving high generative quality. On ImageNet 256x256 with SiT-XL/2, D2C attains a FID of 4.3 in just 40k steps using only 0.8% of the training images, corresponding to about 233x and 100x faster training than vanilla SiT-XL/2 and SiT-XL/2 + REPA, respectively. Code: https://github.com/xie-lab-ml/Diffusion-Dataset-Condensation.

Abstract:
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.

Abstract:
Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding. Project page: https://abhi-rf.github.io/egoflow/

Abstract:
Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Code is available at https://github.com/ZixuanWang0525/CoECT.

Abstract:
Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: https://filby89.github.io/mochi.

Abstract:
Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach. The code is publicly available at: https://github.com/Event-AHU/OpenFusion/tree/main/NESTOR.

Abstract:
Anatomical structures in MRI exhibit strong spatial priors, including well-defined boundaries, low inter-subject variability, and consistent topology. These properties naturally induce clustered patterns in the latent space, which are difficult to capture using conventional continuous generative priors that assume smooth manifold distributions. To address this limitation, we propose DiCoS (Discrete-Continuous Synthesis), a generative reconstruction framework that integrates discrete structural reasoning with continuous refinement. DiCoS models an anatomy-aware discrete distribution and generates diverse reconstructions in one coarse-to-fine pass through a Discrete Prior Network (DPN). A Dual-domain Balanced Scoring (DBS) mechanism adaptively evaluates candidates using both image-domain fidelity and k-space consistency. To further enhance realism, Micro Diffusion Cycles (MDC) perform efficient score-guided refinement to enhance texture realism without disturbing global topology. Experiments on the fastMRI knee and brain datasets demonstrate that DiCoS achieves state-of-the-art reconstruction quality with sharper boundaries and improved anatomical consistency. Beyond pixel metrics, segmentation-based evaluations further confirm superior structural overlap and semantic alignment, highlighting DiCoS's advantages in anatomy-aware reconstruction. The project page is available at https://kincin.github.io/DiCoS/.

Abstract:
Modern vision systems are deployed in settings where occasional catastrophic failures matter more than average accuracy--for example in medical imaging, autonomous driving, and safety monitoring. While conformal prediction gives distribution-free uncertainty guarantees, most existing methods only control mean error and are hard to tune toward rare but high-cost mistakes. We propose Bayesian-Quadrature Spectral Risk Control (BQ-SRC), a general framework for controlling tail-focused risks (such as conditional value at risk (CVaR)-style objectives) in a distribution-free way. BQ-SRC views conformal prediction through a Bayesian-quadrature lens and replaces mean-risk control with a flexible family of risk-averse criteria, while keeping the same black-box access to a trained model. A binomial testing scheme reduces the Monte Carlo conservatism of prior approaches, leading to tighter sets without sacrificing guarantees. We evaluate BQ-SRC across diverse vision tasks, including synthetic regression, closed-set and zero-shot image classification, multilabel classification, and semantic segmentation. Across these settings, BQ-SRC consistently maintains finite-sample risk guarantees and often yields smaller or otherwise more informative prediction sets than existing conformal and risk-controlling baselines, sometimes trading a modest amount of efficiency for stronger tail-risk control. The code will be publicly available at https://github.com/MohammadMahdiKazemi/BQ_SRC.

Abstract:
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution--its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage--while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality. We provide detailed demonstrations of the over-optimization process and corresponding visualizations in https://jingw193.github.io/GRPO-Guard/.

Abstract:
Point cloud denoising is a critical preprocessing step for enhancing the reliability and accuracy of 3D perception systems. Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and over-smoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that enables routing on demand by adaptively determining the optimal denoising path for each local patch based on its noise characteristics. DSNet incorporates a noise discriminator that quantifies local noise intensity by analyzing normal similarity, and a reverse monotonic decision function that maps this measure to an appropriate denoising module. Furthermore, we propose a path-selective iteration mechanism that dynamically re-evaluates the restoration state and re-plans the denoising route at each stage, enabling cross-stage skipping to minimize unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that DSNet achieves state-of-the-art performance in noise suppression, geometric fidelity, and computational efficiency. The code is available at https://github.com/cz-61/DSNet

Abstract:
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap

Abstract:
This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.

Abstract:
Spatiotemporal image generation is a highly meaningful task, which can generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event scenes that are both spatially and temporally coherent. The core idea is a drift-asynchronous diffusion bridge. Specifically, it consists of three main modules: a) Composed Bridge Initialization, which replaces noise initialization. It starts the diffusion from a composed pre-event state, modeling a diffusion bridge process. b) Asynchronous Drift Diffusion, which uses a pixel-wise drift map, assigning different drift magnitudes to event and temporal evolution. This enables differentiated generation during the pre-to-post transition. c) Drift-Aware Denoising, which embeds the drift map into the denoising network, guiding drift-aware reconstruction. Experiments show that ChangeBridge can generate better cross-spatiotemporal aligned scenarios compared to state-of-the-art methods. Additionally, ChangeBridge shows great potential for land-use planning and as a data generation engine for a series of change detection tasks. Code is available at https://github.com/zhenghuizhao/ChangeBridge

Abstract:
We investigate recently introduced domain-class incremental learning scenarios for vision-language models (VLMs). Recent works address this challenge using parameter-efficient methods, such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring that adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code is available at: https://github.com/YonseiML/dpw.

Abstract:
In the field of Compressive Sensing (CS), deep unrolling networks (DUNs) have demonstrated exceptional performance and interpretability by integrating traditional optimization solvers with deep networks. However, existing DUNs suffer from homogenization in cross-stage feature extraction and insufficient integration of gradient-guided information. Additionally, the feature extraction module struggles to balance the global receptive field and computational efficiency, which limits improvements in image reconstruction details. To address these challenges, we propose a multi-scale gradient-guided unrolling architecture with adaptive Mamba for CS, named MambaCS. Specifically, we utilize our customized Adaptive State-Space Block (A-SSB) to unroll the well-known Proximal Gradient Descent (PGD) algorithm across multiple feature levels to extract comprehensive image features while maintaining computational efficiency. Moreover, we design a High-Dimensional Gradient Fusion (HDGF) that ensures the persistent and stable injection of gradient-guided information across various scales and dimensions, while effectively eliminating information bottlenecks. Finally, we develop a Feature-Adaptive Proximal Operator (FAPO), using A-SSB as an extension of the sparse basis associated with the PGD proximal operator, which enhances sensitivity to multi-scale features and improves detail reconstruction. Extensive experiments demonstrate the significant advantages of our proposed MambaCS over the current SOTA methods. Our code is available at https://github.com/nikou-arch/MambaCS.

Abstract:
We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods which struggle to erase target objects without generating unexpected content within the masked regions due to lack of sufficient paired training data and explicit constraint on content generation, our method allows to produce high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving the overall context coherence to the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train for a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Our code is available at https://zyxunh.github.io/YOEO-ProjectPage/.

Abstract:
Vision-Language Models (VLMs) are limited by static knowledge and insufficient fine-grained visual analysis, hindering their performance on knowledge-intensive and visually complex tasks. While recent research has explored VLMs that employ external tools like search or cropping to enhance model performance, they typically employ tools in isolation and lack the ability to coordinate multiple tools effectively. To address this gap, we propose SenseSearch, the first agentic VLM for search-reasoning that supports adaptive multi-tool coordination via reinforcement learning (RL). Specifically, SenseSearch dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. We first construct a high-quality cold-start dataset to instill basic tool-usage behaviors. In the subsequent RL stage, we introduce Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to enhance the tool invocation and reasoning ability. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseSearch achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks, outperforming baselines by 19.18% on HR-MMSearch. SenseSearch provides a promising path toward agentic VLMs with effective and robust tool invocation capabilities. All code and data are released at https://github.com/OpenSenseNova/SenseNova-MARS.

Abstract:
To utilize pre-trained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach to rank important layers, followed by important channels of these layers, by analyzing topological features of their activation spaces through Betti Numbers and using forward passes alone. AdaBet allows selecting layers and channels with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluating AdaBet on sixteen pairs of benchmark models and datasets shows AdaBet achieves an average gain of 2.5% more classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40%. We open-source our code at https://github.com/Nokia-Bell-Labs/efficient_layer_selection

Abstract:
Thermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account for thermal emission, the reflected component, and atmospheric transmittance to produce physically consistent thermal synthesis. We also introduce VGGT-IR, a Transformer-based feed-forward initializer that takes TIR input with optional RGB and directly regresses camera poses and initial geometry, providing a modality-aligned and stable starting point for PhysIR-Splat. Extensive experiments demonstrate that our method significantly surpasses existing approaches in thermal reconstruction quality and cross-view consistency, effectively suppressing floating artifacts and enhancing boundary sharpness. The source code will be publicly available at : https://github.com/JingyuanGao0919/physir-splat.

Abstract:
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our project page: https://longhz140516.github.io/MajutsuCity/

Abstract:
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31xand 2.07xlatency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.

Abstract:
Estimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting.Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view. While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framework via orientation-aware NVS from a single image. Specifically, we introduce the Orientation-Aware Guidance, which explicitly injects object orientation cues into the reference latent embedding to enhance orientation awareness during viewpoint transformation. We also introduce an orientation consistency loss that supervises viewpoint transformation at the geometric level, establishing sufficient supervision for explicit and geometry-consistent transformation guidance beyond pixel-level similarity. This loss justifies estimating the reference orientation rather than using its ground-truth pose, thereby ensuring the alignment of coordinate domains between the injected and supervised priors. Extensive experiments demonstrate that OrienPose achieves state-of-the-art performance in single-view unseen object pose estimation and impressive robustness to image degradations. Our code is available at https://github.com/pubyLu/OrienPose.

Abstract:
Deep unfolding networks (DUNs) have achieved remarkable success and become the mainstream paradigm for spectral compressive imaging (SCI) reconstruction. Existing DUNs are derived from full-HSI imaging models, where each stage operates directly on the high-dimensional HSI, refining the entire data cube based on the single 2D coded measurement. However, this paradigm leads to computational redundancy and suffers from the ill-posed nature of mapping 2D residuals back to 3D space of HSI. In this paper, we propose two novel imaging models corresponding to the spectral basis and subspace image by explicitly integrating low-rank (LR) decomposition with the sensing model. Compared to recovering the full HSI, estimating these compact low-dimensional components significantly mitigates the ill-posedness. Building upon these novel models, we develop the Low-Rank Deep Unfolding Network (LRDUN), which jointly solves the two subproblems within an unfolded proximal gradient descent (PGD) framework. Furthermore, we introduce a Generalized Feature Unfolding Mechanism (GFUM) that decouples the physical rank in the data-fidelity term from the feature dimensionality in the prior module, enhancing the representational capacity and flexibility of the network. Extensive experiments on simulated and real datasets demonstrate that the proposed LRDUN achieves state-of-the-art (SOTA) reconstruction quality with significantly reduced computational cost. Code is available at https://github.com/huang-he99/LRDUN.

Abstract:
Infrared and Visible Image Fusion (IVIF) aims to combine complementary information from infrared and visible images to overcome the limitations of a single modality. While existing methods typically employ fixed or sample-adaptive fusion paradigms where fusion weights are static or derived from global pixel distributions, they often overlook spatial inconsistencies in pixel distribution within images, leading to suboptimal performance. To address this issue, we propose RegionFuse, a Region-Adaptive Pixel Distribution Learning Network for IVIF, which dynamically generates fusion weights based on local pixel distributions to construct a region-wise adaptive fusion paradigm. RegionFuse introduces a Mixture of Region Attention (MoRA) mechanism, which assigns each region to several specialized experts, enabling region-level feature interaction tailored to specific local distributions. Furthermore, we design a Region Feature Compression Module (RFCM) and place it after each MoRA to enhance informative regions and suppress redundant ones. Extensive experiments on various benchmarks demonstrate the superiority and robustness of RegionFuse, especially in handling non-uniform pixel distributions. Evaluations on NIR-VIS and downstream tasks further confirm its generalizability and practical utility. The code is available at https://github.com/DarkIceField/RegionFuse.

Abstract:
Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional consistency for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22x and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction. Code: https://github.com/Tan-Junwen/VDE

Abstract:
In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.

Abstract:
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code is available at https://github.com/zhangquanchen/3DThinker.

Abstract:
Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. Existing In-Context-Learning based methods are limited by their highly coupled training paradigm. These methods attempt to achieve both high subject fidelity and multi-dimensional human preference alignment within a single training stage, relying on a single, indirect reconstruction loss, which is difficult to simultaneously satisfy both these goals. To address this, we propose MultiCrafter, a framework that decouples this task into two distinct training stages. First, in a pre-training stage, we introduce an explicit positional supervision mechanism that effectively resolves attention bleeding and drastically enhances subject fidelity. Second, in a post-training stage, we propose Identity-Preserving Preference Optimization, a novel online reinforcement learning framework. We feature a scoring mechanism to accurately assess multi-subject fidelity based on the Hungarian matching algorithm, which allows the model to optimize for aesthetics and prompt alignment while ensuring subject fidelity achieved in the first stage. Experiments validate that our decoupling framework significantly improves subject fidelity while aligning with human preferences better.

Abstract:
Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.

Abstract:
In real-world applications, multi-view data often suffer from missing situations due to privacy protection and sensor failures. Such incomplete scenarios not only reduce information availability but also cause significant imbalance among views: certain "strong views" dominate the fusion process, while "weak views" contribute marginally, undermining cross-view collaboration. Existing incomplete multi-view clustering methods mainly focus on handling missing data, but overlook the view contribution imbalance induced by incompleteness and its impact on representation learning and clustering performance. To address this issue, we first analyze the data imbalance caused by missing views and the resulting disparities in view learning quality. Then, we propose an Imbalanced Contribution Evaluation and Refinement framework (ICER) for imbalanced incomplete multi-view clustering. Specifically, we employ Shapley values to quantify each view's marginal contribution, and incorporate unbalanced optimal transport to characterize distributional deviations across views. On this basis, we construct the view contribution imbalance metric to comprehensively evaluate cross-view collaboration and fusion quality, and design a collaboration enhancement module to explicitly reinforce inter-view cooperative optimization. Extensive experiments on multiple datasets demonstrate that the proposed method outperforms existing incomplete multi-view clustering approaches, validating the effectiveness and necessity of explicitly modeling and mitigating view imbalance in imbalanced incomplete scenarios. Code is released at https://github.com/Evelyn-zhou24/ICER.

Abstract:
Active search and tracking of arbitrary targets by Unmanned Aerial Vehicles (UAVs) in cluttered environments remains a highly challenging problem. Existing methods either construct complex modular pipelines, leading to substantial computational costs, or adopt end-to-end controllers that often fail to generalize across different targets and scenes. Moreover, search and tracking are typically treated separately despite their strong interdependence. In this paper, we present UAST, a simple yet effective mapping-free framework that unifies active search and persistent tracking using only RGB-D observations. The proposed system couples a dual-branch perception module with a Rule-Based Point Search Policy that adaptively switches between tracking and search-based recovery. A lightweight control network generates dynamically feasible trajectories directly from fused perception and UAV states. Furthermore, we introduce a training strategy with an elaborated tracking-aware visibility loss and a tailored data construction. Extensive experiments in both simulated and real-world environments show that our approach achieves higher success rates, more stable long-term tracking, and faster target search compared with existing methods, while maintaining high efficiency. The code is available at https://github.com/qinliangql/UAST.

Abstract:
Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.

Abstract:
Amodal instance segmentation aims to segment both visible and occluded regions of object instance, which are challenging due to lacking inference support under occlusion. Most existing methods employ the prior knowledge about object mask (shape prior) to support the amodal estimation, but the shape prior is not always compatible for object instances in the test stage. In this paper, we explore the task of interactive amodal segmentation, where a few user clicks are available for better segmenting the complete masks of object instances. For this task, we propose a novel framework based on learning and aligning click-aware shape prior (termed ClickPriorNet). Specifically, we propose to learn click-aware shape prior with triplet loss, which forces the retrieved shape priors to have higher IoU with the ground-truth of target instance and thus could exactly facilitate the prediction. Besides, considering the inevitable mismatch between shape prior and target instance, we propose to adaptively align the shape prior with deformable attention. Overall, our model could make full use of the interactive clicks to retrieve and align shape priors, and thus could estimate more complete masks. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method. Code is available at https://github.com/chenbys/ClickPriorNet.

Abstract:
Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token's influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline. Code is available at https://aim-skku.github.io/ZOO-Prune.

Abstract:
Open-vocabulary part segmentation (OVPS) aims to segment objects into fine-grained parts while generalizing to unseen categories. Existing VLM-based methods face two challenges: (1) object over-segmentation, caused by overly broad semantic activations, and (2) part under-segmentation, resulting from weak fine-grained perception. To address these issues, we propose HOPS, a two-stage framework for hierarchical open-vocabulary part segmentation. HOPS introduces a bidirectional semantic-structural attention fusion mechanism that integrates CLIP's semantic alignment with DINO's structural perception. In the object segmentation stage, the Attention-Aware Filtering Module (AFM) refines cross-modal similarity maps via semantic-structural attention to suppress object over-segmentation. In the part segmentation stage, the Affinity-Guided Enhancement Module (AEM) iteratively propagates part responses to progressively expand activation regions, effectively mitigating part under-segmentation. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet demonstrate that HOPS achieves state-of-the-art performance with superior generalization. Our code is available at https://github.com/TJU-IDVLab/HOPS.

Abstract:
Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.git.

Abstract:
Static benchmark-driven evaluation has provided a valuable foundation for analyzing Text-to-Image (T2I) models.However, the fixed and predetermined prompt sets in benchmarks inherently limit diagnostic depth, making it difficult to uncover the full landscape of models' systematic failures or isolate their root causes.We argue for a complementary paradigm: active exploration, and introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscapes of T2I models at scale.Unlike benchmarks that evaluate a fixed prompt set, FailureAtlas performs guided exploration in the input space, framing error discovery as a structured search for minimal, failure-inducing concepts. While this is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (e.g., over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas.

Abstract:
Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects. Code is available at https://github.com/XinHu98/seeing

Abstract:
Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams.To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels.TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at https://github.com/figec/TTL.

Abstract:
The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on eleven challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://github.com/zhang9302002/ThinkingWithVideos.

Abstract:
Frame selectoin is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts--pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, a direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations.To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods. Our code is available at https://github.com/MAC-AutoML/WFS-SB.

Abstract:
Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to outperform current state-of-the-art methods both performance and efficiency. Code is available at https://github.com/buptLwz/FlexHook.

Abstract:
Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, mostly single-modality data, can't evaluate progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: https://github.com/CVI-SZU/X-PCR.

Abstract:
This work tackles a key challenge in test-time energy adaptation: prohibitive time overhead arising from recent state-of-the-art test-time adaptation (TTA) methods, which are built on energy models relying on iterative Monte Carlo or Langevin dynamics sampling with multiple stochastic updates per test instance to approximate energy gradients. We tackle the problem from an innovative control system perspective by i) describing the energy as a complex-valued wave, where the amplitude encodes energy uncertainty and the phase characterizes its evolution, and ii) maintaining a time-dependent wave equation that interprets TTA as a control system evolution process. By enforcing the control system law of probability current conservation, our method directs probability current away from high-energy (error-prone) regions toward low-energy (accurate) ones, achieving adaptive energy redistribution without additional stochastic sampling while preserving the overall normalization of the energy landscape. Experimentally, the proposed method significantly outperforms baseline methods across several public benchmark datasets, with adaptive time being only 1/3 1/7 of that required by the compared Top-1 to Top-3 baselines. Code is available at https://github.com/wongzbb/APT.

Abstract:
Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

Abstract:
Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. Project page is available at https://ehsan-ami.github.io/rlftsim.

Abstract:
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/

Abstract:
Transformer-based approaches have recently become the dominant paradigm for 3D instance segmentation. These methods typically employ a multi-layer decoder that iteratively refines a set of learnable queries into instance mask predictions. However, we observe that multiple queries often target the same instance simultaneously, leading to fragmented masks for a single object. We define this phenomenon as inter-query competition, which slows convergence and limits segmentation accuracy. To address this problem, we present CompetitorFormer, a novel framework designed for Transformer-based methods. Our method mitigates inter-query competition by explicitly modeling the competitive relationships among queries. Specifically, we introduce a Query Competition Layer before each decoder stage to construct a dynamic competitive landscape, allowing each query to perceive its relative importance. In addition, the proposed Relative Relationship Encoding and Rank Cross-Attention modules enhance both self-attention and cross-attention by prioritizing dominant queries. Extensive experiments show that our approach converges faster and achieves superior performance on the ScanNetV2, ScanNet++V2, ScanNet200, and S3DIS datasets. Code is available at https://github.com/DuanchuWang/CompetitorFormer.

Abstract:
Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability.To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization.It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization).Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references. The source code and dataset are available at https://github.com/iiiiiiiword/StyleGallery.

Abstract:
Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) Fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset's skeleton, mesh, and rendered image set; a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud-like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1,038 motion clips, each providing a standardized skeleton-mesh-rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets. The codes are available on our project page: https://animotionlab.github.io/MoCapAnything/

Abstract:
Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation, substantially improving robustness in challenging areas. Furthermore, we propose a Monocular Depth-guided Enhancement (MDGE) module that enhances depth probability map using monocular depth priors, thereby further boosting the depth estimation performance. Extensive experiments demonstrate that our method significantly improves reconstruction quality in difficult regions and achieves state-of-the-art (SOTA) performance on multiple benchmarks. The code will be released at https://github.com/bdwsq1996/SPE-MVS.

Abstract:
Underwater images often exhibit dominant blue-green hues due to wavelength-dependent light attenuation. While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. SDUIE-Quant allows continuous, numerical adjustment of enhancement levels via low-rank adaptation weight merging within a dual-branch diffusion model. This model comprises a supervised branch trained on synthetic underwater-terrestrial pairs and a self-supervised branch designed to preserve the natural hues of real-world underwater scenes. Building on this, SDUIE-Text introduces intuitive, language-guided control by aligning semantic prompts with visual enhancement effects, leveraging the learned fusion weights. This dual-modality design offers both precise control and flexible, user-preferred enhancement. Experimental results demonstrate that SDUIE achieves state-of-the-art results while better preserving the aesthetic qualities often missed by conventional methods. The source code is in https://github.com/Xiaofeng-life/SDUIE.

Abstract:
Monocular depth estimation serves as a core technique in endoscopic applications such as 3D reconstruction and localization. However, most existing methods focus primarily on in-domain depth estimation, which limits their robustness and prevents them from delivering impressive cross-domain performance, due to variations in depth distributions, illumination conditions, and texture patterns. In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. To specify, we develop a dual-level Mixture-of-Experts (MoE) adaptation paradigm that effectively tailors Vision Foundation Models to diverse endoscopic procedures, such as laparoscopy and colonoscopy, accounting for the challenges posed by varying environments. Internally, we integrate LoRA and Adapter modules within the MoE architecture, allowing the model to flexibly adapt to the characteristics of input data. Externally, a mixture of domain-specific experts provides customized guidance to enhance the training stability. In addition, we introduce a learnable gradient harmonization mechanism to dynamically balance the optimization between the depth and pose networks, along with a semantic distribution calibration module that strengthens the semantic consistency of depth predictions. Extensive experiments demonstrate that the proposed DAE achieves state-of-the-art performance in both zero-shot and in-domain depth estimation scenarios. The source code is publicly available at https://github.com/ShuweiShao/DAE.

Abstract:
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

Abstract:
Event cameras action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics is of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting H-W-T events alone spatial axis H and W, yet are limited by its translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and propose: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variability of real-world human actions. On three challenging EAR datasets of HARDVS, DailyDVS-200 and THU-EACT-50-CHL, we show +7.0%, +10.7%, and +10.2% Top-1 accuracy gains over existing SMVRL EOR method with surprising 30.1% reduced parameters and 35.7% lower computations, establishing our framework as a novel and powerful EAR paradigm. Code are avaliable in https://github.com/Fineshawray/SMV-EAR.

Abstract:
While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

Abstract:
High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR's higher bit depth, wide color gamut, and elevated luminance range expose distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate HDR-UGC-44K, a large-scale subjective dataset of ~44K videos from 6.5K sources with >1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a gaussian weighted regression reward for fine-grained MOS calibration. Across HDR-UGC-44K and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance. Project page: https://shreshthsaini.github.io/Beynod8Bits

Abstract:
Multi-view crowd tracking estimates each person's tracking trajectories on the ground of the scene. Recent research works mainly rely on CNNs-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.

Abstract:
The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on complex, expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce instruction-guided lesion segmentation (ILS), a medical-domain adaptation of referring image segmentation (RIS) designed to segment diverse lesion types based on simple, user-friendly instructions. Under this task, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from CXR images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we present ROSALIA, a LISA model fine-tuned on the MIMIC-ILS dataset. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding. The dataset and model are available at https://github.com/checkoneee/ROSALIA.

Abstract:
Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.

Abstract:
Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the- art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC creates ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals significant degradation and bias under incongruous contexts. Visual Reinforcement Fine-Tuning of Qwen3-VL-8B-Instruct on 600 ORIC sam- ples improves performance on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The dataset and code are publicly available at https://github.com/ZhaoyangLi-1/ORIC.

Abstract:
Visual environments are inherently hierarchical, as a panoramic view naturally encompasses and organizes multiple perspective views within its field. Capturing this hierarchy is crucial for effective perspective-to-equirectangular (P2E) visual place recognition. In this work, we introduce HypeVPR, a hierarchical embedding framework in hyperbolic space specifically designed to address the challenges of P2E matching. HypeVPR leverages the intrinsic ability of hyperbolic space to represent hierarchical structures, allowing panoramic descriptors to encode both broad contextual information and fine-grained local details. To this end, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Furthermore, HypeVPR's hierarchical organization naturally enables flexible control over the accuracy-efficiency trade-off without additional training, while maintaining robust matching across different image types. This approach enables HypeVPR to achieve competitive performance while significantly accelerating retrieval and reducing database storage requirements. Project page: https://suhan-woo.github.io/HypeVPR/

Abstract:
Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 20.82%, outperforming o1 (17.02%) and GPT-5 (11.37%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation. Data and code are available at https://github.com/xiaomi-research/proactive-mobile.

Abstract:
The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods under perform in open-world scenarios. Key contributing factors include limited dataset diversity, and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available in https://github.com/VitaminCreed/LAVIDA.

Abstract:
Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. The experiments show significant outperformance over existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at https://github.com/zcq15/PFGS360.

Abstract:
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87x speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54x speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code is available at https://github.com/argsss/DPCache.

Abstract:
Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation (C2Aug) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. C2Aug samples instances with consistent bag-level labels across the entire dataset, which simultaneously enhances the diversity of pseudo-bags and mitigates label noise. Experimental results demonstrate that C2Aug consistently outperforms state-of-the-art approaches across multiple evaluation metrics. Our code is publicly available at: https://github.com/weiaicunzai/mixup.

Abstract:
Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminality. To address this, we propose CUE, Concept-aware Multi-label Expansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by (i) extracting instance-level visual cues from zero-shot CLIP and (ii) generating class-level semantic cues with LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustmen (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustmen (LA) loss. Experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. Code is available at: https://github.com/zhangruichi/CUE.

Abstract:
In a rapidly growing field of model training there is a constant practical interest in parameter-efficient fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle (GS) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a spectra restoration transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two GS orthogonal matrices is capable to unite concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available via the link: https://github.com/ControlGenAI/OrthoFuse.

Abstract:
Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, enabling various immersive applications. However, existing large-scale HMR benchmarks largely rely on line-of-sight RGB sensing, causing HMR systems to inherit the limitations of vision-based systems, including sensitivity to occlusion, lighting variation, and privacy concerns. These limitations have motivated growing interest in radio-frequency (RF) mmWave radar as a privacy-preserving and robust modality for human sensing. Despite this promise, current radar datasets remain limited by sparse skeleton annotations, small scale, and simple in-place actions. To address this gap, we introduce M4Human, the largest-scale multimodal benchmark to date for radar-based HMR, featuring 661K frames---9 times larger than the previous largest---with high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC), enabling research across different levels of RF signal granularity. It also includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, covering 20 subjects and 50 diverse actions, including in-place, seated, and free-space sports or rehabilitation movements. We establish benchmarks on RT and RPC modalities, as well as multimodal fusion with RGB-D inputs. Extensive results demonstrate the value of M4Human for radar-based human modeling while revealing persistent challenges under fast and unconstrained motion. Our benchmark is publicly available at \href https://fanjunqiao.github.io/M4Human-site/ https://fanjunqiao.github.io/M4Human-site/ .

Abstract:
Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and pretrained models are available at http://github.com/moolgom/ELiCv1.

Abstract:
Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction. Project URL: https://anindita127.github.io/SceMoS.

Abstract:
Existing anomaly detection methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates.To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains.This process is guided by a "general - specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning.In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and an MoE-in-MoE structure, reducing MoE parameter usage by approximately 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. Code is publicly available at https://github.com/yuanzhao-CVLAB/UniMMAD.

Abstract:
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images.To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity.Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.

Abstract:
Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes--mean expression profiles that capture stable gene-gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression.Code is publicly available at https://github.com/naivete5656/CPNN.

Abstract:
Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), providing a unified framework for perception and decisionmaking. However, their real-world deployment is hindered by significant computational overhead when processing high-resolution, multi-view images. This complexity stems from the massive number of visual tokens, which increases inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in AD. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism that prioritizes semantic and spatial coverage across views, and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios based on camera importance to downstream tasks. Unlike prior methods, Prune2Drive requires no model retraining or access to attention maps, ensuring compatibility with modern efficient attention implementations. Extensive experiments on the DriveLM and DriveLMM-o1 benchmarks demonstrate that Prune2Drive achieves significant speedups and memory savings with minimal performance impact. When retaining only 10% of visual tokens, our method achieves a 6.40x speedup in the prefilling phase and consumes only 13.4% of the original FLOPs, with a mere 3% average performance drop on the DriveLM benchmark. Code is available at: https://github.com/MinhaoXiong/Prune2Drive.git

Abstract:
Recent studies have shown that Reversible Adversarial Examples (RAE) can mislead unauthorized deep neural networks while remaining usable for authorized users, effectively preventing image data leakage. Existing RAE methods rely on reversibly embedding perturbation information into the original adversarial examples to enable restoration. However, this two-stage process often results in RAEs with inferior attack effectiveness and visual quality compared to the original versions. To solve these challenges, we propose a novel end-to-end Invertible Neural Network for Reversible Adversarial Examples Generation (RevINN), which directly generates RAEs in one stage by scrambling the intrinsic frequency information of images. Specifically, our RevINN consists of the Cross-Frequency Modulation Attack (CFMA) module and the High-Frequency Perturbation Enhancement (HFPE) module. CFMA selectively exchanges discriminative information between low- and high-frequency wavelet components to achieve adversariality. To fully alter high-frequency semantics, HFPE innovatively employs a tri-branch structure for fine-grained modulation among high-frequency subbands, enhancing perturbation strength. Finally, the modified components are recomposed into RAEs via the inverse wavelet transform. Our RevINN is optimized with adversarial, perceptual, and invertible losses, and can restore images based on the reversibility of the wavelet operations and network modules. Extensive experiments demonstrate that our RevINN achieves state-of-the-art RAE generation quality. The code is available at: https://github.com/WongJaylen/RevINN.

Abstract:
The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces--such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossHA, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching--balancing high-level efficiency with low-level precision--without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossHA achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA.

Abstract:
Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: (1) existing CG datasets lack systematic descriptions of rendering quality; and (2) existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3.5\mathrm K CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment. The public dataset and code are at: https://github.com/lizhuangzi/R4-CGQA

Abstract:
Sign Language Production (SLP) aims to translate spoken language into sign sequences, where the main challenge lies in generating coherent and natural poses from discrete glosses (G2P). Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal-General Diffusion Model (FGDM), characterized by a pioneering two-stage denoising framework that harmonizes local joint-level dependencies and global coherence. Specifically, in the Focal stage, a novel Adaptive Sign GCN (ASGCN) adaptively models each pose based on contextual correlations, skeletal topology, and semantic conditions, ensuring precise generation of local details. In the General stage, a Transformer-based module refines the entire pose sequence to enhance global coherence and naturalness. Moreover, we introduce a Semantic Consistent Guidance (SCG) mechanism that seamlessly integrates semantic supervision into diffusion training, enforcing tighter alignment between generated pose sequences and their intended gloss semantics. Extensive experiments on PHOENIX14T and USTC-CSL demonstrate that FGDM achieves SOTA performance. Our project page is available at https://yuyiheng-eu.github.io/fgdm/.

Abstract:
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos---they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that reconstructs the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.

Abstract:
Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens--treating 3D data as a foreign language that naturally extends the LLM's vocabulary. Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: https://github.com/snehaputul/SAGE3D.

Abstract:
Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model's spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.

Abstract:
Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) Single Stream Blocks are essential for image generation; and (2) Early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios. The code can be accessed at https://github.com/yangyt46/SplitFlux

Abstract:
Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization. The code will be available at https://github.com/tmtuan1307/Composer.

Abstract:
2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5xcompression over prior non-quantized 2D Gaussian methods and 1.6xover quantized ones, while also delivering 1.6xand 6.5xfaster optimization, respectively, without degrading, and often improving, image fidelity. Code is available at https://github.com/zx-pan/SGI.

Abstract:
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code and data: https://github.com/CVI-SZU/SurgCoT.

Abstract:
Accurate depth estimation in wide field is highly desired in applications of autonomous driving, robot vision and drone controls. Biological compound eyes inspire wide Field of View (FOV) depth estimation, yet their artificial implementations face the challenge of modality misalignment. Specifically, the spherical imaging data do not align well with planar neural networks, diminishing the learning efficiency. Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth estimation, which processes spherical images natively on a HEALPix grid using a spherical neural network. This approach achieves a unified 180deg FOV while avoiding the errors typically introduced by modality conversion. Additionally, we identify a depth-sensitive gradient feature from the overlapping FOVs of adjacent ommatidia. To exploit it, we introduce a spherical Sobel operator called the Spherical Gradient Feature Extractor (SGFE) and a corresponding Spherical Gradient Loss (SGL), which jointly extract gradient features on the HEALPix grid, enabling gradient-aware depth prediction. Extensive benchmark experiments demonstrate that these strategies enable SCE-Depth to substantially reduce depth estimation error compared to fisheye-based baselines, with particularly large improvements in peripheral accuracy. We also demonstrate the generalization capability of SCE-Depth to other wide FOV data modalities, such as fisheye and panoramic imagery. Code is available at https://github.com/ZhuYi2000/SCE-Depth.

Abstract:
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. The code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.

Abstract:
Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the original locations, which are cropped from the fragmented geometry obtained from the geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, which significantly outperforms baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process. Code and data will be publicly available at https://zx-yin.github.io/3dfixer.

Abstract:
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. Yet, the limited capacity of one-step distilled models compromises generative diversity and degrades performance in complex generative tasks, e.g., generating intricate object motions in text-to-video task. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generative diversity in text-to-image generation and slows motion dynamics in video generation, reducing performance to the level of one-step models. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD incorporates two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure accurate training within each subinterval, we derive rigorous mathematical formulations for the objective. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image-20B and Wan2.2-28B. Experiments demonstrate that Phased DMD enhances motion dynamics, improves visual fidelity in video generation, and increases output diversity in image generation. Our code and models are available at https://x-niper.github.io/projects/Phased-DMD/.

Abstract:
Image-goal navigation driven by generative models has recently shown strong potential owing to their ability to perform multi-modal reasoning and stable learning in continuous control spaces. Despite their promise, current methods still face several fundamental limitations. Many rely on pre-built priors and lack explicit mechanisms for trajectory evaluation, restricting generalization and goal alignment in map-free navigation. Moreover, current generative policies often face inefficiency or temporal inconsistency, resulting in temporally unstable motion. The absence of interactive, closed-loop benchmarks further limits fair and reproducible comparison. To address these issues, we propose GeniNav, a generative image-goal navigation framework that couples a VLM-driven latent subgoal imagination module for high-level semantic guidance with Multi-Segment Consistency Flow Matching (MS-CFM) for temporally smooth and dynamically coherent motion generation. A hybrid trajectory evaluation module further integrates semantic alignment and geometric feasibility to assess goal consistency. We also introduce a closed-loop simulation benchmark with a large-scale dataset spanning 176 scenes and 491.6 km for standardized training and evaluation. Extensive experiments in simulation and on real robots demonstrate the effectiveness of our method. Our project page is available at: https://cyq638.github.io/geninav/.

Abstract:
The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 13 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-target concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model. Code and prompts are available at https://github.com/lobsterlulu/EMMA.

Abstract:
Dataset quantization has recently emerged as a promising solution for mitigating the computational and memory challenges of large-scale datasets. However, existing approaches rely on a bin generation step that is computationally expensive and inefficient for large-scale datasets. Moreover, a fixed drop ratio in its patch dropping step fails to adapt to the diverse redundancy levels across samples, which degrades the representational quality of the quantized coreset. To address these limitations, we present Bin-Generation-Free Dataset Quantization (BGFDQ), a fully restructured framework that incorporates a simple yet effective KNN-based neighbor identification and neighbor-aware coreset selection strategy. We theoretically demonstrate that the proposed selection strategy achieves superior sampling efficiency compared to bin-generation-based methods. Additionally, we introduce an adaptive patch dropping strategy to further enhance the quality of the quantized dataset. Extensive experiments on four image classification benchmarks show that BGFDQ consistently outperforms state-of-the-art baselines. In particular, we achieve up to 5% validation accuracy improvement on CIFAR-100. Moreover, our framework successfully scales to datasets containing up to 10^5 same-class samples while existing bin-generation-based approaches fail due to memory constraints. Code is available at https://github.com/MaijieDeng/BGFDQ.

Abstract:
The conventional checkerboard-based calibration for standard cameras faces fundamental limitations when applied to bio-inspired event cameras. Specifically, this stems from two challenges: (i) Events are triggered asynchronously at different timestamps along motion trajectories. If we accumulate them directly on the image plane, it causes temporal misalignment and produces blurred edges. (ii) Checkerboard corners on event cameras show near-zero event occurrence at the corner itself. This hinders reliable corner localization and makes calibration difficult. To address these issues, we present a novel calibration framework that directly detects checkerboard corners from event data without learning-based grayscale image reconstruction. We first mathematically analyze the absence of events at corner points. Based on this fact, we then leverage edge-driven event cues to initialize corner positions. Using the near-zero event occurrence at checkerboard corners, we gradually refine the estimated corner toward low event-density regions, achieving sub-pixel accuracy. Furthermore, we extend the corner detection to fiducial markers such as AprilTag, resulting in reliable detection even under partial visibility or occlusion. Evaluations on self-collected and public data demonstrate reliable checkerboard corner detection and stable camera calibration. Additional information is available at the following link: https://vision3d-lab.github.io/corner2tag/.

Abstract:
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant also firstly achieves over-90% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization. Website: https://gallantloco.github.io/.

Abstract:
Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at https://github.com/HaoyuJi/LaDy.

Abstract:
In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at https://github.com/JianghaiSCU/ZeroIDIR.

Abstract:
While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane-centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data are available at https://github.com/3dv-casia/PlanaReLoc.

Abstract:
Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction. Our code is available at https://github.com/ShumengLI/Duala.

Abstract:
Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique, offering a wide spectrum of information about the anatomy by employing different acquisition modalities. However, in the clinical workflow, it is impractical to collect all relevant modalities due to the scan time and cost constraints. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient solution to enhance data completeness and clinical usability. Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various "any-to-any" imputation tasks as a region-level full-stack code prediction problem. CodeBrain adopts a two-stage pipeline: (1) it learns the compact representation of a complete MRI modality set by encoding it into scalar-quantised codes at the region level, enabling high-fidelity image reconstruction after decoding these codes along with modality-agnostic common features; (2) it trains a projection encoder to predict the full-stack code map from incomplete modalities via a grading-based design for diverse imputation scenarios. Extensive experiments on two public brain MRI datasets, i.e., IXI and BraTS 2023, demonstrate that CodeBrain consistently outperforms state-of-the-art methods, establishing a new benchmark for unified brain MRI imputation and enabling virtual full-stack scanning. Our code will be released at https://github.com/ycwu1997/CodeBrain.

Abstract:
A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels, identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.

Abstract:
Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage--which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective--we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the Real-world noise Dataset SED demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30dB in PSNR , 2.03dB in PSNR and 0.047 in SSIM, respectively. The code will be publicly available at https://github.com/iijjlk/BiEvlight.

Abstract:
Sound source localization task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.

Abstract:
While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines. Code is available at https://github.com/Minato-Zackie/HPA.

Abstract:
Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance to the selection of valid observations, aided by two key regularizers: global shuffling to decorrelate spatial neighborhoods and mix-based regulation for invariance under extreme sparsity. We find that for diverse classification, segmentation, and regression tasks, TESSERA embeddings deliver state-of-the-art accuracy with high label efficiency, often requiring only a small task head and minimal computation. To democratize access, adhere to FAIR principles, and simplify use, we release global, annual, 10m, pixel-wise int8 embeddings together with open weights/code and lightweight adaptation heads, providing practical tooling for large-scale retrieval and inference at planetary scale. All code and data are available at https://github.com/ucam-eo/tessera.

Abstract:
Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization. The project page can be found at: https://terencepp.github.io/gifsplat-project-page/.

Abstract:
Recent advances in diffusion-based language models (DLMs) have shown remarkable potential for de novo protein design. However, enabling controllable protein generation requires integrating diverse biological conditions, such as structure, functions, and chemical interactions, each represented in distinct modalities. Existing approaches often either support a single condition or treat multiple conditions through separate modality-specific encoders. This isolation limits cross-modal interaction, reduces generation quality, and complicates the incorporation of new conditions without retraining or redesigning the backbone. To address these limitations, we introduce MMCP-GEN, a DLM for Multi-Modal, Multi-Condition Protein sequence GENeration. MMCP-GEN establishes a new paradigm for controllable protein generation under complex multimodal constraints. Its core is a modality-composable and extensible conditioning mechanism that fuses heterogeneous biological conditions via learnable queries and modality-indicator heads, enabling disentangled, extensible, and cross-modal condition integration without retraining the backbone. A joint generation-and-scoring objective further aligns sequence recovery with structural fidelity. Empirically, MMCP-GEN achieves state-of-the-art performance across structure-, function-, and ligand-conditioned tasks, improving sequence recovery by up to 5% and outperforming attentive baselines in diverse functional annotation tasks. These results establish MMCP-GEN as a general and high-fidelity framework for controllable protein generation. The source code is publicly available at https://github.com/WanyuGroup.

Abstract:
Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: https://fusionagent.github.io/.

Abstract:
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation. Our code is available at https://github.com/clguo/MaskDiME.

Abstract:
Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.

Abstract:
Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured sequences of actions performed in a specific temporal order. Despite the success of current self-supervised learning (SSL) methods on static images and short clips, these models often overlook the underlying sequential structure of such activities. We expose this lack of procedural awareness with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correspondences. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp in k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp in linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning. Code and models are available at https://github.com/visurg-ai/PL-Stitch.

Abstract:
3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Code is available at https://github.com/Xian-Bei/TALO.

Abstract:
Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used b-VAE-based tokenizers in latent diffusion models, tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling. Our project page: https://github.com/CVL-UESTC/VE-Loss .

Abstract:
Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.

Abstract:
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy, rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction. Project page: https://m3phist0.github.io/MetroGS.

Abstract:
The performance of mean-based prototypical methods in few-shot learning is frequently compromised by noise and hard positives, where entangled feature representations cause prototype instability. We present a novel "Filter-Repair-Expand" framework grounded in Determinantal Point Process (DPP) theory. The method leverages DPP as its core logic, employing it to estimate sample confidence to filter anomalous samples from the initial set, guide a diffusion process via volume-maximization to enhance the sample representation, and subsequently maximize the volume of synergistic disentangled subspaces, constructing robust and diverse prototype subspaces. Experimental results establish new state-of-the-art performance on multiple benchmarks, demonstrating significant gains in few-shot learning robustness. Code is available at https://github.com/24sleeping90/DDSF.

Abstract:
Occlusion presents two critical challenges for person re-identification (Re-ID): feature interference and information loss. While existing efforts have explored occlusion-aware data augmentation and feature reconstruction to mitigate these issues, the former often fails to address erroneous matches caused by similar occlusion patterns and background distractions, whereas the latter typically introduces significant computational overhead. To overcome these limitations, we propose a Consistent Occlusion and Prompt Enhancement (COPE) network. COPE incorporates a Cross-Identity Consistent Occlusion (CICO) module that applies identical occlusions across different identities and encourages feature similarity in the same occluded regions across different identities to reduce occlusion feature interference. A Prompt Background Filling (PBF) module leverages vision-language alignment to generate foreground heatmaps and performs random background filling, enhancing feature robustness under varying backgrounds. Additionally, a lightweight Prompt Similarity Scoring (PSS) module refines retrieval similarity by utilizing prompt-guided reliability scores. Extensive experiments on both occluded and holistic Re-ID benchmarks demonstrate that COPE consistently outperforms existing methods. Notably, it achieves 82.4% Rank-1 accuracy and 76.4% mAP on the challenging Occluded-Duke dataset. The code is available at https://github.com/Cecoming/COPE.

Abstract:
Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis.

Abstract:
Understanding interacting hand-object from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learning. As semantic and motion priors emerge, the STONE phase enforces rigid constraints to consolidate articulated structures and explicitly estimates motion parameters. Experiments on a real-world manipulation dataset show that our method achieves state-of-the-art reconstruction quality and plausible articulation modeling from monocular videos. Code is available at https://github.com/ru1ven/ARGS.

Abstract:
Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at https://github.com/AIGNLAI/GOLD.

Abstract:
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.

Abstract:
Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Project page: https://yuchenzou.github.io/FlowPalm/

Abstract:
While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized with entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.

Abstract:
End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbbbb/FDTA.

Abstract:
Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to 106 nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R2 varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3-4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R2 > 0.99). Code and datasets: https://github.com/ShenShan123/R2G.

Abstract:
Monocular 3D object detection from a single RGB image remains challenging due to two fundamental challenges: the ill-posed nature of 3D localization, where multiple plausible configurations can correspond to the same 2D observation, and unreliable confidence estimation that fails to reflect true localization accuracy. Existing methods predict deterministic 3D boxes that often collapse to implausible mean estimates and rely on absolute confidence scores that are highly sensitive to localization errors. This paper introduces RARE, a unified framework that addresses both challenges through learning to rank and retrieve. RARE formulates confidence estimation as a ranking problem, learning to order detections by their relative quality rather than regressing absolute values. It provides more robust and stable confidence estimates that are less sensitive to localization uncertainty. Building on this improved confidence estimator, RARE learns to construct a query set for each object that predicts multiple diverse and plausible 3D configurations, and retrieves the top-ranked prediction. It explicitly models the multimodal nature of monocular 3D perception and produces more plausible localizations. Extensive experiments demonstrate the effectiveness of RARE. The code is available at https://github.com/HyeonjeongPark37/RARE.

Abstract:
Establishing accurate correspondence between sparse line representations and rich textured imagery remains a formidable challenge. While diffusion features excel in semantic correspondence, they struggle to bridge the fundamental gap between abstract sketches and texture-rich photographs. We identify two critical disparities: spatial domain misalignment from structural abstraction differences, and frequency domain inconsistencies from texture density variations. Based on this analysis, we propose SFA-DIFT, a novel approach that learns spatial-frequency aligned diffusion features for robust cross-modal correspondence. Unlike previous methods focusing solely on spatial alignment, our key innovation performs dual-domain alignment by learning unified clean diffusion features while strategically aggregating low-frequency components in the frequency domain. This comprehensive spatial-frequency alignment enables equitable understanding between sparse abstractions and rich textures. To validate our approach, we extend the existing sketch-photo correspondence dataset (PSC6K) by generating multi-style textured imagery, creating MS-PSC6K, a comprehensive correspondence benchmark. Extensive experiments demonstrate that SFA-DIFT achieves state-of-the-art performance, delivering substantial improvements with an average of 0.87% on PCK@1, 2.20% on PCK@5, and 0.95% on PCK@10 over previous best methods, validating the effectiveness and robustness of our dual-domain alignment approach. Codes are available on https://github.com/Mofr77/SFA-DIFT.

Abstract:
Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.

Abstract:
Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression:1) each spot often contains multiple cells with distinct gene expression profiles;2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code is available at: https://github.com/wangzrk/From-Spots-to-Pixels.

Abstract:
Large Vision Language Models (LVLMs) have achieved remarkable success in a wide range of downstream tasks that require multimodal interaction, but their powerful capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the rich and complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy based on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method consistently improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.

Abstract:
In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster-level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at https://github.com/Jwei-Z/SPEGC-for-MIS.

Abstract:
Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. Code is available at https://github.com/Baty2023/HDW-SR.

Abstract:
Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE, You Only Select Essential Tokens, an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and Diffusion Process Simulator (DiffSim) Module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5X speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.

Abstract:
Human action recognition (HAR) in low-light environments remains challenging due to degraded visibility, illumination variance, and loss of appearance cues. We introduce DarkAct, a large-scale and high-quality RGB-thermal video dataset purpose-built for multimodal action recognition under low illumination. DarkAct contains 12,778 paired RGB-thermal videos covering 27 human actions across diverse viewpoints and scenes, offering a novel and comprehensive benchmark for understanding human actions in dark environments. We conduct extensive experiments on DarkAct, systematically benchmarking unimodal HAR models, multimodal fusion frameworks, and vision-language foundation models. Their limited performance on DarkAct underscores the urgent need for more robust perception systems under adverse illumination. To address this, we propose DarkAct-Net, an RGB-thermal fusion framework that enhances human-centric representation and achieves adaptive cross-modal fusion, enabling robust and fine-grained action recognition across diverse lighting conditions. Dataset and code are available at https://github.com/darkact-creator/DarkAct.git.

Abstract:
Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces--visual evidence, conflicting priors, and residual uncertainty--enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs. Code is released at https://github.com/VioAgnes/HulluEdit.

Abstract:
Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them. Codes: https://github.com/YC-Che/COG

Abstract:
Controllable generation of anatomical structures enables the rational design of synthetic datasets for virtual simulation trials and machine learning workflows. We present an inference-time guidance framework for generating 3D multi-class anatomical segmentations with localized geometric and topological control. During inference, we use cuboidal control domains of varying dimensionality, location, and extent to isolate relevant substructures and compute differentiable penalty functions that steer samples toward target constraints. We enforce geometric features such as size, shape, position, and orientation via voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Finally, we adapt this framework for latent diffusion models, where a neural field decoder can partially extract substructures, enabling efficient measurement and control of anatomical features. This formulation unlocks a rich design space where several constraints can be composed to control complex structures defined over arbitrary dimensions and coordinate systems. We release our code at https://github.com/kkadry/Anatomica.

Abstract:
Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly available at https://github.com/Quartararo0714/HOLO.

Abstract:
Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward leads to minimal learning signals when all candidate responses are failed in challenging scenarios.In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals.These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions.Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Code is available at https://github.com/yyliu01/GroupRevision.

Abstract:
Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D^3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL

Abstract:
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen - whose inference FLOPs grow quadratically with resolution ( 11T FLOPs at 1024x1024) - VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases. Project page: https://github.com/SonyResearch/VibeToken

Abstract:
Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.

Abstract:
Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.

Abstract:
Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error. Our code is available at https://github.com/Snitro/Pointer-CAD.

Abstract:
Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. It also supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Experimental results demonstrate LayerBind's effectiveness, highlighting its potential for creative applications. Project page: https://littlefatshiba.github.io/layerbind-page

Abstract:
Despite recent advances in personalized image generation, existing models consistently fail to produce reliable multi-human scenes, often merging or losing facial identity. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect predicts structured layouts, specifying where each person should appear. The Artist then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model. This is optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images. Project page: https://qualcomm-ai-research.github.io/ar2can/

Abstract:
Combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the combination relationships between features grow exponentially, ultimately the model considers more irrelevant features during the feature modeling process. Introducing dynamics in the receptive fields and weights of the network enables the model to eliminate the irrelevant features combination and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore irrelevant feature combinations and model the potential feature relationships. With two key innovations: 1) Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the highly correlated feature relationships. 2) Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code is released on the website: https://github.com/ShaochenBi/DySNet.

Abstract:
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human synthesis. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok, TED-Talks, and HumanVid, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose an evaluation framework, where we adapt existing metrics for better human- and video-level assessment, as shown by human preference. Equipped with our dataset and metrics, we perform in-depth analyses of state-of-the-art open-source models in controllable image-to-video generation, showing how WYD provides novel insights about their capabilities. We release data and code to drive progress in human video generation.

Abstract:
We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios with a large margin. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.

Abstract:
High-resolution human video matting aims to predict accurate alpha mattes for semi-transparent regions while ensuring temporal consistency across frames. Despite notable progress, current methods still fail to achieve a satisfactory trade-off between quality and efficiency, with limitations in subject stability, temporal modeling, and computational cost. In this paper, we introduce \muMatting, an innovative resolution-agnostic two-stage framework for video matting: (1) coarse matte localization using a portrait-aware masked autoencoder; (2) refinement of critical regions via sparse 3D convolution, augmented by a temporal modulator that injects global spatio-temporal cues for enhanced consistency and contextual awareness. From data perspective, existing research remains limited by the insufficient quality of datasets, including (1) inaccurate alpha fractional values resulting from imperfect annotation, and (2) visual inconsistencies arising from arbitrary foreground-background compositions that lack natural coherence. To address this, we introduce \alphaMatte4K, a large-scale 4K-resolution human video matting dataset, which achieves accurate annotations and physical consistency through physically based rendering (PBR). Extensive experiments show that \muMatting surpasses state-of-the-art methods in accuracy and spatio-temporal consistency, while \alphaMatte4K boosts baseline performance, driving applications in real-world scenarios. The project is open-sourced at https://github.com/kadatec/mu-Matting.

Abstract:
Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Codes and data are available in: https://github.com/AIRC-KETI/OpenFS.

Abstract:
Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. Its decoder incorporates a Hilbert curve-based topology-preserving mechanism to maintain spatial continuity and boundary precision during category layer computation. A lightweight multi-scale fusion module enhances semantic representation, while global context modeling strengthens holistic perception. Experiments on the building defect dataset show that TPSegformer outperforms existing segmentation methods, achieving 80.77% mIoU and 90.22% Acc. On the Dacl10k dataset, it maintains strong generalization, reaching 44.27% mIoU and 60.32% Acc across diverse materials and defect types. The code is available at : https://github.com/mumu-k/TPSegformer

Abstract:
Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.

Abstract:
Multimodal large language models (MLLMs) that "think with images" can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable "code-as-tool" framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.

Abstract:
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects from unseen categories under textual guidance precisely. However, existing methods often employ a unidirectional interaction strategy, where textual prompts guide the matching of visual features. Such a design neglects the bidirectional interaction between visual and language modalities, making the model vulnerable to the semantic gap between image-level textual semantics and pixel-level segmentation cues, which in turn leads to severe semantic confusion in complex camouflaged scenarios. To address this challenge, we propose BaCLIP, a novel bidirectional semantic alignment framework for OVCOS. At its core lies the Mutual Refinement and Enhancement Module (MREM), which establishes bidirectional cross-attention between visual and textual features, enabling mutual semantic calibration to resolve ambiguity and strengthen cross-modal alignment. Moreover, we introduce an Adaptive Prompt that transforms refined textual embeddings into semantic-aware prompts for Segment Anything Model (SAM), enabling direct textual guidance and improving mask precision. Experimental results on the OVCamo benchmark demonstrate that BaCLIP consistently achieves state-of-the-art performance with a compact architecture, effectively mitigating semantic confusion and advancing the understanding of cross-modal camouflage perception. Our code is released at https://github.com/okmaybach/BaCLIP-CVPR2026.

Abstract:
Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing or forecasting intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless across diverse models and methods, including 5.55xspeedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is available at \href https://github.com/BlackMaple1203/SVDCache https://github.com/BlackMaple1203/SVDCache .

Abstract:
Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. The code is available at https://mvggt.github.io/.

Abstract:
Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle the above issue, we proposed a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.

Abstract:
Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior studies have primarily examined unimodal deep networks, the vulnerability of vision-language models (VLMs) remains largely unexplored. In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data. Our work makes two main contributions. First, tailored to the token-generative nature of VLMs, we introduce a suite of token-based and sequence-based model inversion strategies, providing a comprehensive analysis of VLMs' vulnerability under different attack formulations. Second, based on the observation that tokens vary in their visual grounding, and hence their gradients differ in informativeness for image reconstruction, we propose Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW) as a novel MI for VLMs. SMI-AW dynamically reweights each token's loss gradient according to its visual grounding, enabling the optimization to focus on visually informative tokens and more effectively guide the reconstruction of private images. Through extensive experiments and human evaluations on a range of state-of-the-art VLMs across multiple datasets, we show that VLMs are susceptible to training data leakage. Human evaluation of the reconstructed images yields an attack accuracy of 61.21%, underscoring the severity of these privacy risks. Notably, we demonstrate that publicly released VLMs are vulnerable to such attacks. Our study highlights the urgent need for privacy safeguards as VLMs become increasingly deployed in sensitive domains such as healthcare and finance. Our code and models are available at our project page: https://ngoc-nguyen-0.github.io/SMI_AW/

Abstract:
We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ~0.419 \text s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: \href https://seominseok0429.github.io/Upsample-Anything/ https://seominseok0429.github.io/Upsample-Anything/

Abstract:
Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.

Abstract:
Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) in image generation, recent work has adapted reward-based learning to image super-resolution (ISR) by using Image Quality Assessment (IQA) models as rewards. However, existing IQA models typically output only a single global score and are insensitive to local, fine-grained distortions, allowing perceptually undesirable artifacts to obtain spuriously high rewards and leading to reward hacking. To address this issue, we propose FinPercep-RM, a fine-grained perceptual reward model built on an encoder-decoder architecture that predicts both a global quality score and a Perceptual Degradation Map for spatially localizing and quantifying local defects. We further introduce FGR-30k, a dataset containing diverse and subtle distortions produced by real-world super-resolution models, to train the reward model. While FinPercep-RM provides stronger supervision, its increased complexity also makes generator policy learning unstable. We therefore develop a Co-evolutionary Curriculum Learning (CCL) strategy, in which the reward model and the ISR model evolve synchronously: the reward signal progressively increases in complexity, while the ISR model starts with simple global supervision for fast convergence and gradually transitions to fine-grained rewards. This easy-to-hard design stabilizes training and suppresses reward hacking. Extensive experiments across multiple ISR models demonstrate improvements in both global quality and local realism. Code will be available at https://github.com/lyd-2022/FinPercep-RM.

Abstract:
Current zero-shot camouflaged object segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a three-stage framework that progressively refines the segmentation process. The proposed method contains a Feature-driven Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance with lower GPU memory consumption. Code will be released at https://github.com/ynulonger/DSS.git.

Abstract:
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a novel model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. Code, dataset, and pretrained models will be released at https://github.com/MiliLab/GeoBridge.

Abstract:
Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches.

Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.

Abstract:
Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. The dataset and code are available at: https://github.com/JZD151/Real-IISR.

Abstract:
Automatic and accurate echocardiography video segmentation is essential for efficient and repeatable measurements of key clinical functional indicators for the diagnosis of cardiovascular diseases. However, it is an extremely challenging task to obtain high-quality segmentation results throughout the cardiac cycle owing to (1) the inherent speckle noise in echocardiography videos, (2) the complex dynamic motions of cardiac structures, and (3) the scarcity of annotated data. To comprehensively address these challenges, we propose a novel semi-supervised model, EchoForge, which can achieve accurate and real-time echocardiography video segmentation with very limited annotations. EchoForge introduces an Anchor Semantic Awareness (ASA) module that refines ambiguous regions using learnable anchors and propagates structural prototypes across frames to enhance boundary delineation and temporal consistency. Building upon ASA, a Continuous Pseudo-label Reforging (CPR) module progressively integrates and refines pseudo-labels via channel-wise attention, providing robust supervision. Extensive experiments on the CAMUS and EchoNet-Dynamic benchmarks demonstrate that EchoForge outperforms state-of-the-art (SOTA) methods in accuracy while maintaining real-time efficiency. The code is available at https://github.com/YunPeng-Fang/EchoForge.

Abstract:
Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Our method achieves stable SOTA results across all metrics, with https://github.com/CZZZZZZZZZZZZZZZZZ/PRIMED code available.

Abstract:
Engineering diagrams are the core carriers of technical information in industrial contexts, where the pressing demand for their digitization from industrial sectors has driven great advancements in related research domains. However, existing research still suffers from three limitations. Firstly, the detection of symbols, lines, and texts typically involves multiple independent models, resulting in cumbersome workflows. In addition, high-resolution diagrams often impose an excessive computational cost on existing models. Moreover, parsing frameworks solely based on object detection can merely localize component positions, yet fail to capture the topological connection semantics and structured knowledge among components, thus offering limited convenience for industrial applications. To address these issues, we propose an end-to-end information extraction framework based on the Dynamically Tokenized Relation Transformer (DTRT), which can dynamically reduce received image tokens, filter redundant information, and efficiently extract structural knowledge to construct hyper-relational knowledge graphs. We practiced our model on piping and instrumentation diagrams (P&IDs) and electrical diagrams (EDs): the former are widely used in chemical engineering enterprises, while the latter are employed to describe circuit systems. DTRT achieves an R@1000 accuracy of 94.84% on PIDs and R@200 accuracy of 92.52% on EDs with a significantly reduced computational cost. Our code is available at https://github.com/Tianyou-Bai/DTRT-Diagram-Parsing-in-Scene-Graph.

Abstract:
Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g. rolled sleeves, tucked shirt). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis. The dataset is available at https://hunorlaczko.github.io/MV-Fashion.

Abstract:
Gait recognition, as a reliable biometric technology, has seen rapid development in recent years while it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and makes one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research. The source code and dataset are available at https://github.com/BarbieGait/BarbieGait.

Abstract:
Personalized federated learning (PFL) with foundation models has emerged as a promising paradigm enabling clients to adapt to heterogeneous data distributions. However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) Fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) Conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. To address these challenges, we propose Federated Learning via Gradient Purification and Residual Learning (FedPuReL), which preserves balanced knowledge in the global model while enabling unbiased personalization. Specifically, we purify local gradients using zero-shot predictions to maintain a class-balanced global model, and model personalization as residual corrections atop the frozen global model. Extensive experiments demonstrate that FedPuReL consistently outperforms state-of-the-art methods, achieving superior performance on both global and personalized models across diverse long-tailed scenarios. The code is available at https://github.com/shihaohou/FedPuReL.

Abstract:
Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at https://github.com/lif314/RealVLG-R1.

Abstract:
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction in the number of forward passes, limiting its acceleration potential. Motivated by the interchangeability of visual tokens, we explore verification skipping in the SD process for the first time to explicitly cut the number of target model forward passes, thereby reducing inference latency. By analyzing the characteristics of the drafting stage, we observe that verification redundancy and stale feature reusability are key factors to maintain generation quality while improving speed for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate \underline \text v isual AR model via partial \underline \text v erification \underline \text s kipping, which integrates three complementary modules: (1) a verification-free token selector with dynamically truncation, (2) token level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by 2.8x relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm. Our code is available at https://github.com/HyattDD/VVS.

Abstract:
We propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). Unlike standard SGG, robust SGG aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between clean and corrupted images. Existing SGG methods suffer from degraded performance due to shifted visual features (e.g., corruption interference or occlusions). To obtain robust visual features, we leverage layout information, which represents the global structure of an image and is robust to domain shift, to enhance the robustness of SGG methods under corruption. Specifically, we employ Instance Normalization (IN) to alleviate the domain-specific variations and recover the robust structural features (i.e., the positional and semantic relationships among objects) by the proposed Layout-Oriented Restitution. Furthermore, for corrupted images, we introduce a Layout-Embedded Encoder (LEE) that adaptively fuses layout and visual features via a gating mechanism, enhancing the robustness of positional and semantic representations for objects and predicates. Note that our proposed Robo-SGG can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating our proposed Robo-SGG into the state-of-the-art method, we achieve relative improvements of 6.3%, 11.1%, and 8.0% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C benchmark, respectively, and achieve new state-of-the-art performance in the corrupted scene graph generation benchmarks (VG-C and GQA-C). Our source code is available at https://github.com/MICLAB-BUPT/Robo-SGG.

Abstract:
Vision-and-Language Navigation (VLN) requires the agent to navigate based on natural instructions. This task is challenging due to partial observability, which makes it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we propose a cross-interaction mechanism that regularizes the imagined latent-space outputs and integrates them with the navigation expert module via a decoder-free latent interface, thereby enabling ATD to jointly harness the reasoning ability of the LLM and the task-specific knowledge of the navigation model. We conduct extensive experiments across the R2R, REVERIE, and R4R benchmarks, demonstrating that ATD achieves competitive performance with significantly fewer parameters. The code will be available.

Abstract:
The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection with Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All the code and models are availabe at: \href https://github.com/mvrl/SimLBR https://github.com/mvrl/SimLBR .

Abstract:
LiDAR semantic segmentation must remain robust under various sensor and environmental corruptions to be reliable in safety-critical applications. Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts. To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective. Our method is based on geometric inlier discrimination (GeoID), which injects synthetic off-manifold points into the input and trains the model to distinguish geometry-consistent inliers from synthetically displaced outliers, enabling adaptation on unlabeled test data. To further stabilize this process under real corruptions, we introduce bidirectional unreliable point filtering (BiUPF), which uses inlier scores from the source-trained model to filter out unreliable regions on both original and synthetic points, focusing updates on high-confidence samples. Experiments on two large-scale corruption benchmarks, SemanticKITTI-C and nuScenes-C, show that our method consistently outperforms strong test-time adaptation baselines and improves robustness across diverse LiDAR corruptions. Our code is available at https://github.com/hskim617/GeoID.

Abstract:
While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7x inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code: https://github.com/zql-kk/E3Flow.

Abstract:
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence. Project page: https://mingrui-wu.github.io/osi-bench/

Abstract:
Multiple Instance Learning (MIL) has emerged as the dominant paradigm for the analysis of gigapixel-scale Whole Slide Images (WSIs). However, recent methods leveraging guidance from Vision-Language Models often rely on static and universal pathological descriptions. This one-size-fits-all strategy fails to account for the vast morphological heterogeneity within individual WSIs, as its uniform guidance is not tailored to slide-specific visual evidence. To address this, we propose DyKo, a Dynamic Knowledge-guided MIL framework that adapts universal knowledge to slide-specific evidence for few-shot WSI classification. The core of DyKo is the WSI-Adaptive Knowledge Instantiation module (WAKI). WAKI begins by identifying key visual prototypes within a specific WSI's histology. These slide-specific prototypes then serve as queries to retrieve relevant concepts from a pathology knowledge base. This retrieved knowledge is then used to synthesize unique, knowledge-instantiated features for each instance, effectively instantiating tailored guidance at the patch level. To ensure fidelity and prevent semantic drift, we introduce a Structural Consistency loss that enforces alignment between knowledge-instantiated and visual features. Comprehensive experiments on four public real-world cancer datasets demonstrate that DyKo achieves superior performance over state-of-the-art methods in few-shot pathology diagnosis. Code is available at https://github.com/junjianli106/DyKo.

Abstract:
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models, and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. Top judges like GPT-5 and Gemini-2.5-Pro reach 66-75% accuracy, compared to >90% for humans, and outperform the commonly used GPT-4o (59% accuracy). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini-2.5-Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success, and conduct an in-depth analysis that shows key areas to improve the reward models going forward.

Abstract:
Few-shot multi-class anomaly detection is crucial in real industrial settings, where only a few normal samples are available while numerous object types must be inspected. This setting is challenging as defect patterns vary widely across categories while normal samples remain scarce. Existing vision-language model-based approaches typically depend on class-specific anomaly descriptions or auxiliary modules, limiting both scalability and computational efficiency. In this work, we propose AnoPLe, a lightweight multimodal prompt learning framework that removes reliance on anomaly-type textual descriptions and avoids any external modules. AnoPLe employs bidirectional interactions between textual and visual prompts, allowing class semantics and instance-level cues to refine one another and form class-conditioned representations that capture shared normal patterns across categories. To enhance localization, we design a scale-aware prefix trained on both global and local views, enabling the prompts to capture both global context and fine-grained details. In addition, alignment loss propagates local anomaly evidence to global features, strengthening the consistency between pixel- and image-level predictions. Despite its simplicity, AnoPLe achieves strong performance on MVTec-AD, VisA, and Real-IAD under the few-shot multi-class setting, surpassing prior approaches while remaining efficient and free from expert-crafted anomaly descriptions. Moreover, AnoPLe generalizes well to unseen anomalies and extends effectively to the medical domain.

Abstract:
Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the "Memory Trap". This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones (pi-0 and pi-0.5) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts. Project page: https://vla-afi.github.io/.

Abstract:
Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. Notably, during inference, VLM4RSDet only retains the standard object detection architecture, thus avoiding any additional deployment overhead. Furthermore, we introduce a Global-Local Cross-Attention (GLCA) module and a Learnable Hierarchical Prediction Strategy (LHPS) to further improve collaborative training performance. Extensive experiments on five benchmark datasets demonstrate the effectiveness and robustness of our approach. In particular, our method outperforms the state-of-the-art by 7.5% in mAP_ 0.5:0.95 on the VisDrone2019 dataset. Our code is available at \href https://github.com/cszzshi/VLM4RSDet https://github.com/cszzshi/VLM4RSDet .

Abstract:
We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting achieved considerable progress at 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient on the forward and backward pass. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based formulation, enabling fast CUDA-based hardware rasterization. Additionally, we present a precise Gaussian-tile intersection algorithm, enabling faster and efficient rendering. We demonstrate the performance gains by reconstruction and novel view synthesis through different synthetic and real-world datasets. Code is publicly available at: https://github.com/brucee1323/Exact-GS.

Abstract:
The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized.Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization.To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles.Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters.

Abstract:
Video amodal completion (VAC) aims to mimic the human brain's ability to implicitly perceive the complete appearance of partially occluded objects, thereby facilitating recognition and understanding. Existing VAC methods finetune video generation models on custom datasets, yet these datasets often have unrealistic distributions and small scales due to the challenges of collecting real amodal data, and thus limit their performance and generalization. To address this, we propose a training-free framework by utilizing pre-trained image inpainting models for VAC and introducing in-context (IC) learning to enhance inter-frame consistency. However, despite the satisfactory performance of DiT-based IC Learning in generation tasks, task-agnostic global attention tends to utilize irrelevant contextual information, resulting in completion failures when applied to amodal completion. Additionally, IC Learning faces a cold-start problem with the exemplar construction. To this end, we propose a consistent video amodal completion with rectified in-context exemplar guidance. Specifically, we introduce rectified exemplar-guided completion by adjusting the attention weights of exemplar relative to the target images for consistent completion, and adopt a dual-frame calibrated exemplar rectification to tackle the cold-start issue. Quantitative and qualitative experiments demonstrate that our method outperforms SOTAs, especially in terms of generalization and robustness on uncommon data and under severe occlusion. Codes are available at https://github.com/xxyCSCV/IC-Amodal.

Abstract:
Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey. Code is available at https://github.com/ZJUSCL/ADPO.

Abstract:
Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence due to their brain-inspired and energy-efficient properties. Compared to vanilla Spatial-Temporal Back-propagation (STBP) training methods, online training can effectively avoid the risk of GPU memory explosion. However, current online learning frameworks cannot tackle the gradient discrepancy problem between the forward and backward process, merely aiming to optimize the GPU memory, resulting in no performance advantages compared to the STBP-based models in the inference stage. To address the aforementioned challenges, we propose Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which respectively adopt different spiking calculation mechanism in the upper-region and lower-region of the firing threshold. We theoretically point out that our learning framework can effectively separate temporal gradients and address the misalignment problem of surrogate gradients, as well as achieving full-stage optimization towards learning precision, memory footprint and power consumption. Experimental results have demonstrated that our scheme is enable to achieve state-of-the-art performance for multiple evaluation metrics, breaking through the traditional paradigm of SNN online training and deployment. Code is available at https://github.com/hzc1208/HD_LIF.

Abstract:
We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: https://github.com/HeinzJiao/ACPV-Net.

Abstract:
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks.However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments.To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories--Environmental Perception and Scene Understanding--divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others.Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities.To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1 M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts.Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for drone scenarios.The source code is available at https://github.com/linglingxiansen/SpatialSKy .

Abstract:
Implicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM2ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPMF) module for feature enhancement; an Anatomy-Guided Dual-Domain Cross Attention (AG-DDCA) module for joint spatial-frequency modeling; and an Anatomy-Guided Gaussian Parametrizer (AGGP) that leverages gradient-based sparse attention to concentrate Gaussian centers on critical anatomical structures. Extensive experiments on multiple datasets demonstrate that GaussM2ASR surpasses state-of-the-art methods in recovering fine anatomical details. Our source codes have been released at https://github.com/Qiuhai-CV/GaussM2ASR.

Abstract:
Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose F^2HDR, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that F^2HDR achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations. Our code is available at https://github.com/wei1895/F2HDR.

Abstract:
Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales, cell as word, patch as phrase, region as sentence, and WSI as paragraph to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per-patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: \href https://github.com/BasitAlawode/HWSI-MLLM Github .

Abstract:
We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at https://github.com/HKU-HealthAI/DRDD.

Abstract:
Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures. The code is available at https://github.com/MCG-NJU/CoMo.

Abstract:
Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation. The project page is https://matrix0721.github.io/gvu.github.io/.

Abstract:
Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

Abstract:
We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

Abstract:
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models. Our official page is https://github.com/ByteVisionLab/NextFlow.

Abstract:
Recent vision-language-action (VLA) models for multi-task robot manipulation often rely on fixed camera setups and shared visual encoders, which limit their performance under occlusions and during cross-task transfer. To address these challenges, we propose Task-aware Virtual View Exploration (TVVE), a framework that learns to select task-relevant virtual camera viewpoints and dynamically re-render observations from a reconstructed scene representation using the selected viewpoints. To enable efficient view selection, we train an exploration policy in a pseudo-environment. In addition, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder that routes visual features to task-specialized experts, mitigating interference in multi-task learning. To evaluate robustness under distribution shifts, we construct RLBench-OG, an out-of-distribution benchmark with visual perturbations and camera pose variations. Experiments on RLBench and RLBench-OG demonstrate that TVVE achieves higher success rates than strong baselines, while real-robot experiments further confirm its robustness to visual disturbances and unseen instructions. Code and visualizations are available at: https://hcplab-sysu.github.io/TAVP.

Abstract:
Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy aware compression of vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline--achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.

Abstract:
Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and catagory between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they primarily focus solely on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by human decisionmaking process, this paper propose a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.

Abstract:
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain--i.e., precisely calibrated multi-view camera poses--to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure. (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses SG-Free baselines by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD. Source code and pre-trained models are available at the GitHub project page: https://github.com/yangcaoai/VGGT-Det-CVPR2026

Abstract:
Coded Aperture Snapshot Spectral Imaging (CASSI) has emerged as a prominent technique for efficient hyperspectral imaging. However, the tight coupling between physical encoding and computational decoding makes CASSI highly sensitive to slight hardware misalignments, which can significantly degrade reconstruction quality. Existing methods either assume ideal imaging conditions or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanical vibrations, which cause mask shifts. To address these limitations, we propose a Self-supervised Geometry Degradation Estimation (SGDE) framework that explicitly models mask misalignments as an affine transformation and embeds it into the imaging model. SGDE jointly estimates affine parameters and reconstructs the hyperspectral image in a self-supervised manner, eliminating the need for reference targets or device-specific training data. Furthermore, we introduce a multi-kernel estimation strategy to enhance the robustness of degradation estimation under large perturbations. Extensive experiments on both simulated and real-world datasets demonstrate that SGDE achieves superior robustness against mask misalignments. Moreover, the estimated affine parameters can be directly integrated into existing reconstruction algorithms, enabling plug-and-play degradation estimation for practical CASSI systems. The code is available at https://github.com/heyuqiao/SGDE.

Abstract:
Automatic radiology report generation (RRG) aims to translate medical images into diagnostic text, reducing radiologists' workload and standardizing clinical documentation. Nonetheless, existing approaches mainly focus on single-time point analysis and fail to capture temporal disease evolution across longitudinal examinations. While recent longitudinal RRG (LRRG) approaches incorporate historical data, they often combine images from different time points within a single representation space, leading to blurred semantics and inconsistent temporal reasoning. In this work, we propose a Temporal Decoupling with Iterative MutualRefinement Model (TIM), a two-stage framework that explicitly decouples spatial pathology from temporal progression and iteratively refines reports through mutual feedback. Stage I performs temporal-decoupled representation learning, separating temporal evolution patterns from disease-specific features and generating radiology reports for both prior and current studies. Stage II introduces a mutual report refinement mechanism that identifies diagnostic inconsistencies within prior reports and iteratively rectifies both prior and current reports through error-sensitive feedback. Experiments on the Longitudinal-MIMIC dataset demonstrate that TIM surpasses existing single-image and longitudinal baselines, achieving new state-of-the-art performance across both language and clinical metrics. Code is available at https://github.com/yihengd/TIM.

Abstract:
Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

Abstract:
In remote sensing pansharpening, spectrally mixed regions, where the spectral interactions among adjacent land covers lead to highly inconsistent reconstruction patterns, remain the most challenging areas. Due to the complex spatial distribution and heterogeneous spectral characteristics of ground objects, existing methods relying on rigid architectures and physical constraints struggle to learn generalized reconstruction patterns from limited spectral mixing samples, resulting in unstable generalization. To address this limitation, we propose an architecture-agnostic regularization-guided mechanism that adaptively directs the model to focus on learning reliable reconstruction priors for challenging regions. Specifically, we introduce a simple data-level transformation, MixShuffle, which performs random convex combinations across spatial positions and spectral channels to generate training data with richer spatial structures and stronger spectral mixing. In parallel, we propose a hierarchical attention weighting mechanism, a loss-level gradient reallocation strategy at the sample, channel, and pixel levels, enabling the model to emphasize structurally complex regions. Extensive experiments on multiple benchmark datasets (WV3, GF2, QB) and across various network architectures demonstrate the strong generality and effectiveness of the proposed strategies, achieving state-of-the-art performance when integrated into DANet. Our code is available at https://github.com/Geo-Tell/DANet.

Abstract:
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our codes and datasets is released at https://idea-research.github.io/SceneMaker/.

Abstract:
Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation. Code and data are available at https://github.com/AdAutoCut/Autocut

Abstract:
Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted alpha-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc. Our code is accessible at https://github.com/Cyril-gyd/ULF-Loc.

Abstract:
Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C^2SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C^2SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C^2SSM charts a new course for efficient large-scale vision: scan clusters, not pixels. The code is available at https://github.com/5chen/C2SSM.

Abstract:
Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW/

Abstract:
Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that normally comprise less than 0.1% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians---the first of its kind for automated PET reporting---confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET. Project link: https://github.com/DanyalMaq/petar-release/

Abstract:
We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we in this work go one step further beyond previous methods to explicitly model continuous position and orientation deformation of dynamic Gaussians, using an SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.

Abstract:
Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at https://github.com/beotborry/TABLeT.

Abstract:
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying temporal contrastive learning principle as a geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity. Code is available at https://github.com/XiaFire/Clone_Deterministic_Environment.

Abstract:
Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.

Abstract:
Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.

Abstract:
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. Our project page: https://chriswu018.github.io/season/

Abstract:
Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: https://github.com/Uzshah/GraPHFormer

Abstract:
Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes are available at https://github.com/hzz-yy/TF-SSD.

Abstract:
Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, after aggressively retaining with only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing. The code is available at https://github.com/jingyulei/CORE.

Abstract:
Mix-supervised image segmentation aims to effectively leverage heterogeneous annotations. Recent prompt-based advances utilize foundation models such as Segment Anything Model (SAM) to generate pseudo-masks by treating weak labels as spatial prompts. However, these methods rely heavily on sparse spatial priors, leading to suboptimal performance in ambiguous regions and overlooking the potential of unlabeled data due to the absence of promptable cues. In this paper, we propose SAMIX, a novel framework that adapts SAM2 into a semantic-aware pseudo-label generator SA-SAM2 by incorporating a lightweight semantic adapter. Beyond being guided by sparse spatial prompts, SA-SAM2 facilitates dense contextual prompts provided by valuable image-mask reference pairs with shared semantics. Another core component of SAMIX is the Selecting Policy Network (SPNet), which auto-regressively retrieves relevant and complementary reference samples for each query image. Unlike rule-based selections, SPNet is trained via reinforcement learning to actively explore reference combinations that maximize pseudo-label quality. Guided by customized and verifiable rewards associated with mask quality, the selection is directed toward semantically informative and diverse contexts. We conduct extensive experiments on two general datasets (PASCAL VOC 2012 and Cityscapes) and two specific datasets with ambiguous boundaries (camouflaged object detection and image polyp segmentation). Across diverse mix-supervision settings, SAMIX consistently achieves state-of-the-art performance. Codes are available at https://github.com/Huster-Hq/SAMIX.

Abstract:
Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME

Abstract:
Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher's domain. Thus, fast and high-quality generation for novel domains relies on two-stage pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and often degrade quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies DM distillation and adaptation. It couples two training signals: (i) a dual-domain distribution-matching distillation (DMD) objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We evaluate Uni-DAD on two comprehensive benchmarks for few-shot image generation (FSIG) and subject-driven personalization (SDP) using diffusion backbones. It delivers better or comparable quality to state-of-the-art (SoTA) adaptation methods even with less than 4 sampling steps, and often surpasses two-stage pipelines in quality and diversity. Code: https://github.com/yaramohamadi/uni-DAD .

Abstract:
Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE^ 2 A^ 3 ) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at \href https://github.com/ZhaofengSHI/DCPGN https://github.com/ZhaofengSHI/DCPGN .

Abstract:
Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.

Abstract:
Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. Our project is available at https://github.com/JiankuoZhao/STAvatar.git.

Abstract:
Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

Abstract:
Vision-language models like CLIP excel at zero-shot recognition but struggle with continual learning due to two critical issues: (1) severe distribution gap between pretraining captions and post-training class names, and (2) performance mismatch between vision-only and dual-encoder approaches--vision-only methods achieve 20% higher accuracy on fine-grained tasks while CLIP dominates on natural images. We propose Learning from Itself (LfI), which mines CLIP's internal knowledge to address both challenges. First, we generate pseudo-captions by optimizing learnable tokens to minimize CLIP's contrastive loss, creating auxiliary training signals that bridge the pretraining-finetuning distribution gap without external models. Second, we introduce adaptive mutual distillation that dynamically weights knowledge transfer between CLIP's text encoder and a temporary vision classifier based on their instantaneous performance--stronger branches teach more, weaker ones learn more. At inference, only the original CLIP architecture is used, having absorbed discriminative knowledge from both branches. LfI achieves state-of-the-art results across multiple continual learning benchmarks, demonstrating that CLIP can effectively teach itself to continually learn new tasks. Code is available at: https://github.com/jordangong/continual-learning.

Abstract:
Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large-scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi-site access, making them unsuitable for real-world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting.This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. Our framework introduces a structure-aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi-level knowledge distillation strategy that aligns predictions and graph representations between new-site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling.Experiments on multi-site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting. Our code is available at https://github.com/4me808/FORGE.

Abstract:
Segment Anything Model (SAM)-based approaches have shown strong potential for biomedical image segmentation. However, these methods often struggle to preserve spatial consistency in 3D electron microscopy (3D-EM) data and still require extensive manual annotation. We propose Spatial-SAM, a spatially consistent and annotation-efficient framework for high-precision 3D-EM segmentation. It introduces a 3D Signed Distance Field (SDF) memory mechanism that replaces SAM2's memory with SDF representations precomputed by a 3D U-Net, providing richer geometric information and improving spatial consistency. It also combines SAM2's few-shot capability with a dual-track pseudo-label iterative optimization strategy to learn large-scale 3D-EM segmentation from minimal annotations. Experiments show Spatial-SAM significantly outperforms existing semi-supervised methods and performs comparably to state-of-the-art fully supervised approaches on multiple 3D-EM benchmarks, reducing annotation costs while preserving spatial consistency. Code is available at https://github.com/Giluir/Spatial-SAM.

Abstract:
Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network's ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available at https://github.com/qulishen/Flickerformer.

Abstract:
Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at https://github.com/HaoyuJi/SpecScalpel.

Abstract:
Rotation invariance remains a core challenge in point cloud analysis, where existing methods often struggle with structural ambiguities and insufficient global context. Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. Specifically, Ga4DPF introduces a learnable steerable transform that equivariantly lifts point clouds into 4D space, facilitating robust local feature construction and mitigating point-pair ambiguities. Concurrently, we model a dynamic global pose reference using the Bingham distribution, which adaptively estimates a consistent global rotation and enhances global feature discriminability. Extensive experiments on multiple benchmark datasets demonstrate that Ga4DPF achieves state-of-the-art performance with high computational efficiency, offering a new paradigm for rotation-invariant point cloud analysis. The source code is available at: https://github.com/jiaxunguo/ga4dpf.

Abstract:
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

Abstract:
Continual learning (CL) on edge devices requires not only high accuracy but also training-time efficiency to support on-device adaptation under strict memory and computational constraints. While prompt-based continual learning (PCL) is parameter-efficient and achieves competitive accuracy, prior work has focused mainly on accuracy or inference-time performance, often overlooking the memory and computational costs of on-device training. In this paper, we propose CPS-Prompt, a critical patch-aware sparse prompting framework that explicitly targets training-time memory usage and computational cost by integrating critical patch sampling (CPS) for task-aware token reduction and decoupled prompt and classifier training (DPCT) to reduce backpropagation overhead. Experiments on three public benchmarks and real edge hardware show that CPS-Prompt improves peak memory, training time, and energy efficiency by about 1.6x over the balanced CODA-Prompt baseline, while maintaining accuracy within 2% of the state-of-the-art C-Prompt on average and remaining competitive with CODA-Prompt in accuracy. The code is available at https://github.com/laymond1/cpsprompt.

Abstract:
Visual grouping--operationalized through tasks such as instance segmentation, visual grounding, and object detection--enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce SOC, an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen by +24-36%)--achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real data scales and yields even greater improvements when real data is extremely limited (+6.59 AP on 1% COCO data). Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.

Abstract:
Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at \supp https://github.com/AMAP-ML/EMF.

Abstract:
Current millimeter-wave (mmWave) datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, hindering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and LiDAR datasets. EMDUL consists of two independent modules, namely a pseudo-label estimator to annotate unlabeled mmWave data, and a closed-form converter that translates an annotated LiDAR PC to its mmWave counterpart. Expanding the original dataset with both LiDAR-converted and pseudo-labeled mmWave PCs significantly boosts the performance and generalization ability of all the examined HPE models, reducing 15.1% and 18.9% error for in-domain and out-of-domain settings, respectively. Code is available at https://github.com/Shimmer93/EMDUL.

Abstract:
Language-guided 3D segmentation is crucial for linking 3D perception with semantic understanding, yet it remains vulnerable to the sparse and occluded views common in real-world RGB-D data. To overcome this, we present a real-time framework that leverages 3D Gaussian Splatting (3DGS) to build a semantically continuous and differentiable embedding field from partial observations. Our approach integrates two key components: a Dirichlet Process (DP) for the adaptive discovery of novel object categories, and a gradient low-rank mechanism that enhances class separability by reducing feature redundancy. This combination enables robust open-vocabulary segmentation guided directly by text prompts. Extensive experiments on challenging benchmarks demonstrate that our method achieves strong performance, exhibiting superior accuracy, robustness to partial inputs, and a powerful capacity for novel class discovery.Our code and models are available at https://github.com/Tap12345/LangGS.

Abstract:
In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The dataset and code can be obtained from https://github.com/Hanminghao/UniPath.

Abstract:
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. Specifically, the centrifugal token pruning paradigm is introduced for near-to-far selection. Moreover, the Buffering for Spatial Sparsity (BSS) criterion delays the selection of spatially distant tokens, thereby prioritizing the preservation of fine-grained object details. To further improve efficiency, a parallel greedy strategy is adopted for token selection. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms other competitive methods across five VLMs under various pruning ratios, while delivering an end-to-end inference speedup. The code is available at https://github.com/Casey-bit/VLMPruner.

Abstract:
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as "information horizon", beyond which the visual tokens become redundant;(2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA);(3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5).Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code is available at https://github.com/YahongWang1/Information-Horizon.

Abstract:
Robotic manipulation in complex 3D environments requires unifying spatial reasoning with intuitive visual perception, which is a capability that current Vision-Language-Action paradigms address separately. While 3D VLAs excel in geometric and physical reasoning, they lack intuitive, image-level understanding and dense visual semantics; conversely, 2D VLAs (even with depth image) provide rich visual intuition and semantic continuity but miss explicit spatial global grounding. We introduce DiffRender-VLA, a differentiable rendering-based framework that bridges 3D and 2D Vision-Language-Action models through gradient-consistent visual mediation. It generates differentiable images by localizing the next end-effector target with a world-aligned cube marker, differentiably structuring surrounding geometry whose color encodes spatial relations to the marker, and rendering adaptive viewpoints optimized to reveal the target-environment spatial relationships. These differentiable images serve as visual bridges, embedding spatial semantics while allowing gradients from 2D VLAs to backpropagate into 3D representations, thereby coupling geometric reasoning with visual perception. This closed differentiable loop unifies reasoning and perception, substantially improving performance under occlusion, clutter, and complex spatial manipulation tasks, achieving average improvements of +12.1% over state-of-the-art methods. Codes are available at https://github.com/zyl123456aB/DIFFVLA.

Abstract:
Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications. The UAV-CB dataset will be publicly available at: https://github.com/hye999/UAV-CB.

Abstract:
Post-training of large-scale Vision-Language Models (VLMs) reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

Abstract:
Federated Learning (FL) faces a fundamental dilemma: existing defenses against gradient leakage attacks (GLAs) invariably sacrifice model performance for privacy protection through noise injection or gradient clip. We introduce Federated Learning with Momentum-Based Orthogonal Projection (FedMOP), a method that simultaneously achieves strong privacy guarantees and superior model performance. The key insight is to leverage initialization-based offset mechanisms that operate on orthogonal dimensions. For performance enhancement, FedMOP employs gradient orthogonal projection to counteract local drift, effectively offsetting each client's round-training initial model using global statistical context. For privacy protection, it introduces momentum-based trajectory offset hiding, which makes the offset vector inherently unrecoverable by constructing information barriers through private initialization and randomized evolution. These two mechanisms are synergistic rather than antagonistic. Theoretically, we prove convergence preservation and characterize the computationally infeasible inverse problem faced by attackers. Extensive experiments on CIFAR-10/100 and Tiny-ImageNet demonstrate that FedMOP not only defends effectively against state-of-the-art GLAs but also surpasses existing FL methods in both accuracy and convergence speed, validating its ability to jointly enhance privacy and performance in FL. Codes are available at https://github.com/zyl123456aB/FedMOP.

Abstract:
Reliable uncertainty estimation for 3D object detection is critical for deploying safe autonomous systems, yet modern detectors remain poorly calibrated, especially under distribution shifts. Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios. In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors. These queries form a compact, location and class-aware feature, ideal for density estimation, allowing our approach to adjust model confidences in distribution-shift scenarios. By fitting a density estimator on these query features, our approach jointly recalibrates both classification and bounding box regression uncertainties. On both a multi-view camera and LiDAR-based detector, our approach consistently outperforms standard post-hoc methods in both in-distribution and distribution-shifted scenarios. Code: https://tillbeemelmanns.github.io/query2uncertainty

Abstract:
Thermal infrared (TIR) imaging enables robust perception in adverse conditions. However, it often suffers from complex degradations (e.g., fixed-pattern noise and low-resolution) due to sensor limitations and environmental dynamics. Existing methods, whether traditional or learning-based, easily fail under composite and varying degradation. Pre-trained generative models showcase powerful capabilities for alleviating degradations but lack effective tools to adapt visible generative priors to TIR-specific characteristics. To overcome these challenges, we propose a Hierarchical Degradation Representation and Adaptation (HiDRA) framework to decompose the enhancement procedure into degradation representation estimation and generative model fine-tuning. The degradation representation estimation aims to disentangle TIR degradation patterns, which then guide the parameter adaptation for thermal image enhancement. Additionally, we introduce a hierarchical adaptation solution that aggregates learning across varying degradation levels, further improving robustness under various scenarios. Experiments across diverse types and degrees demonstrate the robustness of our approach and further validate its effectiveness on downstream tasks. Code is available at https://github.com/Zihang-Chen/HiDRA.

Abstract:
Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector's ability to maintain strong object-focused representations, causing high-confidence activations over background clutter and unreliable pseudo-labels from the detection head. While prior works mainly refine the pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter Suppression and Noise Robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (\underline S patial \underline P rior-\underline A ware \underline R egularization) leverages the generalization strength of vision foundation models to regularize the detector's feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (\underline I mbalance-aware Noise \underline R obust \underline P seudo-\underline L abeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks. Our code is available in https://github.com/Sairam13001/FALCON-SFOD.

Abstract:
Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Futhermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and 12.0-point improvement in mAP for highlight detection. Code is available at the \href https://github.com/Florence365/GroundVTS GroundVTS code repository .

Abstract:
Fully Few-shot Class-incremental Audio Classification (FFCAC) is challenging since the training samples are limited both in the incremental sessions and in the base session. Existing few-shot learning methods suffer from catastrophic forgetting and overfitting when applied to FFCAC.Pre-trained Audio-Language Models (ALMs) have achieved success in many audio learning tasks. However, we find that it is impractical to directly use ALM on FFCAC, since misalignment between text and audio causes even severe catastrophic forgetting and overfitting. We propose a Task-Adaptive Prototype Evolution (TAPE) framework to facilitate ALMs to tackle the challenges of FFCAC, which consists of two key components:(1) A Task-Adapter that isolates audio features in a metric space to mitigate catastrophic forgetting while preserving knowledge across sessions, and (2) A Prototype Evolution mechanism that dynamically refines class prototypes using query samples during inference, thereby enabling adaptive learning and reducing overfitting.To the best of our knowledge, we are the first to use ALMs on the FFCAC task. We conduct experiments on three audio datasets: NSynth-100 (instrument recognition), FSC-89 (event detection), and LBS-100 (voice recognition). The experimental results show that our proposed approach TAPE significantly surpasses the baselines. Specifically, it averagely improves upon the second best from 54.93% to 82.76% in terms of Average Accuracy (AA \uparrow), and from 28.74% to 12.56% in terms of Performance Dropping rate (PD \downarrow). The code for our work is available at https://github.com/YvoGao/TAPE.

Abstract:
The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical reasoning into dynamic 3D representations, enabling accurate and consistent prediction of the future. Experiments show that ParticleGS achieves state-of-the-art performance in extrapolation while maintaining rendering quality.

Abstract:
Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions (1xk followed by kx1), reducing per-channel computational cost from O(k2) to O(2k). PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at https://github.com/fhjdqaq/PFGNet.

Abstract:
As imaging scenarios diversify rapidly, Image Quality Assessment (IQA) faces a key challenge: how to effectively transfer perceptual knowledge from existing annotated datasets to ensure reliable quality prediction in new scenarios. However, current IQA models struggle to generalize: direct transfer often leads to severe performance degradation, while multi-dataset joint training rarely yields stable gains and can even harm target performance. In this work, we identify perceptual preference structure mismatch as an important and underexplored factor behind transfer failure in IQA. Models trained on different source datasets may rely on different perceptual cues, leading to discrepancies in the conditional distribution P(Y|X) that hinder effective transfer. To address this, we propose Perceptual Preference Representation (PPR), which characterizes dataset-specific perceptual preference structures by analyzing correlations between visual features and quality scores. Based on PPR, we define Perceptual Preference Consistency (PPC), a training-free and interpretable measure of relative preference compatibility across datasets. Building on this, we develop Preference-Structure-Aligned Transfer (PreSTA), which selects source samples better aligned with the target domain in terms of perceptual preference structure. Extensive experiments show that PreSTA achieves superior performance across cross-domain, within-domain, and targeted joint transfer settings while using only a small fraction of source data. These results highlight that aligning perceptual preference structures, rather than simply increasing dataset size, is crucial for effective and data-efficient knowledge transfer in IQA. Code will be publicly available at https://github.com/Li-aobo/PreSTA.

Abstract:
Foundational vision models have become the de facto standard for many vision tasks due to their strong performance. However, they are notoriously opaque and remain hard to interpret. We present ALOE (ALign Once to Explain), a one-time, label-free feature alignment based approach that efficiently converts foundational vision models into inherently interpretable B-cos variants. Once aligned, the B-cos backbone is used as a drop-in replacement across several downstream tasks--amortizing the cost of interpretability. ALOE is robust across pre-training paradigms (supervised, self-supervised, vision-language) and is 100-1000x more data-efficient than training from scratch. On classification, it outperforms fully-supervised B-cos models (e.g., +9.2 p.p. top-1 on ImageNet for ViT-B/16), retains strong linear probing, k-NN, and zero-shot transfer performance competitive with foundational backbones (DINOv3, SigLIP2) across diverse datasets. It also preserves spatially structured features useful for dense prediction, while yielding well-localized and highly human-interpretable explanations by design. Code link https://github.com/rmaser/aloe.

Abstract:
Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at https://github.com/guoqi2002/OpenDPR.

Abstract:
Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remains largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs. Our code is available on https://github.com/LeungTsang/NeighborMAE.

Abstract:
Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results. Codes are available at https://github.com/Zhangshuojackpot/SAMN.

Abstract:
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI.

Abstract:
The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) --- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.

Abstract:
3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project.

Abstract:
Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations. Our code is available at https://github.com/LiChenyang0820/ILA4VLN.

Abstract:
In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/Zhangyr2022/UniGenDet.

Abstract:
Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, increasing training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. Code is available at: https://github.com/ariansabaghi/TriLite

Abstract:
Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1% of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at https://github.com/falonss703/VideoThinker.

Abstract:
Object motion blur in static scenes is spatially heterogeneous, differing from conventional deblurring problems yet frequently occurring in real handheld capture scenarios. Existing datasets either rely on costly beam-splitting capture with residual misalignment or employ synthetic blur that fails to model the continuous photon-integration process during exposure. To overcome these limitations, we introduce OMoBlur, a physically grounded dataset that emulates realistic exposure integration via programmable sensor control, ensuring close alignment between synthetic and real blur distributions. OMoBlur provides 20,000 blur-sharp-mask pairs covering diverse object motion types. Leveraging this dataset, we further propose OMDNet, an object-motion-aware deblurring network that integrates a Motion-Appearance Extract Block, a Flow-Guided Gate Predictor, and an Adaptive Gated Fusion mechanism. This design enables the network to selectively restore blurred regions while preserving static backgrounds, without requiring pixel-accurate mask annotations. Extensive experiments demonstrate that OMoBlur's physically faithful data collection and large-scale diversity significantly enhance the network's generalization to real-world motion blur, establishing OMoBlur and OMDNet as a robust benchmark and practical solution for local motion deblurring. The dataset and code is publicly released at https://yudingchuan.github.io/OMoBlur_homepage/.

Abstract:
Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, although recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes an image generator for latent regression. Adapting a denoising framework with task encoding, task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient N-to-one attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.

Abstract:
Existing graph models predominantly utilize self-attention mechanisms to model feature correlations between the joints of each sample, which not only neglects dynamic relation dependencies in temporal dimension but also leads to redundant computation and difficulty in establishing a unified framework for joint relation representation. To address these problems, this paper develops a Mamba-based graph convolution network (Gamba) with dynamic graph topology learning. In order to capture local motion patterns, a node classification module has been developed to categorize motion joints into distinct types. To the best of our knowledge, this is the first work to assign motion joints with label information to facilitate correlation learning. To capture the underlying relation of the joints of different categories, the state space model is introduced to process enhanced temporal features, aiming to learn dynamic adjacency matrices for long-range dependencies of the joints across different categories. The proposed framework not only facilitates an adaptive focus on spatio-temporal feature modeling but also has less computational complexity than traditional self-attention-based approaches. Extensive experiments on the public NTU RGB+D 60/120 and NW-UCLA benchmark datasets demonstrate the superiority of Gamba over state-of-the-art methods in recognition accuracy. The source code of Gamba is available at https://github.com/RCEricZhou/Gamba.

Abstract:
As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals.Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability. Our code will be available at https://github.com/suweijian1996/DPOFusion

Abstract:
Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV's heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results that Bearing-UAV yields lower localization errors than previous matching/retrieval paradigms across diverse terrains. Our code is publicly available at https://github.com/liukejia121/bearinguav.

Abstract:
The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. We introduce NFH-SEM, a neural field-based hybrid framework that reconstructs high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 mm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability. Our code and real-world dataset are available at https://github.com/zju3dv/NFH-SEM.

Abstract:
Generalized Category Discovery (GCD) leverages labeled data to categorize unlabeled samples from known or unknown classes. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. Experiments show that EAGC consistently boosts existing methods and establishes new state-of-the-art results. Code is available at https://haiyangzheng.github.io/EAGC.

Abstract:
Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.

Abstract:
The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at https://github.com/kxzhang-lab/DynUAV.

Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality--either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments show the superior performance of WISER across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Code is released at https://github.com/Physicsmile/WISER.

Abstract:
The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual documents understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants consistently attain top-three average performance rankings, with average performance enhancements of 9.9-11.5% over the base models.

Abstract:
Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released upon acceptance.

Abstract:
A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models has been released in https://github.com/showlab/Adv-GRPO.

Abstract:
Joint Energy-based Models (JEMs) are well known for their ability to unify classification and generation within a single framework. Despite their promising generative and discriminative performance, their robustness remains far inferior to adversarial training (AT), which, conversely, achieves strong robustness but sacrifices clean accuracy and lacks generative ability. This inherent trilemma--balancing classification accuracy, robustness, and generative capability--raises a fundamental question: Can a single model achieve all three simultaneously? To answer this, we conduct a systematic energy landscape analysis of clean, adversarial, and generated samples across various JEM and AT variants. We observe that AT reduces the energy gap between clean and adversarial samples, while JEMs narrow the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might bridge their performance disparities. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), a unified generative-discriminative-robust framework that maximizes the joint probability of clean and adversarial distribution. EB-JDAT introduces a novel min-max energy optimization to explicitly aligning energies across clean, adversarial, and generated samples. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet subsets demonstrate that EB-JDAT achieves state-of-the-art robustness while maintaining near-original accuracy and competitive generation quality of JEMs, effectively achieving a new trade-off frontier between accuracy, robustness, and generation. The code is released at https://github.com/yujkc/EB-JDAT.

Abstract:
The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluation results of existing agents demonstrate their notable unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code and benchmark: https://github.com/ZrW00/StaR.

Abstract:
In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. All dataset and code will be made public upon acceptance.

Abstract:
Smart glass is emerging as an useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios--industrial maintenance, sports, and emergency rescue--designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. While performance gain has appeared in tracking-based approach, implying using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/

Abstract:
Long video understanding is a key challenge that plagues the advancement of Multimodal Large language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and proposed a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have input upper-limit. Concretely, FlexMem first consider the visual KV caches as the memory sources, and realize the effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve obvious improvements than existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g. , GPT-4o and Gemini-1.5 Pro. Our code is released at: https://github.com/city1517/FlexMem.

Abstract:
We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference.Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: Attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

Abstract:
Visual place recognition (VPR) faces critical challenges in handling extreme environmental variations while meeting the computational constraints of practical applications. Current methods predominantly address these challenges by either scaling up model capacity or employing computationally intensive reranking stages, creating a significant efficiency bottleneck. To overcome this limitation, we propose EfficientVPR, a lightweight one-stage framework that achieves speed-accuracy trade-offs through two key innovations: i) a scene-aware visual prompt tuning method which adapts pretrained features with less parameters while dynamically adjusting to sample-specific characteristics, and ii) an instance-dependent key local feature enhancement module that further reinforces discriminative regions. Comprehensive evaluations on Pitts250k, MSLS, Eynsham, AmsterTime and SVOX demonstrate that our method establishes a new SOTA for DINOv2-small models by outperforming all same-scale competitors, and delivers a 73x speedup with 60% lower-dimensional features while maintaining competitive (within 2.5% average R@1 gap) against the SOTA DINOv2-large-based two-stage method. Our code will be available at https://github.com/WiniTang/EfficientVPR.

Abstract:
Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.

Abstract:
Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.

Abstract:
Lane detection is a crucial task in autonomous driving, which is conducive to ensuring the safe operation of vehicles. However, current datasets like CULane and TuSimple have relatively limited data under extreme weather conditions, such as rain, snow and fog, which makes detection models unreliable in extreme conditions, potentially leading to serious safety-critical failures on the road. In this direction, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions, without the need for re-annotation. Based on our framework, we further propose a benchmark that includes adverse weather and lighting conditions, with 30,000 images. Experiment results demonstrate that our method constantly and significantly improves the performance of all the related lane detection networks. Taking the state-of-the-art CLRNet as an example, the overall mF1 on our benchmark increases by 20.87%. The F1@50 for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75%, 8.63%, 38.8%, 14.96%, 26.84%, 21.5%, and 12.04%, respectively. The Code and dataset are available at https://github.com/zdc233/HG-Lane.

Abstract:
Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark are available at https://github.com/lian700/SoliReward.

Abstract:
Precise 3D medical image segmentation is a clinical cornerstone for diagnosis, therapy planning, and longitudinal monitoring. However, routine acquisition with anisotropic voxel spacing and heterogeneous reconstruction induces downsampling aliasing and cross-scale misalignment that blur boundaries, fragment topology, and undermine reliability. Existing U-shaped CNN or Transformer designs neither control alias injection at decimation nor explicitly align high-resolution evidence before decoder fusion, leading to unstable interfaces under device and protocol variability. We introduce the Coset-fibRated micrO-local co-attention Network (CROWn), a general segmentation framework that couples sampling theory with representation learning to jointly suppress aliasing and calibrate cross-scale fusion. CROWn comprises two complementary components. The Microlocal Polyphase Co-Attentive Decimator (\muPCAD) performs axis-aware polyphase analysis with pooled-subband co-attention and explicit anti-alias low-pass, routing boundary-relevant high-frequency evidence while attenuating spurious phase components during downsampling. The Octaphase Coset Fibration (OCF) anti-aliases high-resolution skips, restructures them via 3D space-to-depth into cosets, and applies phase attention with edge-gated modulation to deliver compact, phase-aligned, boundary-aware features to the decoder. Extensive evaluations across 15 publicly available datasets spanning CT, MRI, and OCT demonstrate CROWn's state-of-the-art performance against 17 recent leading methods, improves overlap and topological consistency, consistently reduces boundary errors, while maintaining controlled training and inference cost. The code is publicly available at https://github.com/IMOP-lab/CROWn.

Abstract:
Accurate and interpretable medical image segmentation remains a major challenge, as existing deep learning models primarily optimize pixel-level accuracy while overlooking positional reasoning--an essential component for automated report generation and clinical interpretability. We introduce CG-Reasoner, a novel centroid-guided cross-modal framework that jointly performs medical image segmentation and positional reasoning. CG-Reasoner integrates a multimodal large language model (LLM), a newly designed light-weight encoder-decoder architecture, and a Text2Centroid module that predicts lesion centroids from reasoning embeddings--enabling the model to produce both accurate segmentation masks and spatially coherent, clinically meaningful reasoning explanations. Furthermore, we propose PRScore (Positional-Reasoning Score), a robust evaluation metric that jointly measures the spatial and semantic alignment between generated reasoning text and segmentation masks. Experiments on six medical datasets across different imaging modalities demonstrate that CG-Reasoner achieves state-of-the-art performance, offering precise segmentation, spatially coherent reasoning, and clinically interpretable visual-textual explanations within a unified framework. The source code is available at https://github.com/lpmm2025/CG-Reasoner.

Abstract:
Although depth-assisted scene flow estimation has advanced rapidly, mainstream dense frameworks (e.g., RAFT-3D) still rely primarily on 2D feature correlations to optimize 3D motion fields, which hinders their ability to exploit 3D structural priors effectively and consequently limits robustness in complex scenes. We present SEA-Flow3D, a simple, efficient, and accurate framework for dense scene flow estimation. At its core lies a Spatial Vector Sampling (SVS) module that jointly samples 3D coordinates and correlation volumes within the local neighborhood of matched points, producing a direction-aware correlation representation with explicit spatial vectors and providing strong geometric guidance for subsequent optimization. Following the simplicity-and-efficiency principle, SEA-Flow3D adopts a RAFT-style multi-scale recurrent refinement architecture, integrating an RNN-based optimizer with context-guided upsampling to achieve higher accuracy with fewer iterations. Extensive experiments on KITTI and Sintel demonstrate that SEA-Flow3D achieves state-of-the-art performance while maintaining remarkable efficiency and a lightweight design. Code page: https://github.com/HanLingsgjk/SEAFLOW3D.

Abstract:
Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, thus restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER is composed of three key components: First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Then, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of ECG representations.Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2%) and strong cross-domain generalization (AUC: 83.1%), outperforming existing uni-modal and multi-modal baselines methods.The source code is available at https://github.com/zhouxw12345/TAMER.

Abstract:
Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp in Jaccard on AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp in mAP on CholecT50), surgical tool presence detection (+5.3pp and +10.2pp in mAP on Cholec80 and GraSP), and surgical semantic segmentation (+10.3pp in mDice on CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide. Dataset, code, and models are publicly available at https://github.com/visurg-ai/LEMON.

Abstract:
Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work. Code can be found here https://github.com/eth-siplab/UltraDiffusionPoser.

Abstract:
An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks. Our dataset and code will be released at https://wangsen99.github.io/papers/lmee/

Abstract:
Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.

Abstract:
We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations. We obtain the dataset via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that each individual component of our framework, i.e., the FeatureScript representation, the extended operation set, and representation-aligned textual descriptions, significantly improves performance. Our framework substantially broadens the complexity and realism achievable in generative CAD. The CADFS framework and the new dataset are available at https://voyleg.github.io/cadfs/.

Abstract:
This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales that hinder downstream applications. To this end, we aim to build up a framework to solve scale uncertainty and transfer relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text descriptions and images as inputs and estimates two rescale maps to transfer relative depth to metric depth at the pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop dual-level scale-oriented contrastive learning to utilize depth distribution as guidance to enforce the model learning about intrinsic cues consistent with the scale distribution. TR2M only exploits a small number of trainable parameters to train on datasets in various domains and experiments not only demonstrate TR2M's great performance in seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential in pixel-wise transferring relative depth to metric depth with language assistance instead of large-size metric depth models with large amounts of training data. Code is available at: https://github.com/BeileiCui/TR2M.

Abstract:
Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model. Here are our project homepage: https://kunkun0w0.github.io/project/LLaMo/

Abstract:
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction--all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R sets a new state of the art across multiple benchmarks, including in-domain datasets such as RE10K and ScanNet, as well as the out-of-domain dataset Mip-NeRF360. This work represents a new paradigm toward generalizable and unified 3D scene reconstruction and understanding.

Abstract:
Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model's adaptability to illumination changes when utilizing multimodal information. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature fusion between RGB and non-RGB modalities according to varying lighting intensities, thereby generating illumination-consistent grasping representations. Experiments on the self-built garment grasping dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baselines under diverse illumination conditions. The code is available at https://github.com/Zhonghaifeng6/GraspALL

Abstract:
While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy.We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms the existing nine state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential. Code and data are available at https://github.com/wenwenmin/SpaHGC.

Abstract:
Accurate cancer risk assessment is critical for personalized treatment planning. While multimodal models that integrate histopathology with complementary data modalities (e.g., genomics, or clinical reports) exhibit superior prognostic capability, they typically assume full data availability, an unrealistic expectation in real-world clinical settings. In contrast, histopathology slides are routinely collected, universally accessible, and information-rich, making them a practical anchor for robust survival prediction. In this study, we propose a novel framework that leverages histopathology as a basis for outcome prediction, while using other data modalities when training the models. Extensive experiments across eight cancer types and scenarios, including various data modalities, demonstrate that our model outperforms all baselines, with up to 8% gains over methods that solely use histopathology at training time, and a 1.4% gap compared to models that utilize all data modalities. Our model also stratifies patients into meaningful risk groups in 67% of risk stratification scenarios (vs. 50% for best SOTA), generalizes well under varying modality missingness, and matches the best SOTA even with 40% higher rate of missing data during training. It also preserves semantic alignment in zero-shot settings. These results highlight the practical utility and robustness of our approach for real-world cancer risk prediction in resource-limited or modality-incomplete settings. Code is available at: https://github.com/pazadimo/fca-robust-risk-estimation

Abstract:
4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects detection while maintaining flexibility and efficiency. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its robustness and SOTA performance. Source code is available at https://github.com/shawnnnkb/RaGS.

Abstract:
Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A 2D pixel-wise RGB CRF model is introduced to align the NeRF rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event CRF model is also designed to bridge the gap between physical scene dynamics and event sensor output. The two models are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method achieves state-of-the-art HDR and deblurring novel view synthesis results with single-exposure blurry LDR images and corresponding events. Our code and datasets are publicly available at https://github.com/iCVTEAM/See-NeRF.

Abstract:
Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address the challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-quality corrected labels with minimal manual intervention. OMAF first employs a prior-regularized self-alignment method to produce high-confidence, object-level offset pseudo-labels, which are then used to train an instance-level offset regression model for label refinement. Experimental results on the challenging datasets demonstrate that OMAF effectively corrects misalignments and consistently boosts the mIoU of all baseline models by up to 40.6%. This work provides a practical and cost-effective solution for large-scale dataset construction and domain adaptation. Our code and dataset are publicly available at https://github.com/dayunyan/OMAF-Building-Alignment.

Abstract:
Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural network-based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios. Code: https://github.com/YShi-cool/DRFusion.

Abstract:
Generating high-fidelity audio that is both semantically meaningful and temporally synchronized with silent videos remains a challenging problem in video-to-audio generation. Existing approaches often fail to capture fine-grained temporal correspondence between visual events and audio dynamics, leading to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a Video-Aligned Sound generation framework that unifies flow-matching diffusion and preference-guided alignment. VisioSonic introduces a multimodal conditioning module that jointly leverages video frames and textual cues to provide semantic and frame-level temporal guidance. A co-attention diffusion transformer efficiently fuses visual and audio representations, enabling content-aware sound synthesis with minimal computation costs. To further enhance alignment beyond supervised training, we introduce Semantic-Temporal Alignment Ranked Direct Preference Optimization (STAR-DPO), a novel preference-learning paradigm that automatically generates audio candidates, ranks them based on both semantic and temporal alignment, and subsequently fine-tunes the diffusion model using the derived preference pairs. Extensive experiments on various benchmarks demonstrate that VisioSonic achieves state-of-the-art audio-video synchronization and audio fidelity while using the fewest trainable parameters among competing approaches. Project page: https://kaiw7.github.io/VisioSonic/

Abstract:
We propose JUMP-Hand, a novel multi-view 3D hand reconstruction method that explicitly models probabilistic joint-wise uncertainty as a gating mechanism for multi-view fusion. Existing approaches usually rely on naive pooling or implicit attention, overlooking that each hand joint exhibits varying visibility and reliability across views. Such indiscriminate aggregation of noisy or unreliable information inevitably degrades overall performance. For instance, a joint severely occluded in one view might be clearly visible in another. To address this, JUMP-Hand draws inspiration from the Mixture of Experts (MoE) paradigm, treating each 2D view as a specialized expert. The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling, serving as an explicit gating signal to route experts' partial yet complementary clues for each joint in a coarse-to-fine reconstruction paradigm. In this design, uncertainty not only guides the uncertainty-aware triangulation for reliable 3D hand initialization during the coarse stage, but also acts as a gating signal during the refinement stage to adaptively aggregate multi-scale features from different view experts on a joint-wise basis, enabling robust 3D hand reconstruction. Extensive experiments on DexYCB-MV, HO3D-MV, and OakInk-MV demonstrate that our method consistently achieves state-of-the-art results, validating the effectiveness of the proposed method with joint-wise uncertainty gating for reliable 3D hand reconstruction. Code is available at https://github.com/HaohongKuang/JUMP-Hand.

Abstract:
Facial expression recognition (FER) in the wild is challenged by co-occurring visual perturbations (e.g., occlusions, pose variations) and label noise. Existing methods often address these issues in isolation, failing to handle their compound effects effectively. To this end, we propose D^3FER (Dual channel and Dual branch framework for Dual challenges in FER), a unified framework tackling both issues simultaneously. D^3FER employs a dual-channel augmentation strategy with weak and strong views to achieve reliable pseudo-labeling. At its core, a dynamic queue mechanism is designed to address label noise by adaptively estimating noise thresholds directly from prediction confidence. Built upon this dynamically updated queue, a momentum-updated Query-Key dual-branch architecture is further developed to enhance feature compactness and discriminability. By virtue of the dynamic queue, the overall framework gains stronger robustness against the dual challenges of visual perturbations and label noise. During inference, the stable Key branch guarantees consistent predictions. Experiments on major in-the-wild benchmarks show D^3FER outperforms state-of-the-art methods in accuracy and robustness. The source code is available at https://github.com/D3FER/D3FER.

Abstract:
With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at https://github.com/vpsg-research/LIDMark.

Abstract:
Automated 3D pulmonary vessel segmentation from CT images is crucial for improving early screening and assessment of pulmonary vessel related diseases. However, it remains an extremely challenging task due to the complex and tree-like structures of vessels, large scale-variations, and the existence of highly similar tissues in the background. Existing segmentation models either cannot sufficiently capture long-range structural dependencies, which are of great importance in vessel segmentation, or are constrained by insufficient computational resources in clinical settings. In this paper, we propose VesMamba, a novel model for 3D pulmonary vessel segmentation that comprehensively addresses these challenges. Specifically, we first devise a spatial-gated structural perception (SSP) module, which employs Mamba to efficiently capture long-range dependencies. In SSP, we design dynamic spatial attention convolutions (DSAC) for dynamically learning the tree-like 3D vessel structures, providing Mamba with the spatial perception capability to better track the complicated topologies of vessels. Second, we propose an innovative bidirectional scale-aware filter (BSF) module to strengthen the representation capability of the encoder, facilitating our model to focus on vessels of different scales under noise. Moreover, we apply a mask-constrained decoder to further improve segmentation consistency and accuracy, which constrains the inference of adjacent low-layer decoders directly by high-layer masks. Extensive experiments on the public Parse22 and internal Lung79 datasets demonstrate that our method can achieve better performance than SOTAs. Code is available at https://github.com/Lzpbright/VesMamba.

Abstract:
Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks for multi-view with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine corrupted embeddings of views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.

Abstract:
Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard-negative samples. Hard-negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.

Abstract:
Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal these approaches sometimes get visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts preferably indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. Besides, to handle unavailable query labels at test time, we introduce a mixture-of-expert mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternative optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.

Abstract:
Existing image fusion methods face difficulties in adapting to unseen fusion tasks and have limitations in balancing semantic information with pixel-level details. This limitation can be attributed to three key challenges: (1) the lack of a unified, task-agnostic optimization objective; (2) the inherent difficulty in balancing semantic fidelity and pixel-level richness; and (3) an over-reliance on supervised learning, which limits transferability across tasks. To overcome these issues, this work proposes a unified fusion framework that generalizes to diverse fusion tasks even when trained solely on infrared-visible image pairs. Specifically, inspired by the free-energy principle, we introduce a fusion paradigm that combines high pixel-entropy expectation with low semantic-entropy expectation, and we design a frequency-aware feature decoupling mechanism to balance semantic content and pixel detail. Furthermore, an unsupervised dual-path trade-off strategy provides collaborative constraints at both semantic and pixel levels. Experiments show that our method significantly outperforms existing state-of-the-art methods in visual quality and downstream-task performance. It not only handles trained tasks efficiently but also generalizes well to unseen fusion tasks, while featuring lightweight model parameters and strong practical applicability. The code is available at https://github.com/XiaoW-Liu/DECC.

Abstract:
The European Space Agency's Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict their practicality. In contrast, the high spectral-spatial resolution sensor (e.g., NASA's AVIRIS-NG) covers only the American region due to practical considerations. This raises a fundamental question: "Can a global hyperspectral coverage be achieved by reconstructing Sentinel-2 data to NASA hyperspectral images?" This study aims to achieve spectral super-resolution from 12-to-186 and unify the spatial resolution of Sentinel-2 data to 5 m. To enable a reliable and efficient reconstruction, we formulate a novel deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, instead of relying on implicit deep priors as conventional deep unfolding does. Moreover, an adversarial term is integrated into the unfolded architecture, enabling the discriminator to guide the reconstruction in both the training and testing phases; we term this novel concept unfolding adversarial learning (UAL). Experiments show that our UALNet outperforms the next-best Transformer in PSNR, SSIM, and SAM, while requiring only 15% MACs and 20 times fewer parameters. The associated code is publicly available at https://github.com/IHCLab/UALNet.

Abstract:
To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.

Abstract:
The development of AI-generated technology requires effective image quality assessment (AGIQA) methods to jointly evaluate visual quality and text-content alignment, ensuring that the generated content is both visually appealing and faithful to the user's instructions. Nevertheless, visual degradation and text-content misalignment often coincide, and it is difficult to tell whether a bad subjective evaluation arises from prompt noncompliance or rendering artifacts. As such, disentangling image content and rendering distortions is vital. We propose the dual-prior guided fusion network (DPGF-Net), which leverages image-side priors to disentangle distortions from content and combines them with text-side prompt templates to simulate their interactions, to address this issue. DPGF-Net employs a local text-conditioned aggregation branch to highlight semantically relevant and quality-sensitive regions in conjunction with a global modulation branch that captures holistic perceptual characteristics. Finally, adaptive fusion produces a single score. Experiments on three AGIQA datasets demonstrate that our method is highly correlated with human judgments, with lower prediction error and stable evaluation behavior. The source code is available at https://github.com/leeto221/AGIQA.

Abstract:
Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two-hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting priors prediction as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand penetration-free diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. Guided by collision gradients during denoising, the model converges toward the manifold of valid two-hand interactions, preserving geometric and kinematic coherence. This generative formulation approach enables physically credible reconstructions even under occlusion or ambiguous visual input. Extensive experiments on InterHand2.6M and HIC show state-of-the-art or leading performance in interaction alignment and penetration suppression. Project: https://gaogehan.github.io/A2P/

Abstract:
The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization.We also extend our study to two variants: a continual version of this problem setting and to one where the specific source classes to be forgotten may be unknown.Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Code is available at https://github.com/D-Arnav/SCADA

Abstract:
Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis--a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at https://github.com/LuoFeifan77/Unsupervised-Spectral-Basis-Learning.

Abstract:
Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels. Code is available at https://github.com/YesianRohn/UnionST.

Abstract:
Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using our pipeline, we collect 100K pairwise instances and demonstrate both efficacy and versatility across diverse applications. Code is available at https://github.com/krafton-ai/ArtiAgent.

Abstract:
The safety of autonomous driving systems (ADS) depends on accurate perception across distance and driving conditions. The outputs of AI perception algorithms are stochastic, which has a major impact on decision making and safety outcomes, including time-to-collision estimation. However, current perception evaluation metrics do not reflect the stochastic nature of perception algorithms. We therefore introduce the Perception Characteristics Distance (PCD), a novel metric incorporating model output uncertainty as represented by the farthest distance at which an object can be reliably detected. To represent a system's overall perception capability in terms of reliable detection distance, we used the averaged PCD values across multiple detection quality and probabilistic thresholds to produce the average PCD (aPCD). For empirical validation, we present the SensorRainFall dataset, collected on the Virginia Smart Roads using a sensor-equipped vehicle (cameras, radar, and LiDAR) controlled under different weather (clear and rainy) and illumination conditions (daylight, streetlight, and night). The dataset includes ground-truth distances, bounding boxes, and segmentation masks for target objects. Experiments with state-of-the-art models show that aPCD captures meaningful differences across weather, daylight, and illumination conditions, which traditional evaluation metrics fail to reflect. PCD provides an uncertainty-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation. The SensorRainFall dataset is publicly available at https://www.kaggle.com/datasets/datadrivenwheels/sensorrainfall, and the evaluation code is available at https://github.com/datadrivenwheels/PCD_Python

Abstract:
Ranking methods or models based on their performance is of prime importance but is tricky because performance is fundamentally multidimensional. In the case of classification, precision and recall are scores with probabilistic interpretations that are both important to consider and complementary. The rankings induced by these two scores are often in partial contradiction. In practice, therefore, it is extremely useful to establish a compromise between the two views to obtain a single, global ranking. Over the last fifty years or so, it has been proposed to take a weighted harmonic mean, known as the F-score, F-measure, or F-beta. Generally speaking, by averaging basic scores, we obtain a score that is intermediate in terms of values. However, there is no guarantee that these scores lead to meaningful rankings and no guarantee that the rankings are good tradeoffs between these base scores. Given the ubiquity of F-beta scores in the literature, some clarification is in order. Concretely: (1) We establish that F-beta-induced rankings are meaningful and define a shortest path between precisionand recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. We show that F1 and its skew-insensitive version are far from being optimal in that regard. (3) We provide theoretical tools and a closed-form expression to find the optimal value for beta for any distribution or set of performances, and we illustrate their use on six case studies. Code is available at https://github.com/pierard/cvpr-2026-optimal-tradeoff-precision-recall.

Abstract:
We consider the problem of imaging a dynamic scene when scene appearance variations can outpace photon arrivals. Under such conditions, a pixel is effectively "blind" to changes in appearance that occur within the timespan separating the photons it detects, and so the inter-photon interval presents a significant speed barrier to video acquisition systems. To analyze and advance imaging capabilities at the inter-photon limit, we introduce a novel reparameterization of time-varying flux that reveals the intrinsic difficulty of signal reconstruction by relating the Fourier decomposition of a flux function to the number of photons arriving within each oscillation period. We find that inter-photon-limited videography of general scenes is underexplored and beyond the reach of existing reconstruction techniques. To this end, we introduce Neural Flux Fields---a technique that combines statistical modeling of photon arrival with intrinsic priors of a neural network to achieve robust videography at the inter-photon limit. Using this approach, we demonstrate never-before-seen capabilities in video reconstruction across a range of captured single-photon video datasets spanning the inter-photon-limited regime.

Abstract:
We present DrivoR, a simple transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores e.g., safety or efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong baselines across NAVSIM-v1/v2, and closed-loop HUGSIM benchmarks. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints are available via the project page.

Abstract:
This paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and para perspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance substantially comparable to that of the state-of-the-art P3P solver.

Abstract:
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

Abstract:
Can a diffusion model produce its own "mental average" of a concept--one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

Abstract:
Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer it toward pinpointing visual details consistent with linguistic descriptions, thus modeling comparability among object details. Extensive evaluations show that LaFG achieves impressive performance on both fine- and coarse-grained benchmarks and generalizes well to unseen categories.

Abstract:
The computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum greatly improves the accuracy for classical shape matching tasks among surfaces, consistently outperforming existing surface-only spectral methods.

Abstract:
We introduce Clothe and Pose, an image generation and editing task that enables users to try on garments while simultaneously adopting any desired pose. Our method takes a single user image, a set of garment images, and a reference pose as input, and outputs the user wearing the target garment in the specified pose. We also propose an evaluation framework for clothing and re-posing tasks. Our study encompasses a wide variety of garments--including athletic wear, bottoms, dresses, innerwear, and swimwear--across diverse poses. Finally, we demonstrate the utility of Clothe and Pose for human-centric editing on both real-world captures and synthetic imagery.

Abstract:
Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed \ours. To support \ours, we introduce \ourmodel , a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our website is available at prateksha.github.io/projects/scale-space-diffusion.

Abstract:
Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera estimation, and diverse view transformations (e.g., third-person - third-person, and head-view - gripper-view in robotic manipulation).

Abstract:
High-fidelity physics simulation is essential for scalable robotic learning, but the sim-to-real gap persists, especially for tasks involving complex, dynamic, and discontinuous interactions like physical contacts. Explicit system identification, which tunes explicit simulator parameters, is often insufficient to align the intricate, high-dimensional, and state-dependent dynamics of the real world. To overcome this, we propose an implicit sim-to-real alignment framework that learns to directly align the simulator's dynamics with contact information. Our method treats the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations. We show that using tactile contact information from robotic hands can effectively model the non-smooth discontinuities inherent in contact-rich tasks, resulting in a neural dynamics model grounded by real-world data. We demonstrate that this learned forward dynamics model improves state prediction accuracy and can be effectively used to predict policy performance and refine policies trained purely in standard simulators, offering a scalable, data-driven approach to sim-to-real alignment.

Abstract:
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce \mathrm Derf (x) = \mathrm erf (\alpha x + s), where \mathrm erf (x) is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including visual recognition and generation, speech representation, and DNA sequence modeling. Our analysis also suggests that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.

Abstract:
Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balance shape and color optimization. AutoMIA generate diverse smooth Mirror Illusion artworks successfully both in the digital and physical world, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design.

Abstract:
Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera's display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.

Abstract:
Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static--limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), an scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic human-scene interaction dataset and real-world object manipulation dataset, then evaluate it across both synthetic and real-world egocentric benchmarks. Experiments demonstrate that DWM enables realistic, physically grounded interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework establishes the first step toward video diffusion-based interactive digital twins, enabling embodied simulation and 3D scene interactivity from egocentric actions.

Abstract:
Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new \color ddtD ecoupled \color ddtD iffusion \color ddtT ransformer(\color ddtDDT ), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256x256, Our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4xfaster training convergence compared to previous diffusion transformers). For ImageNet 512x512, Our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

Abstract:
Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.

Abstract:
Acquiring physically plausible motor skills across diverse and unconventional embodiments, including humanoids and quadrupeds, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), require extensive reward function engineering. Imitation learning (IL) offers an alternative but relies heavily on curated 3D expert demonstrations, which are scarce and difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic-looking videos of various morphologies, from humans to ants. However, these videos are often not physically plausible, which limits their direct use for skill acquisition. We introduce "No-data Imitation Learning" (NIL): an imitation learning framework that replaces curated expert demonstrations with videos generated by a pretrained video diffusion model. Our key insight is that the physics simulator enforces physical constraints, while the video provides visual guidance. NIL learns 3D motor skills in a physics simulator from 2D-generated videos, with generalization capability to unconventional forms. Specifically, NIL computes a discriminator-free imitation reward that combines (i) a video-embedding similarity between generated and simulated videos using a pretrained video vision transformer, and (ii) an image-based similarity term derived from video segmentation masks. We evaluate NIL on locomotion and whole-body control tasks across unique body configurations. Our experiments show that in humanoid locomotion, NIL matches the performance of state-of-the-art IL baselines trained on motion capture data; and in whole-body manipulation, it exceeds the performance of RL baselines without requiring any curated data. Our project page, including videos and code, is available at https://nil.is.tue.mpg.de.

Abstract:
Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.

Abstract:
Conventional imaging systems capture objects visible in the direct line-of-sight (LOS). A decade of research on non-line-of-sight (NLOS) imaging approaches has made it possible to reconstruct hidden geometry outside the line of sight by analyzing indirect light transport. However, most existing methods operate in the optical visible or IR range. Relying on diffuse inter-reflections, every bounce incurs a quadratic intensity falloff. As such, with illumination power limited by eye-safety limitations, existing methods are fundamentally restricted to short ranges on the order of a few meters. We propose an X-band radar-based NLOS imaging method that leverages the long wavelength to convert diffuse reflections into predominantly specular ones, allowing for large-scale hidden-scene perception. We develop a neural reconstruction method that combines a learned dense prediction module and a geometry-aware NLOS reconstruction module, tackling the inherently low spatial resolution of long-wavelength radar. We assess our method using a prototype system and in simulation. Synthetic validation shows that, under the same transmit power, X-band radar achieves 10xlonger NLOS reconstruction range than optical systems, while experimental results further demonstrate accurate hidden-object reconstructions up to 40 m, establishing a practical pathway toward real-world long-range NLOS sensing.

Abstract:
Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches often depend on human-annotated labels or task-specific heuristics to define verifiable rewards--both costly and limited in scalability. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning capabilities from massive unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained using Group Relative Policy Optimization (GRPO), which uses diversity and difficulty rewards to balance the difficulty of generated questions with the quality of silver answers. VisPlay scales efficiently across two model families. Trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks including MM-Vet and MMMU, and establishes a scalable path toward self-evolving multimodal intelligence.

Abstract:
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Abstract:
We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30xgreater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that \model's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.

Abstract:
Aligning natural language descriptions with precise 3D human poses remains a big challenge due to the scarcity of effective pose representation mechanisms and large-scale, semantically rich datasets. To overcome these limitations, we first introduce CLEP-2M, the largest 3D pose-language dataset to date, comprising two million high-quality 3D pose-language pairs. This dataset provides a 20-fold increase in scale and far richer semantic diversity than existing benchmarks. Second, we propose CLEP, a novel contrastive pretraining framework. The core of CLEP is HierFormer, a hierarchical pose encoder specifically designed for language alignment. Its key innovation is a Cross-Scale Attention Fusion (CSAF) mechanism that dynamically integrates features from the joint, limb, and body levels. This enables CLEP to precisely align complex, multi-scale text descriptions with the pose representation. Extensive experimental evaluations on CLEP-2M and PoseScript demonstrate that our method consistently outperforms existing approaches across a range of downstream tasks. CLEP shows exceptional zero-shot generalization, achieving a 34.8 mRecall on the human-annotated PoseScript-H benchmark--a nearly 6-fold improvement from the baseline. Furthermore, CLEP demonstrates superior performance on pose generation and fine-grained pose editing. These results establish CLEP as a strong multimodal foundation model for human-centric understanding and generation tasks.

Abstract:
We present a general optimization-based framework for depth-preserving normal integration. Unlike existing methods that operate on surface orientations defined over regular grids, our approach introduces a unified graph-based formulation capable of integrating semi-differentiable surfaces on unstructured domains. Given a set of points uniformly sampled from a surface, we construct a directed, weighted graph that jointly parameterizes the surface geometry and pairwise point correlations. Surface depth is recovered by minimizing projected point-to-plane distances across the graph, and we attain this objective through variational inference. In our formulation, estimated surface normals serve as latent variables that encode local geometry via the posterior probabilities of a two-component Gaussian mixture, allowing depth discontinuities to be explicitly inferred from sampled triplet configurations. The unknowns are estimated in an alternating fashion, and we provide a geometric interpretation of this inference process by relating it to shape deformation. Experimental results show that the proposed method not only delivers superior performance on regularly gridded data, but also generalizes effectively to scattered points, which existing approaches do not directly support.

Abstract:
Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

Abstract:
Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

Abstract:
Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up optimization and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors, effectively suppress local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6x in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.

Abstract:
In real-world scenarios, new views are continuously collected over time, forming a dynamic view stream. To handle such evolving data, a lifelong multi-view clustering framework is needed instead of a static model. However, large discrepancies across views make it challenging to learn new knowledge while preserving previously acquired information. There are few methods use consistency alignment or knowledge distillation to align new knowledge with old ones. However, these strategies cannot fundamentally prevent knowledge degradation, since new knowledge inevitably interferes with the learned representation space. To overcome this limitation, we propose a new Anti-degradation Lifelong Multi-view Clustering (ALMC) framework. Specifically, we innovatively propose a null-space-projection knowledge base anti-degradation technique, which ensures that new knowledge updates to the model only occur in directions orthogonal to the retained knowledge, thus preventing catastrophic forgetting of knowledge and degradation of clustering performance, and provides theoretical proof for this. Extensive experiments on multiple multi-view benchmark datasets demonstrate superior performance in multi-view clustering.

Abstract:
Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state of the art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event's 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera.

Abstract:
Visually imperceptible surface deformations encode rich information about a scene, from the mechanical properties of an object to the acoustic vibrations present in the surrounding environment. Optical interferometric techniques can reveal these subtle changes, typically by capturing a sequence of measurements to perform temporal phase shifting. In this paper, we introduce Computational Speckle Pattern Interferometry (CSPI), a novel single-shot approach to estimating per-pixel displacement and motion. Our key insight is that the image formation model for speckle pattern interferometry can be decomposed into spatial and temporal factors, each represented as a vector. After calibrating for the spatial term, we recover the scene dynamics using a reconstruction algorithm modeled after the classic Horn-Schunck method for estimating optical flow. Unlike traditional interferometric methods, CSPI requires no precision instrumentation to perform phase stepping. We demonstrate its effectiveness by measuring per-pixel displacements and motions at sub-micrometer scales, visualizing high-frequency vibrations of a tuning fork and a Chladni plate, and recovering sound indirectly from these vibrations.

Abstract:
Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on acoustic environments. To address this challenge, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We further propose SounDiT, a diffusion transformer (DiT)-based model that incorporates acoustic environments and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in the GeoS2L, while the PSS effectively captures multi-level generation consistency across element, scene,and human perception.

Abstract:
Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting--where peers communicate without a coordinator--remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose \methodname, a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) (\varepsilon,\delta) network-unlearning certificates on client views via subsampled Gaussian Renyi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy-utility trade-off. Empirically, on image benchmarks, \methodname matches a given (\varepsilon,\delta) while achieving higher test accuracy than decentralized DP baselines and reducing backdoor accuracy (ASR) to the random-guessing baseline ((~ 10%)).

Abstract:
Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA-2.

Abstract:
Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. We propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas.Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

Abstract:
As a representative technique in neural architecture search, neural architecture generation aims to construct high-performance architectures for a given task directly. It is poised to replace the inefficient random exploration components of some search strategies, such as the acquisition strategies in Bayesian optimization. Despite significant research, current architecture generation techniques face problems such as low generation efficiency and insufficient constraints, leading to invalidly generated architectures. To this end, we propose Progressive Neural Architecture Generation (PNAG), which constructs architectures incrementally through coarse-to-fine evolution, enhancing generation efficiency, and incorporates step-wise refinements to ensure the validity of the generated architecture. To achieve this, PNGA involves two modules, multi-scale sub-architecture quantization (MSQ) and step-wise consistency constraint (SCC). Specifically, MSQ constructs sub-architectures using quantization decoding and progressively expands them, transitioning from simple to complex forms. This operation bypasses network inference to enhance efficiency.Complementing MSQ, SCC, implemented through a tailored regularization mechanism, introduces penalties for deviations during sub-architecture generation, guiding the process towards valid target architectures. As such, PNAG establishes a clear generation path, laying the groundwork for generating suitable architectures in downstream tasks. Extensive experiments demonstrate that PNAG not only generates superior architectures for various downstream tasks (+8.43%/+5.07%, on average) but also significantly improves generation efficiency, reducing the architecture generation time by 1300x. Furthermore, PNAG demonstrates strong extensibility by successfully generating Transformer-based architectures.

Abstract:
High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that can create 3D content from user-provided prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models deliver high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. This restricts how users can express their intent and raises a natural question: can the two modalities be combined to yield more flexible and faithful 3D generation? Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions improves over single-modality models, evidencing strong cross-modal complementarity. Building on this finding, we formalize the task of Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification during generation. To address this task, we introduce TIGON, a minimalist dual-branch baseline that maintains separate image- and text-conditioned backbones with lightweight cross-modal fusion. Extensive experiments demonstrate that text-image conditioning yields consistent gains over single-modality methods, suggesting complementary vision-language guidance as a promising direction for future 3D generation research.

Abstract:
Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.

Abstract:
Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.

Abstract:
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

Abstract:
We introduce the novel task of egocentric self-emotion tracking, which aims to infer an individual's evolving emotions from egocentric multimodal streams such as voice, visual surroundings, semantic subtext, and eye-tracking signals. To establish this research direction, we present: (1) OSMO dataset, a large-scale annotation effort on 110 hours of existing bilingual smart-glasses recordings, establishing the largest egocentric emotion dataset and the first with subject-wise emotion timelines; (2) OSMO benchmark, a suite of five tasks (emotion recognition, sentiment, intensity, localization, and reasoning), that redefine emotion understanding as a continuous, context-aware process rather than discrete classification of trimmed videos; (3) OSIRIS, a large multimodal model that tracks emotions over time by reasoning over the user's personal emotion history, current expressions, and egocentric observations. Extensive evaluations show that OSIRIS achieves a state-of-the-art performance, delivering, for the first time, coherent emotion timelines from egocentric data. Project website: https://osmo-emos.github.io.

Abstract:
We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment-originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.

Abstract:
We introduce ART, Articulated Reconstruction Transformer--a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

Abstract:
Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.

Abstract:
Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time without any retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-\alpha's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces the compute by 16xand data by 14x. Under aligned inference settings, our heterogeneous configuration achieves better FID and higher intra-prompt diversity than the homogeneous baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework makes decentralized generative model training accessible to contributors with single GPUs requiring only 24--48GB VRAM.

Abstract:
Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 1.8x and 2.1x higher than the next-best methods, DDTracking and TractOracle, respectively. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by a factor of 3.5. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

Abstract:
Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks.

Abstract:
Modern cameras, such as smartphone cameras and DSLRs, are equipped with gyro sensors that measure motion of the camera. While the motion information is valuable for deblurring, gyro-based deblurring has not been widely studied, particularly for video. A few gyro-based video deblurring methods have been proposed, but they exhibit inherent limitations. First, gyro sensors capture only rotational motion, leading these methods to ignore translational motion. Second, their dependence on simplified blur models and deconvolution-based solutions restricts overall performance. To address these limitations, we introduce GyroDVD, the first learning-based framework for gyro-based video deblurring. We propose a novel blur kernel construction scheme that jointly accounts for rotational and translational motion. A video deblurring network then restores sharp videos by exploiting the constructed kernels together with the video frames. For training and evaluation, we introduce the GyroVD dataset, a large-scale and realistic dataset specifically designed for gyro-based deblurring. Extensive experiments demonstrate that our method significantly outperforms prior gyro-based image and video deblurring methods.

Abstract:
Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation.

Abstract:
We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.

Abstract:
Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.

Abstract:
Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT is the state-of-the-art for detection and pose estimation on the BOP benchmarks, often achieving double digit improvements, in many cases, outperforming non-zero-shot methods, and winning 7 of 11 tracks in the BOP-2025 challenge. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks, and in some cases outperforming non zero-shot methods. It also ranked first in 7 of 11 tracks at the BOP Challenge 2025. 3PT's high-accuracy and reliability is well-suited for practical industrial robotics applications such as bin picking and precise insertion. Project Page can be found at https://www.intrinsic.ai/publications/3pt-cvpr2026.

Abstract:
Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.

Abstract:
Hierarchical image recognition aims to predict labels across a semantic taxonomy, typically assuming fine-grained annotations for every image. However, real-world supervision may appear at any level: a distant bird may only be labeled as "Bird", while a clear image allows "Bald eagle". To reflect this reality, we introduce free-grained hierarchical recognition, where training labels can appear at any level of a taxonomy, requiring consistent predictions under partial and mixed supervision. We construct benchmark datasets with varying label granularity and show that existing hierarchical methods degrade significantly in this setting. To address this, we propose simple yet effective approaches that leverage 1) semantic guidance from vision-language models and 2) visual structure through semi-supervised learning. Finally, we study free-grained inference, where the model adaptively selects prediction depth, enabling reliable coarse predictions when fine-grained ones are uncertain. Together, our task, datasets, and methods provide a practical step toward hierarchical recognition in real-world scenarios.

Abstract:
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the x and y directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.

Abstract:
The neural process (NP) is a probabilistic meta learning model that learns distributions over functions via a global latent variable.It enables fast adaptation in few-shot scenarios by leveraging past experience. However, the design of latent variable structures and conditioning mechanisms in NPs remains underexplored, despite their importance in capturing diverse functional distributions.This paper proposes a new variant of NPs via mixture density modeling, referred to as the neural mixture density process (NMDP).The NMDP decomposes model parameters into task-agnostic and task-specific components to represent function distributions more flexibly. We train the model via the Expectation-Maximization algorithm to construct expressive functional priors.Compared with existing work, our method maintains several advantages: (i) less overfitting by updating a small part of the network parameters, (ii) compact task representation via distributions in the simplex,(iii) an improvement guarantee of generative likelihoods over iteration. Experimental results show that our method can achieve competitive performance with adequate explainability.

Abstract:
Art has long been a profound medium for expressing emotions.While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles.To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content.We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping.First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS.We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries.Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries.Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency.Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications.Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.

Abstract:
Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig.img1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.

Abstract:
Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multiview video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control. For additional results and code, please refer to https://physhead.github.io

Abstract:
Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG

Abstract:
Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support this FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory-refinement module for safe execution. Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.

Abstract:
We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.

Abstract:
Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

Abstract:
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive one-step FID of 3.23 with the lowest GFLOPS among all baselines.

Abstract:
We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds on the website of the supplementary materials.

Abstract:
We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate all context globally directly from sequential video input, reconstructing a complete geometry for every single timestamp, including occluded regions visible in other frames. Our method demonstrates the state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D Point Tracking task. Code will be released to support future research.

Abstract:
We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

Abstract:
Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.

Abstract:
Humans have remarkable selective sensitivity to identities--they easily distinguish between highly-similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress towards identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

Abstract:
Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments.While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues--a key source of spatial context in human navigation.In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning.Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.

Abstract:
Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.

Abstract:
With the rise of visual-language models, multi-modal ReID retrieves specific targets by integrating different spectra and textual descriptions. Existing methods merely adopt descriptive representation learning for image-text, ignoring the relationships among the intrinsic logical hierarchies of semantic features. Since Chain-of-Thought (CoT) can provide textual logical context and enhance semantic perception in large-model reasoning, we propose CoT-ReID, a CoT-guided framework that injects the Multi-modal Large Language Models (MLLMs) reasoning into multi-modal ReID. Specifically, we simulate the joint visual-textual logical decision-making of human reasoning, leveraging CoT textual logical reasoning to guide visual feature learning at the early, late, and decision-making level: At the early level, we embed the semantic reversion of CoT hierarchical reasoning into visual features to calibrate bottom-level features and emphasize visual hierarchical reasoning. Next, we take CoT hierarchical reasoning text as an anchor condition to constrain the consistency of visual cross-modal semantics. Finally, through the hierarchical reasoning process of CoT, we embed logically reasoned text attribute features into multi-modal decision-making, providing logical support for selecting discriminative identity features. By constructing CoT textual benchmarks and our proposed modules, our framework generates more robust multi-modal features in complex scenarios. Comprehensive experiments on four datasets (RGBNT100, MSVR310, WMVeID863, RGBNT201) demonstrate that our method outperforms existing approaches. Code will be released upon acceptance.

Abstract:
High-resolution and high-precision digital elevation models (DEMs) of the lunar surface are essential for landing site selection and geological research. However, traditional stereo matching provides a limited representation of the 3D scene and struggling with non-textured regions and extreme illumination variations. Recent lunar neural rendering methods are also ill-suited for 3D reconstruction due to their reliance on simple pinhole approximations for pushbroom sensors. These challenges are further compounded by geometric misalignment, distributional bias, and labor-intensive handcrafted preprocessing in satellite image pipelines. To address these issues, we introduce the Lunar Neural Elevation Model (LNEM), a volumetric reconstruction method that explicitly incorporates the pushbroom imaging process. A core component of our approach is Lunar Studio, a multi-orbit dataset and pipeline constructed using Rigorous Sensor Models (RSMs) to produce geometrically consistent observations from the Lunar Reconnaissance Orbiter Camera (LROC) Narrow Angle Camera (NAC) and the Korea Pathfinder Lunar Orbiter (KPLO) Lunar Terrain Imager (LUTI). LNEM integrates this pushbroom camera formulation with learned shadow modeling, enabling geometrically grounded and illumination-aware volumetric rendering under challenging lunar lighting conditions. Extensive experiments demonstrate that LNEM achieves geometrically consistent reconstruction across multiple sensors under diverse viewing and illumination conditions, providing a scalable complement to conventional DEM pipelines. To support reproducibility and future lunar research, we release Lunar Studio, the multi-orbit dataset, and the LNEM reconstruction pipeline.

Abstract:
Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.

Abstract:
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.

Abstract:
We introduce Particulate, a feed-forward model that, given a 3D mesh of an object, infers its articulations, including its 3D parts, their kinematic structure, and the motion constraints. The model is based on a transformer network, the Part Articulation Transformer, which predicts all these parameters for all joints. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate maps the output of the network back to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate also works on AI-generated 3D assets, enabling the generation of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D model. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Empirically, Particulate significantly outperforms state-of-the-art approaches.

Abstract:
Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored.To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We thoroughly evaluate Gaussian Mapping for Evolving Scenes (GaME) on both synthetic and real-world datasets, achieving a 29.7% improvement in PSNR and a 3 times improvement in L1 depth error over the most competitive baseline.

Abstract:
The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantic, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. Specifically, SURF divides video generation into two stages: First, we leverage the pretrained model to infer at optimal resolution and downsample latent to generate low-resolution previews in fast speed; then we design a Refiner to upscale the preview. In the preview stage, we identify that directly inferring a model (trained with higher resolution) on lower resolution causes severe losses in signatures. So we introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps on the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to get a powerful and efficient Refiner. In this way, SURF enables generating high-resolution videos efficiently while maximally closer to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base model and acceleration methods. For example, it achieves 12.5x speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and 8.7x speedup for generating 5-second, 24fps, 720p HunyuanVideo.

Abstract:
We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.

Abstract:
With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io

Abstract:
Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.

Abstract:
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios.

Abstract:
Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data.

Abstract:
Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouples the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained strategy successfully resolve the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for complex spatial tasks. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by 27%.

Abstract:
We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.

Abstract:
Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks.We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning.To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model.First, conventional linear decoder in pixel MVM enforces the predictor's output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction.Our Stage 1 proposes a conditional diffusion decoder and injects clean image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction.Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning.Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

Abstract:
3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility check with Z-buffers. The second is PoseAlign that transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.

Abstract:
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution--a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.

Abstract:
Vector images have attracted increasing attention in the field of information hiding in recent years due to their scalability and structural properties. However, existing steganographic methods for vector images often introduce noticeable modifications to the files themselves, resulting in potential security risks and limited embedding capacity. Motivated by recent advances in diffusion models and image generative steganography, we propose GVIS, a novel Generative Vector Image Steganography framework. GVIS deterministically generates raster images using diffusion models, which are subsequently vectorized into vector images. On the sender side, we design a lightweight overlap detection algorithm to identify cubic Bezier curve control points suitable for data embedding, which enables the secret information to be encoded into the coordinate parameters of these control points. Then, the receiver can use the pre-shared settings to reconstruct the generation process and accurate message extraction by difference. Extensive theoretical analysis and experimental results demonstrate that GVIS achieves high-capacity, high-accuracy, secure, and training-free steganography. To the best of our knowledge, this is the first attempt to apply generative models to vector image steganography.

Abstract:
Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (_Marine Open-Vocabulary Instance Segmentation_), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) Instance segmentation (UOVIS), featuring a limited set of seen categories and diverse unseen categories. Although OV instance segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by lack underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines both In-Domain and Cross-Domain setting on MARIS, establishing a strong foundation for future underwater perception research. The code of this paper can be found in Github.

Abstract:
Most vision-language models assume images have a single literal meaning, even though images are inherently polysemous. We propose a retrieval paradigm that models many-to-many relationships between images and text using interpretive lenses and introduce Lenses, a multi-prompt embedding model and dataset for polysemous image-text retrieval. The Lenses dataset contains 105,669 images and 732,405 captions, with each image paired with multiple captions and image-side prompts annotated across five categories: Literal, Figurative, Abstract, Background, and Emotional. Building on a multimodal large language model, the Lenses model uses learned lens tokens to extract lens-specific embeddings for every image and caption and compares these using a lens-masking similarity function with a global fallback that prioritizes same-lens matches while retaining a global pathway. Training uses a category-aware multi-positive contrastive loss and intra-set diversity regularization to align corresponding perspectives while preventing semantic collapse across lenses. We further propose lens-aware evaluation protocols, including category-aware ranking, that better reflect how humans match images and text. Experiments on the Lenses dataset and public benchmarks show that our model outperforms baselines on literal and non-literal retrieval and reduces over-reliance on literal cues.

Abstract:
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then straightforward for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.

Abstract:
Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.

Abstract:
Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.

Abstract:
Recently, vision foundation models have been shown to boost the efficiency of flow-based generative models by revealing the intrinsic union-of-manifold structures and lowering the complexity of the latent/target distribution. In this paper, we exploit the multimodality aspect of the union-of-manifold structures, and aim to further improve the learning and inference efficiency for flow-matching models. To this end, we propose an efficient source and coupling co-design method termed Mixture-Modeling Flow Matching (MM-FM), by integrating a data-adaptive multimodal source distribution (implemented as Gaussian mixture models) and mode-dependent data coupling. The former shortens the distance between the source and the target, and the latter promotes local and straighter flows. We also derive theoretical results to confirm our intuition in a quantitative sense. In our experiments on ImageNet256x256 with multimodal DINOv2-B latents, MM-FM exhibits superior learning efficiency and state-of-the-art unconditional generation quality: FID=2.74 with autoguidance in only 80 epochs.

Abstract:
The rapid advancement of generative AI enables the creation of highly realistic text images, posing significant security risks from fraud and disinformation. However, research into robust detection is critically hampered by existing datasets that lack scale, diversity, and updated counterfeit techniques, as well as by models that fail to generalize. To address these deficiencies, we introduce DanceText, a large-scale, comprehensive dataset for AI-counterfeited text image detection. Constructed using our novel Creative Proposer pipeline, which automates the generation of diverse and realistic counterfeits, DanceText surpasses previous works by over 100-fold in multiple dimensions. It is the first to include counterfeits from multimodal large models, commercial software, and mobile apps, covering all major paradigms, including full-image generation, regional removal, and editing. Building on this dataset, we propose DS-Net, a novel and effective detection model. It features two key components: a Forensic Decoupling Encoder to extract generator-agnostic artifact features, and a Synergy Denoising Decoder that synergizes image-level classification with instance-level localization. Extensive experiments demonstrate that DS-Net achieves state-of-the-art performance, advancing the field toward robust and unified detection of AI-counterfeited text images in real-world scenarios. Both our code and dataset will be released publicly.

Abstract:
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Hand-off for a five times cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.

Abstract:
Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales.

Abstract:
Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance in both Eva-3M and EMHI datasets.

Abstract:
Beyond general recognition tasks, specialized domains and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA A for class priors and per-image LoRAs \mathcal B for image-specific characteristics. To expose coherent class semantics in the shared LoRA A, we propose a semantic boosting by preserving class bounding boxes during training. For generation, we compose A with a mixture of \mathcal B using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.

Abstract:
Image-to-text (I2T) understanding and text-to-image (T2I) generation are two fundamental, important yet traditionally isolated multimodal tasks. Despite their intrinsic connection, existing approaches typically optimize them independently, missing the opportunity for mutual enhancement. In this paper, we argue that the both tasks can be connected under a shared Auto-Encoder perspective, where text serves as the intermediate latent representation bridging the two directions - encoding images into textual semantics (I2T) and decoding text back into images (T2I). Our key insight is that if the encoder truly "understands" the image, it should capture all essential structure, and if the decoder truly "understands" the text, it should recover that structure faithfully. Building upon this principle, we propose Unified-GRPO, a post-training method based on reinforcement learning that jointly optimizes both modules through reconstructive rewards, maximizing the semantic consistency between the input and the generated images. Under this reconstruction objective, the encoder is encouraged to extract as much accurate and comprehensive semantic information from the input image to maximize reconstruction quality, while the decoder is simultaneously optimized to generate conditioned on the encoder's prior, enabling a self-evolving improvement. Empirically, we find that using text as the intermediate representation and training under a reconstructive RL paradigm effectively benefits both I2T and T2I. The I2T module gains stronger fine-grained visual perception, such as small-object recognition, grounding, etc, while its dense embeddings and language priors, in turn, provide richer semantic signals that improve T2I fidelity and complex instruction following. These results demonstrate that the reconstructive RL establishes a mutually reinforcing cross-modal synergy within the auto-encoding framework.

Abstract:
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone.To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity.A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis.To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD).Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency.

Abstract:
Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this recall ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.

Abstract:
We present ARTrack-AC, a new step in the autoregressive tracking paradigm that introduces adaptive capacity inference to achieve both temporal consistency and dynamic efficiency. While existing autoregressive trackers predict object states sequentially with fixed inference capacity, they fail to accommodate the fluctuating temporal difficulty of real videos. ARTrack-AC addresses this limitation by equipping the tracker with the ability to modulate its inference capacity over time. A diffusion-based difficulty estimator anticipates the stability of upcoming segments, guiding a controller to switch between an accurate (high-capacity) and an efficient (low-capacity) mode while maintaining autoregressive consistency. This system-level autoregression extends conventional sequence modeling beyond "what to predict" toward "how to predict," forming a self-regulated tracking process that aligns inference cost with temporal complexity. Despite its simplicity, ARTrack-AC achieves state-of-the-art accuracy-speed trade-off on major benchmarks--66.7% AUC on LaSOT and 47.5% AUC on LaSOText--running 2.9xfaster than its predecessor.

Abstract:
Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet--GarmentNet--TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.

Abstract:
Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for unprofessional users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but not concerned in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.

Abstract:
Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.

Abstract:
We introduce LAM, a system that explores collaboration between large language models and vision-language models to generate articulated objects from text prompts without a visual prior or prebuilt 3D assets. In contrast, we formulate articulated object generation as a unified code-generation task, in which geometry and articulations can be co-designed from scratch. Given an input text, LAM coordinates a team of specialized modules to generate code to represent the desired articulated object procedurally. LAM first reasons about the hierarchical structure of parts (links) with Link Designer, then writes code, compiles it, and debugs it with Geometry and Articulation Coders and self-corrects with Geometry and Articulation Checkers. The code serves as a structured, interpretable bridge between individual links, ensuring the correct relationships among them. Experiments demonstrate the power of leveraging code as a generative medium within a collaboration system, showcasing its effectiveness in automatically constructing complex articulated objects.

Abstract:
Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives.

Abstract:
Diffusion models (DMs) have become the state-of-the-art for generative tasks across domains, but their reliance on sequential forward passes limits real-time performance. Prior acceleration methods mainly reduce sampling steps or reuse intermediate results. Leveraging the flexibility of Diffusion Transformers (DiTs) to handle variable token counts, we propose RAS, a training-free sampling strategy that dynamically assigns different update ratios to image regions based on model focus. Our key observation is that at each step, DiTs concentrate on semantically meaningful areas, and these regions exhibit strong continuity across consecutive steps. Exploiting this, RAS updates only focused regions while reusing cached noise for others, with focus determined from the previous step's output. Evaluated on Stable Diffusion 3 and Lumina-Next-T2I, RAS achieves up to 2.36x and 2.51x speedups, respectively, with minimal quality loss. This demonstrates a practical step toward more efficient diffusion transformers for real-time generation.

Abstract:
Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt learning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art prompt learning methods.

Abstract:
Generating new Low-Rank Adaptation (LoRA) weights from pre-trained LoRAs has demonstrated strong generalization capabilities across various tasks, enabling the efficient transfer of AI models, particularly on resource-constrained edges. However, previous studies either merge base LoRAs via weighting coefficients or train a generative model under the closed-world assumption, limiting their efficiency and flexibility in complex edge user cases. This challenge may further increase when there are significant domain shifts between training and deployment. To this end, we propose Semantic-Guided LoRA Parameter Generation (SG-LoRA), a tuning-free generative framework to efficiently produce task-specific parameters for unseen tasks in a semantic-to-LoRA pipeline. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts, while also offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work. Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. The code is attached in the supplementary material.

Abstract:
Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings.Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked.In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment.In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain.Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times.For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.

Abstract:
Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%).

Abstract:
The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which can efficiently leverage large amounts of unlabeled data, significantly outperforming weakly supervised algorithms trained with partially weak annotations, also offers a lower cost solution; (2) Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses traditional semi-supervised algorithms.

Abstract:
Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm.

Abstract:
We introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization.Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieve faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms.Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes.Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.

Abstract:
Reconstructing video content from EEG (electroencephalogram) is a research task of significant scientific importance. However, due to inter-subject differences in physiological states and variations in signal acquisition configurations, this task faces the challenge of inconsistent cross-subject generation.To address this, we propose a Subject Adversarial and Mapping Network (SAM-Net). In SAM-Net, we first introduce a Hybrid Region-Temporal (HRT) Encoder to conduct inter-channel semantic interactions guided by brain regions and aggregate temporal semantics across different time scales. Secondly, we propose a Centered-progressive Subject Adversarial (C-SA) Mechanism to gradually narrow the metric distance between different subjects, thereby obtaining a unified and stable semantic representation. Thirdly, we design a New2Source Mapper to align the EEG distribution of new subjects with that of multiple known subjects. Finally, we adopt a keyframe-guided continuous semantic generation paradigm to drive the production of coherent and high-quality videos. Extensive experiments validate the competitive performance of our SAM-Net in cross-subject EEG-to-Video generation tasks, as well as its excellent performance in generation tasks involving new subjects.

Abstract:
This work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping. Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also estimating the camera poses. Such a framework is important to understand the full surrounding, e.g., for augmented reality applications or autonomous systems. We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model. To reduce memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes. For evaluation, we provide a dataset of controlled real-world and synthetic test scenes (indoor and outdoor), employing a custom developed virtual camera lens. An extensive evaluation shows that, for camera tracking, the proposed method achieves statistically significant lower ATE RMSE scores compared to a recent omnidirectional SLAM system, as well as other 3DGS-SLAM frameworks, while reaching a similar mapping performance.

Abstract:
Due to the lack of ground truth, existing methods of active stereo matching generally employ fully self-supervised learning to produce precise depth estimates. Although they can achieve promising results, their performance still has a noticeable gap compared with supervised models. To fill this gap, we propose a novel framework that synthesizes proxy labels to enable supervised training of deep active stereo networks without requiring any ground-truth depth. To expand the training data and generate disparity proxy labels, we develop an active 2D Gaussian Splatting (2DGS)-based synthesis method that explicitly models the scene geometry and the projected active pattern. Furthermore, to balance the varying contributions of different supervisions during training, we design a hybrid supervision regularization strategy that dynamically adjusts the loss weights to achieve stable optimization. We also contribute a real-world dataset captured by a handheld RealSense camera, along with our active 2DGS model, which facilitates future research on active stereo matching. Extensive qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance on active stereo matching task. The code and dataset will be publicly released.

Abstract:
Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop.Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories.Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.

Abstract:
Motion forecasting predicts where points will move in the future, while motion tracking predicts where they are in the present. Despite these similarities, existing approaches to the two problems are quite different. In this paper, we propose a unified model that can address both tasks. We train a causal, video-conditioned flow matching model to predict point positions. The resulting model can easily toggle between point tracking to forecasting by changing its visual signal. Despite our model's simplicity, we find that it outperforms prior work in point forecasting and obtains performance that is competitive with the state-of-the-art on the TAP-Vid benchmark.

Abstract:
Inter-object relations underpin spatial intelligence, yet existing representations--linguistic prepositions or object-level scene graphs--are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.

Abstract:
Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network.Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method.

Abstract:
Neural asset authoring and neural rendering have traditionally evolved as disjoint paradigms: one generates digital assets for fixed graphics pipelines, while the other maps conventional assets to images. However, treating them as independent entities limits the potential for end-to-end optimization in fidelity and consistency. In this paper, we bridge this gap with NeAR, a Coupled Neural Asset--Renderer Stack. We argue that co-designing the asset representation and the renderer creates a robust "contract" for superior generation. On the asset side, we introduce the Lighting-Homogenized SLAT (LH-SLAT). Leveraging a rectified-flow model, NeAR lifts casually lit single images into a canonical, illumination-invariant latent space, effectively suppressing baked-in shadows and highlights. On the renderer side, we design a lighting-aware neural decoder tailored to interpret these homogenized latents. Conditioned on HDR environment maps and camera views, it synthesizes relightable 3D Gaussian splats in real-time without per-object optimization. We validate NeAR on four tasks: (1) G-buffer-based forward rendering, (2) random-lit reconstruction, (3) unknown-lit relighting, and (4) novel-view relighting. Extensive experiments demonstrate that our coupled stack outperforms state-of-the-art baselines in both quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires future graphics stacks that view neural assets and renderers as co-designed components instead of independent entities.

Abstract:
Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model to obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on standard HAC and PGPIS benchmarks show that our approach produces more physically plausible, higher-quality completions with significantly improved prompt alignment compared to existing amodal completion and pose-guided synthesis methods.

Abstract:
Embodied navigation stands as a foundation pillar within the pursuit of embodied intelligence. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, where they differ in task settings/objectives and modalities, making datasets and methods designed individually. In this work, we take steps toward generalist navigation, which can follow free-form instructions that include arbitrary compounds of modality and capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench is constructed via a designed automatic annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains designed learning policies and rewards. Specifically, inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer, we design TBA-SFT and Nav-GRPO to achieve thinking-before-action for embodied navigation, improving model's reasoning ability toward generalists. TBA-SFT utilizes the TBA-CoT dataset to fine-tune the model, and then we leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with the previous methods.

Abstract:
We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model not only to generalize to novel objects / scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion model based generative 6-DOF grasping models to condition on the additional gripper's representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 395 Million grasps. In simulation experiments, our model has the best zero-shot generalization to novel real-world grippers and objects over baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our sweep-volume gripper representation and our procedural gripper training dataset. Last, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.

Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive.This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data?To investigate this problem, we first introduce three self-supervised pretext tasks for video understanding: Anomaly Grounding, Object Counting, and Temporal Jigsaw. To validate the difficulty of these tasks, we construct the Video Intrinsic Understanding Benchmark (VIUBench), revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that our VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs.

Abstract:
Facial image datasets have become essential resources for various face analysis tasks, but their use raises significant privacy concerns. To address this issue, face obfuscation has emerged as a practical approach to hide identity from humans while retaining cues decipherable by machines. However, existing methods often leave exploitable visual traces, making them vulnerable to reconstruction attacks that restore hidden identity. To address this issue, we propose a frequency-domain manipulation framework, called FreM, which adjusts frequency subbands differently to hide identity, retain machine-decipherable cues, and improve robustness against reconstruction attacks. Specifically, the proposed FreM first decomposes a facial image into frequency subbands and applies subband-adaptive modulation that regulates information according to the characteristics of each subband. The modulation parameters are then refined to yield the reliable obfuscated result. Extensive experiments across multiple face analysis benchmarks demonstrate that FreM achieves superior obfuscation quality and strong robustness against reconstruction attacks.

Abstract:
Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomena. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories in interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies.

Abstract:
Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. This shared module is designed to maintain representational diversity and generalization capabilities while remaining sensitive to modality-specific information. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and a MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks--composed audio retrieval and audio-visual retrieval--to more comprehensively evaluate a model's omni-modal embedding capacity. We believe our benchmark will facilitate the development of universal retrieval systems.

Abstract:
Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences because they anchor poses to the first frame, leading to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames under a strictly online, future-invisible setting. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we identify attention bias issues in Transformers, including attention-sink reliance and long-term KV-cache saturation. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention biases and contamination over ultra-long sequences and reduces the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, enabling stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.

Abstract:
We present BRDG, a boundary-responsive differentiable gating superpixel framework designed to resolve the trade-off between computational efficiency and segmentation precision in surgical scenes. At its core, the architecture is organized into three cooperative agents within a fully differentiable backbone. The Region Creator agent converts dense features into learnable superpixel tokens, jointly learning region descriptors and dense context. The Boundary Detector agent acts as a gating mechanism, estimating boundary confidence from region features to predict where refinement is needed. The refinement agent uses this gate to selectively fuse efficient coarse predictions with a high-fidelity refinement path that restores pixel-level details. To further enhance distinctiveness, an adjacency-boosted contrastive loss mines hard negatives across neighboring regions to improve boundary separation. We evaluate BRDG on three surgical tasks requiring high-precision EndoVis18-parts, EndoVis18-tools, EndoVis17-tools, as well as general domain benchmarks. Our model improves mIoU by substantial margins 4.5-7.0 over strong pixel-wise baselines while raising Boundary-F1 scores by over 10 points. Under the same hardware (RTX-A6000 Pro), it reaches 150.25 FPS with only 24M parameters. This makes it 10 times faster and 3.5 smaller than current state-of-the-art models, effectively resolving the critical accuracy-efficiency trade-off in real-time segmentation.

Abstract:
We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an x_1-prediction variant that improves stability over the original u-prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.

Abstract:
Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.

Abstract:
In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Maximum Area Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.

Abstract:
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. In this paper, we propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency.Experimental results demonstrate that ConsistencySolver significantly improves generation quality in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality.

Abstract:
We consider the problem of generating images whose internal structure - defined by the distribution of patches across multiple scales---matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.

Abstract:
Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components--Identifier Assigner and Identifier Adapter--which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines. Codes will be released.

Abstract:
Preference-based finetuning of vision-language models (VLMs) is notoriously unstable, as trivially wrong negatives inject uninformative gradients that distort optimization and degrade calibration. This work revisits this issue through the lens of learning dynamics and identifies a core pathology, the squeezing effect, where easy negatives retain large, misaligned gradients despite having negligible loss.To address this, we propose Cooling-Weighted Direct Preference Optimization (CW-DPO), a two-stage framework that first smooths and then stabilizes the alignment process. Stage 1 employs a constrained SFT phase with low-weight "gentle negatives" to regularize overconfident distributions and flatten the loss landscape. Stage 2 introduces a competence-aware cooling weight that adaptively scales negative gradients according to the model's average per-token log-probability, suppressing uninformative updates while emphasizing hard, on-policy contrasts. This dynamics-aware weighting effectively mitigates the squeezing effect and enables smoother convergence.Extensive experiments on mainstream benchmarks--including COCO, Flickr30k, NoCaps, MMMU, and MMBench1.1--show that CW-DPO achieves state-of-the-art performance, for example +3.4 CIDEr over PPO and +2.4% absolute accuracy on MMMU, while improving calibration and halving convergence steps. This demonstrates that smoothing before cooling constitutes a simple yet general principle for robust VLM preference optimization.

Abstract:
Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.

Abstract:
Occlusion remains one of the major challenges in UAV tracking, where dynamic viewpoints and complex environments often cause partial or complete visibility loss.Existing transformer-based trackers typically regard occlusion as random information dropout, overlooking its structured and spatially correlated nature in real-world scenes.We rethink occlusion modeling in UAV tracking as a structured process governed by spatial dependencies.Based on this insight, we introduce Clustered Occlusion Modeling (COM) to generate realistic, density-adaptive occlusion patterns that enhance feature robustness under partial visibility.Furthermore, we design Cost-Aware Depth Bias (CADB), which employs a depth-dependent prior to adjust inference depth, yielding better efficiency while maintaining competitive accuracy.Integrating COM and CADB into a unified single-stream transformer framework, termed OCTrack, our tracker achieves robust and efficient UAV tracking in occlusion-prone environments.Extensive experiments on multiple UAV benchmarks validate its effectiveness and demonstrate state-of-the-art performance.

Abstract:
Adverse-weather LiDAR point cloud generation is challenged by complex weather-induced degradations. These degradations affect geometry and reflectance in fundamentally different ways, making joint modeling difficult and ambiguous, especially when diverse real-world training data is limited. To address this, we propose Structure-to-Intensity Diffusion (SiD), a diffusion-based framework that explicitly factorizes the denoising process at each time step: it first reconstructs the geometric structure, then conditions reflectance intensity denoising on the estimated structure. This structure-conditioned design decomposes the joint distribution, reduces modeling ambiguity, and leads to point clouds that are both geometrically coherent and radiometrically realistic. To mitigate data scarcity, we introduce Real-Prior Weather Simulation (RPWS), a degradation module that leverages real-world sensor statistics to synthesize physically plausible adverse-weather point clouds from clear scans. Extensive experiments demonstrate that, with similar model complexity, our approach outperforms the previous state-of-the-art in generating adverse-weather LiDAR scans with both structural and radiometric properties more closely aligned with real-world data.

Abstract:
Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.

Abstract:
Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial-semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module end-to-end coupled with a Qwen2.5-VL-based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer reduces native vision tokens by roughly 10-20x, representing each image with about 30 tokens and allowing the agent to maintain hundreds of historical frames within a single context. Experimental results on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression provides the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-inspired efficiency.

Abstract:
We solve the problem of determining the pose of known shapes in \mathbb R ^3 from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches.

Abstract:
Face swapping aims to transfer the identity of a source face onto a target face while preserving target-specific attributes such as pose, expression, lighting, skin tone, and makeup. However, since real ground truth for face swapping is unavailable, achieving both accurate identity transfer and high-quality attribute preservation remains challenging. Recent diffusion-based approaches attempt to improve visual fidelity through conditional inpainting on masked target images, but the masked condition removes crucial appearance cues, resulting in plausible yet misaligned attributes. To address this limitation, we propose APPLE (Attribute-Preserving Pseudo-Labeling), a fully diffusion-based teacher-student framework for attribute-preserving face swapping. Our approach introduces a teacher design to produce pseudo-labels aligned with the target attributes through (1) a conditional deblurring formulation that improves the preservation of global attributes such as skin tone and illumination, and (2) an attribute-aware inversion scheme that further enhances fine-grained attribute preservation such as makeup. APPLE conditions the student on clean pseudo-labels rather than degraded masked inputs, enabling more faithful attribute preservation. As a result, APPLE achieves state-of-the-art performance in attribute preservation while maintaining competitive identity transferability.

Abstract:
The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose--Solve--Evaluate--Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

Abstract:
Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.

Abstract:
Directly modeling the explicit likelihood of the raw data distribution is a key topic in the machine learning area, which achieves the scaling success in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NFs) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is implicitly modeled by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.

Abstract:
Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0x to 1.25x faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.

Abstract:
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional dataset bias. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce CABS (Concept-Aware Batch Sampling), a simple yet effective batch-sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization, to curate batches with the broadest coverage of available concepts, and (ii) CABS-FM (Frequency Maximization), to curate batches with maximal object multiplicity. Through extensive evaluations with four visual backbones and a suite of 28 benchmarks, we demonstrate that CABS significantly benefits Language-Image Pretraining (LIP) and yields highly performant models on long-tailed evaluations. Overall, CABS represents a strong open-source alternative to proprietary online curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks. Both DataConcept and the source code for CABS will be made public.

Abstract:
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge.We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

Abstract:
We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods--which are designed for offline generation and either treat body regions independently or entangle all joints within a single model--LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector-Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-eXpert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR-Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR-Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero-look-ahead conditions.

Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.

Abstract:
Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing a reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.

Abstract:
We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.

Abstract:
Reconstructing High Dynamic Range (HDR) video from alternately exposed Low Dynamic Range (LDR) frames is challenged by large motion, exposure-induced photometric inconsistency, and information loss in saturated or under-exposed regions. Prior HDR video pipelines typically follow an alignment-reconstruction paradigm, which is limited by the precision of alignment and the performance of the fusion module. We propose a new reconstruction framework called Learning Representation-enhanced HDR Video Reconstruction (LRHDR), which is built around two novel components: an Amalgamated Cross-exposure Consistent Representation (ACCR) network and an Adaptive Pixel-wise Sparse Weighted Fusion (APSWF). The ACCR includes an Exposure-aware Interleaved Context (EIC) encoder and a Representation Mapper (RM). The EIC couples a large-field path with a high-fidelity sub-pixel path and an exposure gate to produce exposure-aware features. The RM avoids explicit cross-exposure alignment by mapping features from different exposures into a unified representation via per-pixel, per-channel linear modulation and decoding into the calibrated linear HDR domain. The APSWF treats fusion as pixel-wise candidate selection, producing sparse weighted masks to form a normalized fusion in the linear HDR domain, thereby suppressing artifacts. Extensive experiments on standard benchmarks demonstrate that our LRHDR outperforms previous methods.

Abstract:
Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding and semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), using batch processing to speed up inference by an order of magnitude for online deployment. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering (SQA) on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, reducing position errors by 21.9% and temporal errors by 21.6%, and improving SG3D task grounding accuracy by 27.8% over the most competitive baselines. We release our data and code open-source.

Abstract:
Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics---realistic temporal deformations of static objects under various physical conditions---remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this framework across diverse dynamic object types, showing clear advantages over prior works.

Abstract:
Multimodal contrastive learning typically relies on pairwise similarities for alignment, but recent work has shown that Gramian volumes can capture higher-order correlations across modalities. However, Euclidean Gramian volumes suffer from volume collapse under L2 normalization, concentrating near unity with minimal discriminative variance. Hyperbolic geometry's exponential volume growth naturally addresses this via variance preservation, motivating us to extend Gramian alignment to hyperbolic space. Yet preliminary experiments reveal that pure hyperbolic geometry alone is insufficient: while it preserves variance, it underperforms Euclidean baselines on cross-category discrimination. We introduce HyperGRAM, a hybrid geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing. Using the numerically stable Lorentz model, HyperGRAM enables volumes to serve dual roles: discriminating matched from mismatched triplets while preserving semantic sensitivity within matched pairs that reflects interpretation spaces (the set of valid multimodal realizations). Evaluation across four video-text benchmarks demonstrates that hybrid geometry consistently outperforms both pure Euclidean and pure hyperbolic variants, achieving significant zero-shot improvements with cross-dataset semantic sensitivity exhibiting contrasting correlation patterns.

Abstract:
Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generate a photo of \texttt \ ) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge.We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior.To systematically evaluate unified personalization, we propose \texttt OmniPBench , extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

Abstract:
Vision language models (VLMs) have achieved remarkable success in semantic understanding tasks under language guidance, yet their potential for geometric perception remains largely underexplored. This paper introduces the first VLM-based depth completion framework. With almost no architectural modifications, we propose a sparse depth injection mechanism that extends the capability of VLM toward 3D perception through three key aspects: visual tokenization, textual prompt, and textual supervision. At the visual input side, sparse depth is tokenized to provide absolute scale and accurate geometric cues, alleviating the scale and camera ambiguities of RGB-only inputs. At the textual input side, a binary mask derived from sparse depth serves as a prompt, instructing the model where to complete and where to preserve. At the supervision side, the model is fine-tuned using text labels generated from sparse depth, requiring no ground-truth depth. Benefiting from the strong semantic priors and cross-modal expressiveness of VLM, our framework achieves superior zero-shot performance across diverse sensors, sparsity levels, and scenes.

Abstract:
Multimodal Large Language Models (MLLMs) are increasingly deployed in multi-image scenarios requiring complex reasoning across visual contexts. However, current MLLMs remain fundamentally limited by object hallucination--generating plausible yet factually inconsistent descriptions about objects. Existing benchmarks, designed primarily for single-image settings or providing only high-level multi-image assessments, cannot systematically diagnose how visual complexity and reasoning demands trigger hallucination. To address this gap, we introduce MIOH, a fine-grained multi-image object hallucination benchmark that systematically evaluates object hallucination across four foundational tasks (existence, counting, attribute, position) through three multi-image reasoning patterns (comprehensive, comparative, selective) under three controlled adversarial pressures (visual context scale, perceptual difficulty, contextual bias). Through evaluation of 29 models, we reveal that even state-of-the-art systems like GPT-5 and Gemini-2.5-Pro exhibit distinct failure patterns across different reasoning patterns and tasks. Our evaluation reveals that hallucination stems not merely from perceptual failures but from integration-stage limitations when maintaining object representations across multiple images. MIOH provides a controlled framework for analyzing multi-image object hallucination and serves as a critical evaluation tool for developing more reliable multimodal AI systems.

Abstract:
Large-scale vision-language models (VLMs) such as CLIP have excelled in zero-shot image classification, yet they struggle to achieve the dense cross-modal alignment required by open-vocabulary dense perception (OVDP). While recent self-distillation methods address this by aligning dense features with the generalizable global semantics, a key question remains: how should such dense features be constructed to achieve optimal alignment? To address this, we propose DenseRC, a principled Dense Representations Construction framework that reconstructs CLIP for OVDP based on two key insights.First, by analyzing the internal semantics encoded in the global cls token, we identify that multi-layer value embeddings serve as an informative basis for dense features. Second, we reveal that spatial aggregation tends to amplify semantic misalignment. Motivated by this, we design a lightweight Head-Selective Gating (HSG) module that adaptively reweights feature heads according to their intrinsic heterogeneity, enabling discriminative and alignment-friendly dense representations construction. Extensive experiments demonstrate that DenseRC delivers consistent and substantial gains across OVDP tasks including object detection and semantic segmentation, setting new state-of-the-art performance on multiple benchmarks.

Abstract:
Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in human body enables continuous contrast-free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to the sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence data on missing time, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL) that constructs a patient-specific template to constrain contents to align with this template. For (b), guided by the temporal law that the kinetics follows a consistent smooth trend, we propose Latent Difference Learning (LDL) which extends the unobserved intervals by interpolation and constrain smooth variations in the latent space among interpolated sequence. Extensive experiments on two datasets show our MRI CEKWorld achieves better realistic contents and kinetics. Codes will be available.

Abstract:
Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp and authentic boundaries. Leveraging such analysis, we design a 3D residual regression network that rectifies RV artifacts and achieves superb geometry realism by predicting point-level offsets in 3D space. On top of that, we design a Welsch Loss that helps focus on local geometry and ignore anomalous regions effectively. Extensive experiments over multiple benchmarks including KITTI, KITTI360, nuScenes and Waymo show that the proposed L3DR achieves state-of-the-art generation and superior geometry-realism consistently. In addition, L3DR is generally applicable to different LiDAR diffusion models with little computational overhead.

Abstract:
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens to LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics like object identities, while discrete quantization loses granular details such as textures. We challenge this by introducing HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed to one token via a disentanglement attention mask and a `` bottleneck, ensuring efficient, grounded representations.HTC-VLM achieves an average performance retention of 87.2% across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0% with a 580-to-1 compression ratio. Attention analyses show the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid can resolve the efficiency-fidelity dilemma, advancing scalable VLMs.

Abstract:
Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods either focus on globally shared representations or modality-specific features, while overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, submodally-shared subspaces modeling pairwise cross-modal synergies, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.6% Acc-7 on CMU-MOSEI under the unaligned setting, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspaces disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.

Abstract:
We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.

Abstract:
Class-Incremental Learning (CIL) seeks to acquire new classes over time without erasing prior knowledge. While recent methods leverage pre-trained models (PTMs) to curb forgetting, they largely optimize the feature extractor and overlook the crucial classification head. In this work, we advance a simple view: if each task is equipped with a classifier that has the ability to both recognize in-distribution (IND) classes and reject out-of-distribution (OOD) inputs, CIL arises naturally--inputs are accepted only by heads that deem them in-distribution and rejected otherwise. Supported by rigorous theoretical and empirical studies, we find that this ability is characterized by Inter-class Separation and Intra-class Compactness; lacking these, standard linear and cosine-similarity heads remain closed-set and fail to yield a usable OOD signal. To address this, we propose GOD (Geometry-driven OOD Detectors), which unifies IND recognition and OOD rejection in a single geometric space by replacing the learnable head with fixed Equiangular Tight Frame (ETF) anchors; an ETF loss enforces inter-class separation, and an ArcFace loss further tightens intra-class compactness. For efficiency, we further introduce a parameter-efficient hybrid architecture and an efficient inference strategy, thus reducing both parameter footprint and inference cost. Extensive experiments on multiple incremental settings and datasets show that GOD achieves state-of-the-art results.[^1]: Code and datasets are available in the supplementary material.

Abstract:
Vision Transformers (ViTs) exhibit robust performance across diverse visual scenarios. However, their efficiency is constrained by excessive token counts. Token merging offers a viable solution for achieving efficient ViTs. Existing methods merge tokens based solely on specific characteristics within the attention mechanism, which changes significantly across layers. In this paper, we propose a novel training-free SAliency-Driven Token Merging (SAD-TM) approach by leveraging not only the semantic relevance in the attention space but also the latent visual saliency of input patches. Our SAD-TM is inspired by the discovery that saliency-based statistics can directly capture the causal relationship between model input and output, regardless of the layers. Based on the observation, we develop a method that is mathematically formulated to merge tokens with high saliency outliers. The principle behind our merging is that tokens with high saliency outliers usually imply inconsistencies with the global gradient direction, and thus can be merged safely. Besides, our systematic analysis indicates that class attention shows considerable variation across early blocks, so a deferred merging strategy is introduced to optimize the selection of merging rates. In a training-free manner, SAD-TM demonstrates superior performance across various ViT architectures. Especially, with a FLOPs compression of 23.08% on DeiT-Tiny, SAD-TM achieves a Top-1 Accuracy comparable to the pretrained baseline on ImageNet dataset. The code will be available soon.

Abstract:
Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose MotionHiFlow, a hierarchical flow matching framework to generate motion progressively by constructing flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components.

Abstract:
Frequent and precise land surface monitoring is critical for numerous applications, yet existing satellites struggle to achieve both simultaneously. Spatiotemporal fusion (STF) tackles this challenge by integrating multiple satellite images to generate data with improved temporal and spatial resolution, enabling more frequent and precise land surface observations. However, current methods often fail to recover dynamic landscape changes due to significant scale discrepancies among multi-source images. To address these challenges, we propose a semantic-adaptive diffusion framework for dynamic spatiotemporal fusion (SA-STF), which constrains the solution space using low-resolution and high-frequency features decoupled via a Taylor-inspired decoder. By incorporating temporal feature alignment and semantic-adaptive fusion modules, SA-STF projects multimodal images with temporal dynamics into a unified latent space, and adaptively enhances spatial details while maintaining the spectral consistency of the reconstructed images. Experiments on benchmark datasets demonstrate that SA-STF outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex and dynamic scenes.

Abstract:
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster) - opening avenues for multiple downstream applications.

Abstract:
Due to sensor failures and occlusions during data acquisition, multi-view data often suffer from partial missing samples, thereby producing incomplete multi-view data. Recently, Incomplete Multi-View Classification (IMVC) has become one of the research hot topics, where numerous IMVC methods have been proposed. Although these methods have achieved promising performance by exploiting internal semantic information from partially observed data, they primarily rely on limited internal supervision for view completion. Clearly, this largely constrains their performance ceiling. To overcome this limitation, we propose an EXternal visiOn-driven incomplete mulTi-vIew Classification (EXOTIC) paradigm that incorporates external vision knowledge as semantic guidance, thereby assisting in imputing incomplete views. To the best of our knowledge, it is the first work that leverages external vision knowledge as supervision signals, thereby guiding missing-view completion. Specifically, we first introduce an external vision knowledge library based on a pre-trained vision-language model. Then, we design a Knowledge Filtering module to adaptively select task-relevant knowledge. Afterwards, we present a Knowledge Purification module to align external knowledge with internal representations. Finally, we propose External Completion that leverages the refined knowledge to impute missing views, thereby enhancing the classification decision ability. Extensive experiments on multiple incomplete multi-view datasets demonstrate that the proposed EXOTIC consistently outperforms existing methods, especially under high missing rates.

Abstract:
RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (+5.56 dB) and scalability (80% reduction in training iterations).

Abstract:
Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance.Motivated by controlled experiments on synthetic data, which reveals the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.

Abstract:
Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

Abstract:
Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.

Abstract:
Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.

Abstract:
Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.

Abstract:
Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting restoration quality in practice. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, converting the noise into an i.i.d. form. The diffusion restoration steps are then replaced with noise-whitened updates, which inherit DDRM's closed-form sampling efficiency while enabling restoration under correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.

Abstract:
Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and under-whelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

Abstract:
Prior expressive whole-body conversational avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio-motion synchronization and suppresses micro-articulations critical for realism--such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures--especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motion. A splat-based differentiable renderer preserves identity, texture, and photo realism, while feature-level distillation from a large audio-driven video diffusion model and weak supervision from synthetic audio-conditioned clips further improve synchronization and natural expressivity. Joint photometric and temporal objectives shape the audio-conditioned deformation and rendering. Experiments across diverse speakers show improved lip-audio sync, fine facial detail, and conversational gesture naturalness over pose-driven baselines.

Abstract:
Contrastive pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics spanning multiple image regions.To address this limitation, we propose PowerCLIP, a novelcontrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees.As this approach increases computational complexity exponentially due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from \mathit \mathcal O (2^ M ) to \mathit \mathcal O (M) with respect to the number of regions M, provably approximating the exact loss value with arbitrary precision.Our extensive experiments demonstrate that PowerCLIPoutperforms state-of-the-art methodsin zero-shot classification and retrieval tasks, underscoring compositionality and robustness of our approach. Our code will be made publicly available.

Abstract:
Inverse rendering is an ill-posed problem, but priors such as illumination priors can help simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.

Abstract:
Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks. Code and additional materials are available on the webpage.

Abstract:
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how people reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP, surpassing the previous best by +15.51 absolute improvement even when baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong calibration and robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.

Abstract:
Video generation using recurrent architectures offers compelling efficiency advantages over attention-based transformers, particularly for long-sequence generation. However, chunked processing in recurrent models creates temporal discontinuities that harm long-range consistency. We introduce two complementary memory mechanisms to address this challenge at different granularities: (1) Context Memory maintains persistent global context within attention chunks through learnable sink columns and boundary buffers, adding only 150K parameters (\textless 0.1% overhead); (2) Latent Context-as-Memory (LCaM) extends memory across video segments by storing and retrieving historical latent embeddings, enabling cross-segment consistency without requiring camera annotations or frame reconstruction. Applied to Generalized Spatial-temporal Propagation Networks (GSTPN), our dual-memory approach achieves 1.54xfaster inference than attention-based transformers, while excelling in visual quality metrics. Our approach is particularly effective for knowledge distillation scenarios where only pre-extracted latent embeddings are available. This work demonstrates compelling efficiency-quality trade-offs for practical long video generation.

Abstract:
We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification improves linear probing accuracy for both, notably increasing ImageNet accuracy from 67.8% to 72.7% for MAEs with a VIT-B/16 backbone, while also boosting generation quality for diffusion models. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.

Abstract:
Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.

Abstract:
Anchor-based multi-view clustering is attractive for its efficiency, yet most methods fix the number of anchors a priori, implicitly assuming uniform needs across clusters. In practice, clusters differ in information richness, scale, and intrinsic structure, motivating adaptive per-cluster anchor allocation. We propose Cluster-aware Anchor Learning (CAL), which learns a consensus anchor matrix and organizes its columns into cluster-specific anchor groups. CAL imposes an l2,1-norm column-sparsity penalty on each group to suppress redundancy and preserve cluster-discriminative features, thereby automatically determining how many anchors each cluster retains. To further enhance separability, CAL introduces an inter-cluster regularization that constrains relationships among groups, promoting mutual dissimilarity. This data-driven design learns higher-quality, cluster-aware anchors and yields a more discriminative representation matrix across multiple views. Extensive experiments on multiple benchmarks show that CAL outperforms state-of-the-art multi-view clustering methods, demonstrating superior effectiveness, robustness, and adaptability to heterogeneous cluster structures.

Abstract:
Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.83% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).

Abstract:
Eye images captured from wearable devices such as head-mounted displays (HMDs) contain identifiable biometric cues, posing significant challenges for safe data sharing. Existing eye anonymization techniques often degrade downstream performance, particularly gaze estimation while still retaining iris-recognizable features. Although these methods aim to anonymize the iris, they introduce noticeable visual artifacts that reduce image fidelity. To address these limitations, we propose PrivateEyes, a privacy-preserving framework that synthesizes anonymized yet gaze-consistent eye images. Our approach employs a three-stage pipeline: (1) a deep segmentation network that isolates semantic eye regions and provides structural control signals for synthesis, (2) a pose estimation network (PEN) trained on anatomically accurate synthetic eye renders to infer precise eye pose, and (3) a conditional diffusion model that reconstructs realistic, anonymized eye images conditioned on segmentation and pose. Extensive experiments across multiple benchmark datasets show that PrivateEyes achieves superior gaze-estimation accuracy compared to state-of-the-art anonymization baselines, improving performance by over 10% while reducing iris-recognition accuracy by ~50%. Our method also produces higher-fidelity images compared to other existing approaches. By enabling task-preserving and privacy-secure sharing of eye images, PrivateEyes supports responsible research and development in AR/VR and other gaze-driven applications.

Abstract:
While recent Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, hallucinated responses by over-relying on linguistic priors rather than visual evidence. This critical limitation arises from the lack of a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a novel question-conditioned utilization metric. This metric quantifies the model's actual reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific evidence, and then assessing dependence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial "draft" through targeted feedback from external visual experts. Each expert's output (e.g., boxes, masks) is rendered as visual cues on the image, and the LVLM is re-queried to select the response that yields the greatest improvement in utilization. This process strengthens visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path for designing more interpretable and evidence-driven multimodal agent systems that effectively leverage visual experts.

Abstract:
Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such as "sexual" or "violent", whose wide scope and multi-faceted nature make them difficult to erase reliably.To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.

Abstract:
While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.

Abstract:
Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining both the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

Abstract:
Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster- aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and dis- entangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.

Abstract:
Recent advancements in vision-language models have enabled multi-modal person re-identification (Re-ID), where the system takes both an image and a text query to identify matching individuals. While previous state-of-the-art methods perform well with detailed, sentence-level descriptions, we found that their Recall@1 drops by half when using short, keyword-based queries due to ambiguity, training biases, and under-represented attributes. Despite this challenge, short queries provide a more natural and efficient user experience, requiring less effort and allowing for iterative refinement. To address this limitation, we introduce a new problem setting, Composite-Attributes Person Re-ID (CA-ReID), along with a fine-grained composite attribute dataset with queries belonging to varying levels of ambiguity. We further propose two methods: Dense Disentangling Loss to promote attribute-specific embeddings, and Part-Aware Representations that use pose estimation to align textual attributes with relevant body regions. Our method sets a new state of the art on the new CA-ReID benchmark (up to +17% Recall@1) and performs on par with prior methods on existing CC-ReID benchmarks.

Abstract:
Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear.To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when target is occluded.To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation.Extensive experiments on four mainstream and diverse benchmarks(MOT17, MOT20, DanceTrack, and SportsMOT) covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

Abstract:
Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks. To overcome these limitations, we introduce the Promptable Gaze Target Estimation (PGE) task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person in point [0.52, 0.48]") to identify a specific subject for gaze analysis. This approach integrates subject localization with gaze estimation, and eliminates the rigid dependency on intermediate analysis stages. We develop a scalable data engine to generate Gaze-Co (Gaze Estimation with Concepts), a dataset and benchmark of 120K high-quality, prompt-annotated image pairs. We also propose GazeAnywhere, the first model designed for PGE. GazeAnywhere uses a transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation. GazeAnywhere achieves state-of-the-art performance on multiple PGE benchmarks, setting a strong baseline for this new problem even on a difficult out-of-domain, real-world clinical dataset. GazeAnywhere is open-sourced in github.com/IrohXu/GazeAnywhere.

Abstract:
Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.

Abstract:
While existing point-cloud-registration methods can well handle high-overlap scenarios of two point clouds, they often struggle with low-overlap scenarios, due to inevitable geometric/semantic ambiguities in the non-overlapping regions. In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a high-overlap sub-cloud pairs (anchor pairs) mining problem. Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds, followed by introducing a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final correspondences are generated through a merge-to-match module using the anchor pairs. To train DSAM, we design an alignment-aware weighting loss that uses on-the-fly alignment errors as supervision. Comprehensive experiments on the color-enhanced 3DMatch and 3DLoMatch demonstrate that SuP significantly outperforms state-of-the-art methods, achieving higher registration recall and more accurate alignment, especially under challenging low-overlap conditions.

Abstract:
Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize - Think - Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.

Abstract:
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, parameter-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6x fewer parameters, runs 2x faster, and uses 2x less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or outperforms it on a range of tasks and datasets.

Abstract:
We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation. Code and models will be released publicly upon publication.

Abstract:
Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning, especially in inductive reasoning problems. Neuro-symbolic methods promise to address this by inducing interpretable logical programs from images, though they usually rely on rigid, domain-specific perception modules for this. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the systematic reasoning of symbolic program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Our experiments across synthetic and real-world datasets demonstrate that VLPs outperform both direct and structured prompting of VLMs, particularly on tasks that require complex logical reasoning.

Abstract:
Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.

Abstract:
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception--vision, sound, and language-guided intent--to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce \ourmethod, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that \ourmethod consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.

Abstract:
Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieved impressive results.

Abstract:
Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from non-convex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multi-start strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.

Abstract:
We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.

Abstract:
Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due todifferent characteristics and the absence of dense ground-truth annotations.Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher predictions, which vary across frames and scenes.In this paper, we address the problem of real-world fine-tuning and introduce Verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation.Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions to construct refined pseudo-label trajectories.When applied during fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos.Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods.

Abstract:
We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.

Abstract:
Forecasting gaze behavior is an important task for understanding user intent and creating AR/VR systems that can anticipate where users will look and interact next. While prior works have addressed predicting scanpaths in static images, forecasting gaze in egocentric videos presents new challenges due to the dynamic nature of the scene and the camera wearer's continuous movement through the 3D environment. To address these challenges, we formulate the novel task of egocentric scanpath prediction as forecasting a sequence of future fixations in 3D Cartesian coordinates relative to the last observed camera pose, producing a 3D scanpath that is grounded in the environment. We propose a transformer architecture that leverages egocentric video frames, head pose, and past 3D gaze observations to predict future 3D fixation sequences. We evaluate our method on the Aria Digital Twin dataset. Our findings establish a baseline for the novel task of 3D scanpath prediction and highlight important architectural elements for our task.

Abstract:
Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling.We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations.Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10xand reduce the generation FID from 5.34 to 4.34 on the ImageNet class-condition image generation benchmark.Our study highlights that visual autoregressive models need foresight.

Abstract:
The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of `pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://www.robots.ox.ac.uk/ vgg/research/countgd/.

Abstract:
Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousands samples with 165 thousands densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end to end driving over 20 seconds sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.

Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation seeks to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks such as contrastive learning have enhanced feature discriminability, masked modeling remains underexplored due to architectural constraints and misaligned objectives. We propose Masked Representation Modeling (MRM), an auxiliary task that performs representation masking and reconstruction directly in the latent space. Unlike prior masked modeling methods that reconstruct low-level signals (e.g., pixels or visual tokens), MRM targets high-level semantic features, aligning its objective with segmentation and integrating seamlessly into standard architectures like DeepLab and DAFormer. To support efficient reconstruction, we design a lightweight auxiliary module, Rebuilder, which is jointly trained with the segmentation network but removed during inference, introducing zero test-time overhead. Extensive experiments demonstrate that MRM consistently improves segmentation performance across diverse architectures and UDA benchmarks. When integrated with four representative baselines, MRM achieves an average gain of +2.3 mIoU on GTA -> Cityscapes and +2.8 mIoU on Cityscapes -> Synthia, establishing it as a simple, effective, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.

Abstract:
Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsic governs how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation.The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters.These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics.This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras.Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).

Abstract:
Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S^2-Corr, a state-space-driven text-image correlation refinement mechanism that can mitigate domain-induced distortions and produce a more consistent text-image correlation under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.

Abstract:
Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

Abstract:
3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRC_r by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code will be made publicly available.

Abstract:
Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match to the image poses, which significantly constrains the texture generation quality and flexibility.To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

Abstract:
Polynomial systems in multiview geometry are often over-constrained, and naively removing equations can lead to unstable or inconsistent solutions. We revisit this issue through constraint relaxation---selectively removing equations to obtain a finite set of isolated (and, ideally, well-conditioned) solutions. Focusing on the three-view Kruppa equations for autocalibration, we introduce a minimal relaxation framework for identifying subsets of constraints leading to geometrically valid solutions. Using both symbolic computation and numerical homotopy continuation methods, we enumerate and classify all possible relaxation patterns arising from our formulation of Kruppa's equations, isolating all patterns that yield finitely-many solutions. In experiments with synthetic and real images, we demonstrate empirically that specific relaxations are well-conditioned and consistently outperform multiple baselines, including a more classical Kruppa formulation. Overall, our findings establish the minimal relaxation framework as a practical tool in multiview geometry computation.

Abstract:
What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We call this metric CONVERGE (Coverage Optimized Novel Viewset Estimation for Reconstructing Geometry Efficiently). We integrate our method into the Nerfstudio framework and evaluate it on real datasets within fixed and embodied data acquisition scenarios. Across multiple datasets and radiance-field baselines, our method consistently improves reconstruction quality compared to state-of-the-art active view selection methods.

Abstract:
Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency.

Abstract:
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and Coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior. All the code and weights will be made publicly available.

Abstract:
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.

Abstract:
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Our study reveals that existing prompts-pool-based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent confusion and catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P^2IOD). Leveraging neural networks global evolution properties, P^2IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P^2IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P^2IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.

Abstract:
We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Abstract:
We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

Abstract:
Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.

Abstract:
Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving the compositional T2I generation systems.

Abstract:
We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batch-wise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image benchmarks, and robotic manipulation tasks.

Abstract:
Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.).Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of 0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks.

Abstract:
In remote sensing images, small objects often exhibit low color contrast and blurred edges, leading to suboptimal feature extraction. Physiological studies indicate that the LGN/V1-V2-V4 pathway provides color-opponent sensitivity and hierarchical enhancement for color information extraction, whereas the V1-V4 pathway exhibits strong orientation selectivity for edge extraction. Integrating these complementary visual signals in the V4 region can substantially improve target discrimination. Motivated by these findings, we propose a dual-backbone network (BDNet) to enhance feature extraction for small objects. BDNet adopts a parallel architecture to capture fine-grained features from color and edge cues. Specifically, the color-extraction backbone simulates the color-opponent mechanism in LGN/V1 via a Color Antagonism Module (CAM) to amplify color differences, and further mimics the chromatic processing hierarchy in V2 using a Visual Cortex Hue Enhancement Module (VCHM) to enrich hue representations. Together, these two modules alleviate low color contrast. The edge-extraction backbone simulates the orientation selectivity of receptive fields in V1 through an Orientation Selective Module (OrSM) to select and enhance salient edges, thereby reducing edge blurring from fragmented edge responses. Finally, the two feature types are fused via a Feature Fusion Module (FFM) that emulates integration in V4, yielding a comprehensive feature representation. Experiments demonstrate that BDNet outperforms state-of-the-art methods on the VisDrone2019, NWPU VHR-10, and AI-TODv2 datasets, providing a bio-inspired solution for small-object detection in remote sensing images.

Abstract:
Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects--resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize a human mesh and then anchor Gaussians to the refined surface. In addition, we introduce an implicit shadow estimation module that disentangles cast shadows from the scene, thus supporting plausible human shadow synthesis. Our framework also facilitates human relighting and compositing into novel scenes with contextually appropriate lighting. Quantitative and qualitative results demonstrate that our method achieves state-of-the-art performance, producing consistent appearances, realistic illumination, and enhanced overall scene realism.

Abstract:
Context-dependent (CD) tasks demand the model to have advanced visual understanding ability, such as recognizing camouflaged objects and medical lesions. Current CD methods rely heavily on pixel-level annotated training sets, neglecting issues from redundant samples and the high annotation costs. In this paper, we address the pruning needs of CD datasets, focusing on selecting the most valuable samples for labeling and training using weak annotations. To achieve this, we decompose CD coreset selection into two steps: sample evaluation and coreset selection, proposing corresponding solutions: points-based optimal transport and a maximum distance entropy strategy. Specifically, we formulate sample evaluation as an optimal transport problem between foreground and background distributions, designing a foreground destruction-reconstruction process based on points to compute transport costs and score samples. For samples of varying importance, our selection strategy balances coreset coverage and diversity. We validate our method on six CD tasks, achieving 1% accuracy loss relative to full training under a 40% pruning rate.

Abstract:
Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, an interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code are released at https://research.nvidia.com/labs/dir/duogen/

Abstract:
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions, that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/

Abstract:
Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.

Abstract:
Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ("left of," "behind", etc.) but ignore fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. Project website: https://bop-ask.github.io/

Abstract:
Few-shot learning aims to enable rapid adaptation to unseen tasks using limited data. Optimization-based meta-learning addresses this challenge by acquiring shared prior knowledge across diverse tasks. However, its effectiveness degrades in cross-domain scenarios where unseen tasks differ significantly from training tasks. We identify this degradation as a failure to acquire generalizable prior knowledge, which is fundamentally caused by gradient discrepancies--conflicting update directions arising in the meta-training environment with diverse task distributions. To achieve robust few-shot generalization, we propose Data-Centric Meta-Learning (DCML), a novel framework that mitigates gradient discrepancies by aligning task-specific input distributions with shared prior knowledge. DCML accomplishes this alignment through a meta-learnable visual prompt that is integrated into the entire meta-learning process--unlike previous prompt-based methods restricted solely to test-time adaptation. During meta-training, the prompt transforms each task's inputs to induce more consistent gradients, thereby facilitating the learning of generalizable prior knowledge. Leveraging this robust knowledge, DCML enables rapid and parameter-efficient test-time adaptation by updating only the lightweight prompt and classifier while keeping the backbone frozen.Extensive experiments demonstrate that DCML consistently outperforms baselines, particularly in challenging few-shot cross-domain scenarios, establishing a data-centric perspective for robust meta-learning.

Abstract:
This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight lies that the geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art.

Abstract:
Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging) , a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominantly influences feature representations. Each task's parameter update matrix is projected onto its respective essential subspace for low-rank decomposition before merging. This methodology mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.

Abstract:
Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distill the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model will be publicly available.

Abstract:
Inspired by how humans communicate spatial information, language-guided geo-localization has gained significant traction for its intuitive and practical value. Despite this progress, most methods still rely on a static, one-shot retrieval paradigm, which fails to handle the ambiguity and incompleteness inherent in real-world natural language descriptions. We propose a paradigm shift to reasoning retrieval and introduce Dialogue Place Recognition (DlgPR), which casts localization as an interactive, dialogue-driven reasoning process. To support this new task, we present DlgQuest-Cities, the first large-scale dialogue-based benchmark for place recognition, and a unified reasoning framework that couples a cross-modal multi-level retriever with an intelligent questioner, DQ-pilot. DQ-pilot is trained in a curriculum: supervised fine-tuning on a curated DQ-cities-20k subset followed by reinforcement refinement on a harder DQ-cities-10k split via GRPO. Two task-aligned metrics guide learning: a Discriminative Difficulty Index (DDI) for curriculum sampling and a Positional Retrieval Gain (PRG) reward that directly measures retrieval improvement induced by a question. Experiments show this reasoning-based approach significantly outperforms baselines. The code will be made publicly available upon acceptance.

Abstract:
Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via \pi^3 [??] with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to 15xspeedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ~ 30 FPS.

Abstract:
Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning.

Abstract:
Bundle adjustment is a long-standing problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct question of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of our analysis for a global structure-from-motion pipeline. Applying our proposed algorithm results in significantly cleaner reconstructions by removing misplaced cameras and 3D points.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS and LERF-OVS benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU.

Abstract:
Hand-object interaction forms the foundation of how humans interact with the world. Understanding the connection between hand action and egocentric video is essential for enabling embodied agents to perceive, simulate, and plan like humans. However, it is challenging to learn and predict across hand actions and egocentric videos due to their non-linear relationship. In this work, we introduce HandWorld, a unified generative framework that focuses on hand-object interaction and jointly models egocentric videos and hand actions. HandWorld learns shared cross-domain conditions through a dual-branch condition network that integrates information from both video and action domains. MANO-rendered hand representation is incorporated as an intermediate input to further enhance cross-domain coherence. Conditioned on the shared representation, two decoupled diffusion transformers are trained to predict in their respective domain. A flexible training strategy enables the model to learn across diverse task configurations, including action forecasting and controllable video generation. Experiments on large-scale egocentric HOI datasets demonstrate that HandWorld achieves high-fidelity video synthesis and accurate action prediction, outperforming existing baselines across diverse scenarios.

Abstract:
This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose "4DSurf", a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.

Abstract:
Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

Abstract:
In recent years, the increasing demand for smaller and more powerful semiconductors highlighted the critical role of lithography--a key stage in semiconductor manufacturing responsible for precise mask design and wafer patterning. To meet these demands, the semiconductor industry has increasingly adopted computational lithography, employing machine learning and deep learning techniques to accelerate advancements in lithographic technology. Despite the various research efforts and successes in computational lithography, there remains a lack of explicit incorporation of physical principles. This gap limits the ability of existing methods to fully capture the complex physical phenomena inherent in lithography behaviors. To bridge this gap, we propose OptiCo, a novel convolutional neural network that seamlessly integrates optical diffraction principles into its architecture. At its core, OptiCo employs an optical phase kernel to model phase variations resulting from light propagation, effectively capturing the physical interactions among light, masks, and wafers. We evaluate OptiCo on semiconductor lithography benchmarks, demonstrating its superior performance in mask optimization tasks, with its remarkable generalization capabilities in OOD datasets.

Abstract:
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training.In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy.Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round.Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit(+4.4%), GEdit(+3.1%), and Kris(+11.5%) when initializing our DiT from the Step1X-Edit(ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit(ReasonEdit-Q).Code and checkpoints will be open-sourced.

Abstract:
Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks.

Abstract:
Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the parts to edit and the edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.

Abstract:
Multimodal large language models (MLLMs) achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text-visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method, that addresses both biases without auxiliary models, and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to understand the contribution of individual tokens--visual features and text tokens--to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLMs outputs.

Abstract:
Event cameras provide high-temporal-resolution, motion-centric measurements that remain reliable under fast motion and challenging illumination, making them a promising sensing modality for motion deblurring. However, existing deblurring methods typically require large-scale paired blur--sharp datasets, which are extremely difficult to obtain in real-world settings, especially when an additional modality such as events is involved. In this work, we introduce EMP, an event-based motion deblurring framework that operates entirely in an unpaired setting, removing the need for aligned blur--sharp supervision. EMP bridges the disjoint blur and sharp domains through event information and leverages two complementary training mechanisms tailored to the unpaired regime: (1) an event-based physical prior with confidence masking that provides reliable self-supervisory signals for blurry inputs, and (2) a generative blur modeling process that extracts blur-related frequency-domain cues from blur--event pairs and transfers them to sharp images to synthesize realistic blur. Together, these mechanisms enable stable and effective deblurring without paired labels. Extensive experiments on various real event datasets, including REBlur, EventAid, and HighREV, show that EMP outperforms existing unpaired baselines and achieves performance competitive with paired methods.

Abstract:
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model, without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.

Abstract:
We present ReViSe (_Reasoning with Video Sparsity_), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA). Existing vision-language models (VLMs) uniformly sample video frames, which introduces redundancy or irrelevancy. In contrast, ReViSeinteractively selects informative frames through multi-round reasoning. To achieve this, ReViSe includes three modules: a multi-round conversation module that retains frame selection history as memory; a reasoning tracer that maintains a chain-of-thought across rounds; and a self-correction mechanism that enforces structural and behavioral validity. ReViSe integrates seamlessly with both proprietary and open-source VLMs. It supports proprietary models in a "plug-and-play" manner and enables reinforcement fine-tuning for open-source models. Experiments on multiple VQA benchmarks show that ReViSe improves the video understanding ability of VLMs by improving accuracy while reducing the number of frames used.

Abstract:
We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Placket-Luce loss, which is now extensively used for preference alignment tasks in foundational models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss, in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then uses another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID example, but the probability of it respecting the ranking classification is likely to be small. RankOOD, achieves SOTA performance on the near-ODD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.

Abstract:
Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose SASNet, a Spatially-Adaptive Sinusoidal Network that couples a frozen frequency embedding layer, which explicitly fixes the network's frequency support, with jointly learned spatial masks that localize neuron influence across the domain. This pairing stabilizes optimization, sharpens edges, and suppresses noise in smooth areas. Experiments on 2D image and 3D volumetric data fitting as well as signed distance field (SDF) reconstruction benchmarks demonstrate that SASNet achieves faster convergence, superior reconstruction quality, and robust frequency localization--assigning low- and high-frequency neurons to smooth and detailed regions respectively--while maintaining parameter efficiency.

Abstract:
In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, Progressive-Avatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

Abstract:
Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.

Abstract:
Human pose estimation (HPE) underpins critical applications in healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. %Conventional anonymization techniques offer weak protection and are unquantifiable, while Differential Privacy (DP) provides formal guarantees but often results in steep performance costs.We introduce the first unified framework for differentially private 2D Human Pose Estimation (2D-HPE) that achieves strong privacy-utility trade-offs for structured visual prediction through complementary noise mitigation mechanisms. Our Feature-Projective DP integrates: (1) subspace projection that reduces noise variance by a factor k/p by restricting gradient updates to a k-principal subspace within the full p-dimensional parameter space, and (2) feature-level privacy, which selectively privatizes sensitive features while retaining public visual cues. Together these mechanisms yield a multiplicative utility gain under formal privacy constraints.%We further propose a feature-projective hybrid that combines both the mechanisms within a single post-processing framework.Extensive experiments on MPII and HumanART datasets across privacy budgets (\varepsilon \in \ 0.2, 0.4, 0.6, 0.8\ ), clipping thresholds (C \in \ 0.01, 0.1, 1.0\ ) and training strategies demonstrate consistent improvements over vanilla DP-SGD. At \varepsilon=0.8, our method achieves 82.61% PCKh@0.5, recovering 73% of the privacy induced performance gap. Cross-dataset evaluation on the HumanART confirms generalization (51.6 AP). Our study provides the first rigorous benchmark and a practical blueprint for privacy-preserving pose estimation in sensitive, real-world applications. Our source code will be made public on acceptance.

Abstract:
Contemporary commercial and open-source diffusion models have demonstrated remarkable performance in text-to-image generation, enabling widespread applications in creative design and content creation. However, legitimate requirements--such as copyright protection, privacy compliance, or personalized customization--often necessitate the removal of specific semantic concepts from pretrained models. Existing concept erasure methods suffer from two critical limitations: (1) Incomplete suppression, where the model still occasionally generates images containing the target concept; (2) Poor semantic selectivity, which degrades the generation quality of unrelated concepts and compromises overall model utility.To address these challenges, we propose `MapRoute`, a lightweight, semantics-aware concept erasure framework based on dynamic routing. Our approach introduces a set of modular components--termed Mappers--placed after a frozen pretrained text encoder. Each Mapper learns a linear mapping from a target concept to a surrogate concept. During inference, the system dynamically activates the top-K Mappers most relevant to the input prompt, based on cosine similarity between the text embedding and all the target concept embeddings, and applies their transformations sequentially. This input-driven, modular intervention enables precise, on-demand erasure while avoiding unnecessary interference with irrelevant semantics.Extensive experiments demonstrate that `MapRoute` effectively suppresses specified concepts while significantly reducing collateral damage to unrelated concept. By operating without full-model fine-tuning, our method entirely avoids parameter drift and concept erosion. Moreover, `MapRoute` outperforms state-of-the-art baselines in terms of generation fidelity, semantic consistency, and scalability to multi-concept erasure scenarios.

Abstract:
Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.

Abstract:
Motion blur is a common degradation in dynamic imaging. Recent studies have moved beyond restoring a single sharp image from a blurred input and instead target blur decomposition: recovering a temporally continuous sharp video sequence from one motion-blurred image. Event cameras, with their microsecond temporal resolution, can effectively alleviate motion ambiguity. However, existing event-based methods often fail to explicitly model time-aligned event-image features. How to accurately exploit event data to reconstruct frames at different time instants remains largely underexplored. In this paper, we propose TSANet, an event-based blur-to-video decomposition method that time-specializes both event features and image features for alignment. Specifically, we introduce a Relative Time-Encoded Attention module that steers event features toward motion information relevant to a given target time, and a Timesurface Dynamic Warping module that warps image features into the spatial configuration corresponding to that time. With time-specialized motion and image features explicitly aligned at arbitrary query times, our framework can decompose a single blurred image into a high-frame-rate sharp video sequence. In addition, we collect a new dataset containing real events and high-quality color videos, and synthesize blurred inputs by averaging sharp frames to evaluate our method. Experiments on multiple datasets with both synthetic and real events demonstrate that our approach consistently outperforms previous state-of-the-art methods on the blur decomposition task.

Abstract:
In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real-world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.

Abstract:
Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful--that is, they represent the model's internal computation--requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) are seemingly more interpretable, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check--explanations with random concepts should be less faithful-which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code is released.

Abstract:
We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.

Abstract:
AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.

Abstract:
Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important.However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM\textsuperscript 3 , a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM\textsuperscript 3 mounts attacks by perturbing visual inputs, textual inputs, and their fused visual-textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM\textsuperscript 3 on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.

Abstract:
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has extensively explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising, yet under-explored, paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a 'motion edit' and demonstrate that this representation, when coupled with a generative backbone, enables many powerful video editing capabilities. To achieve this, we introduce a novel pipeline for generating `motion counterfactuals' -- video pairs that share identical content but distinct motion -- and fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a 4-way head-to-head user study, our model achieves over 65% preference against prior work.

Abstract:
Distinct from attention-based compression methods, this paper presents an information uniqueness driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate the vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens to link with reconstruction error. Based on uniqueness, we design three modules--Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression--that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.

Abstract:
Label-free image anomaly detection is difficult because anomalies must be separated from intra-normal variability. Diffusion models learn a manifold for normal data, and, under the common assumption that off-manifold anomalies are harder to generate and yield larger prediction errors, many methods build detectors from prediction residuals; yet reverse-process stochasticity and complex but normal structure also produce large residuals, so magnitude alone is non-diagnostic. To clarify what is recoverable from such noisy residuals, the theory examines how residual signals propagate through later reverse steps, showing that variability consistent with normal statistics is gradually absorbed toward stationarity, whereas anomalous regions retain an additional non-stationary signal that persists. Building on this insight, the Residual-Evolution Field (REF) isolates this persistent signal, with labeled source data calibrating the extractor and Cross-domain Field Alignment (CFA) transferring it to unlabeled targets. A theoretical framework with formal analysis is established, and experiments across multiple benchmarks under substantial domain shifts demonstrate state-of-the-art performance, improving over strong baselines by 2.01-14 percentage points (pp).

Abstract:
Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple choice questions. We introduce a novel open-ended multimodal VideoQA benchmark, MovieRecapsQA, created using movie recap videos -- a distinctive type of YouTube content that summarizes a film via a voiceover description of key clips from the movie (recap video). From the transcribed voiceover (recap summary) of 60 recap videos, we generate 8.2K questions along with the necessary "facts" expected in each answer; the former facilitates the creation of questions that require mutimodal reasoning and the latter allow the construction of a reference-free evaluation metric that can be applied to open-ended responses. To our knowledge, this is the first reference-free open-ended VideoQA benchmark. The benchmark allows each question to be evaluated in different input video settings: given (a) the full-length movie, (b) the full ( 11 min) recap video (visual only), (c) 14 min of aligned movie scenes, i.e, movie scenes relevant to the question, and (d) 1.2 min of aligned recap video scenes. In all cases, the text of any associated movie dialogue is provided. Each question is categorized by the modality required to answer it---visual, dialogue, or both---enabling fine-grained evaluation of multimodal capabilities. We benchmark (setting (d)) seven state-of-the-art MLLMs and find that (i) only our reference-free metric produces meaningful human-aligned model separation; (ii) vision-centric questions yield the lowest scores across all models; (iii) removing visual input often improves model factuality; and (iv) the primary bottleneck is visual perception, not visual reasoning.

Abstract:
Unsupervised Concept Extraction aims to extract concepts from a single image, yet existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.

Abstract:
Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.

Abstract:
We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a recitified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.

Abstract:
Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Extensive experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively.

Abstract:
Learning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consistency throughout the evolving sketch state. In addition, a hybrid positional encoding integrates global modeling progression with local geometric semantics, reinforcing structural coherence during both learning and completion. Extensive experiments demonstrate that CADSketcher achieves superior geometric validity and instruction consistency across diverse sketch completion tasks, offering a robust and interpretable framework toward intelligent CAD automation.

Abstract:
Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl's causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other's missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks. The Code can be available.

Abstract:
Existing GUI benchmarks primarily focus on evaluating models' comprehensive capabilities but largely overlook hallucination phenomena in grounding tasks, which are crucial to the reliability of GUI understanding. In this work, we expose two major types of hallucinations in GUI grounding: 1) Confusion Hallucination, where distractor elements are mistakenly selected, and 2) Fabricated Hallucination, where nonexistent elements are hallucinated with plausible coordinates. To systematically investigate their origins, we introduce GUI-HalluBench, a benchmark comprising two complementary subsets: a parsing subset for assessing structural representation of GUI elements and a hallucination subset for measuring grounding robustness under challenging conditions. This design allows us to associate hallucination patterns with deficiencies in prerequisite abilities: parsing errors are closely tied to both fabricated and confusion hallucinations. Experiments on state-of-the-art models confirm these connections, offering new insights into the root causes of hallucinations and guiding the development of more reliable GUI understanding tools.

Abstract:
Creating realistic 3D animation remains a time-consuming and expertise-dependent process, requiring manual rigging, keyframing, and fine-tuning of complex motions. Meanwhile, video diffusion models have recently demonstrated remarkable 2D motion imagination, generating dynamic and visually coherent motion from text or image prompts. However, their outputs lack explicit 3D structure and cannot be directly used for animation or simulation. We present AniMimic, a framework that animates static 3D meshes using motion priors learned from video diffusion models. Starting from an input mesh, AniMimic synthesizes a monocular animation video, automatically constructs a skeleton with skinning weights, and refines its joint parameters through differentiable rendering and video-based supervision. To further enhance realism, we integrate a differentiable simulation module that refines mesh deformation through physically grounded soft-tissue dynamics. Our method bridges the creativity of video diffusion and the structural control of 3D rigged animation, producing physically plausible, temporally coherent, and artist-editable motion sequences that integrate seamlessly into standard animation pipelines.

Abstract:
Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7x and 7.5x fewer training steps at 256x256 and 512x512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.

Abstract:
With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for simple editing that only contains a single editing target. However, to satisfy the exploding editing requirements, the complex editing that contains multiple editing targets is posed as a more challenging task. However, current complex editing solutions: single-round and multi-round editing are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency.In this paper, we propose FlowDC, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observed that the velocity quantity that is orthogonal to the editing displacement harms the source structure preserving. Thus, we decompose the velocity and decay the orthogonal part for better source consistency.To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal--rewarding the confidence growth in the ground-truth answer--effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.

Abstract:
Large-scale video diffusion models exhibit strong world-simulation and temporal reasoning capabilities, yet their potential as zero-shot image editors remains underexplored. We present \ifedit IF-Edit (Image Edit by Generating Frames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. \ifedit IF-Edit addresses three core obstacles--prompt misalignment, redundant temporal latents, and blurry late-stage frames--via: (1) a Chain-of-Thought Prompt Enhancement module that reformulates static editing instructions into temporally grounded reasoning prompts; (2) a Temporal Latent Dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving global semantics and temporal coherence; and (3) a Self-Consistent Post-Refinement step that refines the sharpest late-stage frame through a brief still-video trajectory, leveraging the video prior for sharper and more faithful results. Extensive experiments across four public benchmarks--covering non-rigid deformations, physical and temporal reasoning, and general instruction editing--show that \ifedit IF-Edit achieves strong performance on non-rigid and reasoning-centric tasks while remaining competitive on general-purpose edits. Our study offers a systematic view of video diffusion models as image editors, revealing their unique strengths, limitations, and a simple recipe for unified video-image generative reasoning.

Abstract:
3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. The project page is available at ndkhanh360.github.io/project-rimamba.

Abstract:
We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

Abstract:
Multimodal Large Language Models (MLLMs) increasingly need to forget specific knowledge, such as unsafe or private information, without full retraining. However, existing unlearning methods often disrupt vision-language alignment, causing models to reject both harmful and benign queries simultaneously. We trace this failure to the projector network: during unlearning, its Jacobian becomes severely ill-conditioned, leading to unstable optimization and drift in cross-modal embeddings. We introduce SineProject, a simple approach that augments the frozen projector with sinusoidally modulated trainable parameters that improve the Jacobian's spectral conditioning and stabilize alignment throughout unlearning. Evaluated across standard safety and privacy unlearning benchmarks using LLaVA-v1.5-7B and 13B, SineProject reduces benign-query refusals while achieving complete forgetting of targeted information, delivering state-of-the-art forget-retain trade-offs with negligible computational overhead

Abstract:
Ocean dynamics drive global climate patterns and extreme weather events, making accurate spatiotemporal forecasting essential for climate monitoring and marine operations. Traditional Global Ocean Forecasting Systems (GOFSs) offer high accuracy predictions, yet remain computationally expensive and fail to fully leverage growing historical data. Recent deep learning models have achieved notable success, but still face three fundamental challenges: (1) they homogenize ocean variables despite strong physical coupling via equation-of-state relationships; (2) they neglect spherical geometry, resulting in severe distortions at high latitudes; and (3) they struggle to model multi-scale temporal dynamics. We introduce PhyOceanCast, a physics-informed diffusion model that overcomes these limitations through two key innovations. First, the Spherical Graph Attention Network for Multi-scale Ocean Coupling (SGAN-MOC) preserves spherical topology while enabling cross-variable interactions via heterogeneous encoding and k-hop-constrained attention. Second, the Physics-Informed Wavelet Temporal Coherence (PWTC) module that decomposes ocean dynamics across multiple scales with advection-diffusion constraints. PhyOceanCast forecasts 145 ocean variables, including temperature, salinity, and velocity fields, across 36 depth levels plus sea surface height. Extensive experiments demonstrate superior performance over diffusion, transformer, and hybrid baselines, promising a new paradigm for global ocean canonical variable forecasting.

Abstract:
Pretrained vision encoders are widely used as frozen, black-box feature extractors, yet they often inherit spurious correlations that disproportionately harm underrepresented groups. We introduce PLD-Debias, a fully black-box debiasing framework that requires neither access to backbone parameters nor demographic annotations. Our method integrates three components: (1) Rank-Regularized Amplification, a lightweight adapter that exaggerates latent spurious directions; (2) Unsupervised Pseudo-Bias Induction, which clusters amplified features to infer high-fidelity proxy bias labels; and (3) Bias-Guided Refinement, combining supervised contrastive alignment with cluster-aware adaptive margins to purify representations and equalize decision boundaries. We theoretically show that these components jointly tighten a worst-group risk bound under spurious correlations. Empirically, PLD-Debias achieves state-of-the-art worst-group accuracy across CelebA, Waterbirds, and CMNIST, improving performance by 3-5 points over prior black-box methods while maintaining average accuracy. Remarkably, our pseudo-bias labels align with ground-truth bias annotations at over 90% fidelity, enabling oracle-level robustness without demographic supervision. Our results demonstrate that fairness and utility can be achieved through a plug-and-play classifier adapter for any frozen foundation model.

Abstract:
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.8% on VSI-Bench, 43.3% on MMSI, 85.7% on MindCube, 54.7% on ViewSpatial, 47.7% on SITE, 63.9% on BLINK, 55.5% on 3DSR, and 72.0% on EmbSpatial, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. All newly trained multimodal foundation models are publicly released.

Abstract:
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination by a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights as distortion, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations for LVLM. Our code will be available.

Abstract:
Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods (90% lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics (2.2 x higher user study score), achieving the best results across all evaluation metrics.

Abstract:
Event cameras provide accurate information at motion boundaries--exactly where disentangling ego-motion, object motion, and border ownership determines segmentation quality. We argue that the missing ingredient in dynamic scene interpretation is moving border ownership: detecting motion boundaries and assigning which side is foreground so occlusions are resolved by design. Traditional geometric motion segmentation pipelines (e.g., flow clustering, simple motion models) remain assumption-heavy and slow, while deep models often fail to generalize across sensors or datasets. We introduce a lightweight, ownership-aware predictor trained solely on synthetic events with perfect supervision for boundaries, ownership, and motion, generated via a Blender pipeline. Its key targets--a signed-distance ownership field and a motion mask--focus learning where events occur and yield stable gradients. The model runs in real time and generalizes without tuning: trained on synthetic events, it achieves zero-shot transfer on EED, EVIMO1, EVIMO2, and EMSMC, delivering state-of-the-art performance. By casting motion segmentation as ownership-aware edge understanding, we combine the robustness of model-based reasoning with the scalability of learning.

Abstract:
Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.

Abstract:
Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI)--the ability to respect and manipulate 3D spatial constraints during image generation--and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling.Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning--establishing a new pathway for advancing spatial intelligence in multimodal models.

Abstract:
Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts (such as the leading silence). Among the investigated features, audio-informed representations generalize best and achieve state-of-the-art results. However, generalization to realistic in-the-wild data remains challenging. Our analysis indicates this gap stems from intrinsic dataset difficulty rather than from features latching onto superficial patterns.

Abstract:
Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions.QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning.

Abstract:
Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.

Abstract:
Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. Point tracking serves as a key mechanism for capturing dynamic motion. However, conventional single-view approaches often fail due to the limited geometric information available in monocular video, which becomes a critical bottleneck for multi-view extension. In this work, we present MV-TAP, a robust point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.

Abstract:
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual \perception to improve \accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe \confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority. Our code will be released after the acception.

Abstract:
Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256x256 and a state-of-the-art 1.52 at 512x512 with SiT-XL, while reducing token usage by up to 64x. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.

Abstract:
Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets will be released publicly.

Abstract:
Can one perceive a video's content without seeing its pixels, just from the camera trajectory--the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed provide valuable cues about "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.

Abstract:
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

Abstract:
Vision-Language Model (VLM) is now an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar structure, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. Also, we design a novel open-vocabulary segmentation paradigm with an in-context learning strategy to precisely localize fine-grained object parts for manipulation (e.g., cup handle, teapot opening) effectively with the shortest inference time over all state-of-the-art parallel works. We then formulate new metrics for action-generalizability and VLM-comprehensibility to evaluate mainstream representations, demonstrating the strong performance of SEAM on both aspects. Extensive real-world experiments further manifest the SOTA performance of SEAM under varying settings and tasks.

Abstract:
Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.

Abstract:
The expansion of Transformers and the collection of high-quality multimodal datasets have propelled deep neural networks to achieve unprecedented performance in vision and language tasks. However, applying these advances is non-trivial in real-world applications. The extensive number of parameters complicates model updates, and real-world data often features a long-tailed distribution along with noisy labels. To address the above issues, we propose to explore the internal structure of the neural network for learning with sample relationships, rather than just increasing the number of model parameters. Specifically, we introduce RetFormer, a model enhanced with a multimodal knowledge base for storing world knowledge, and a retrieval cross-fusion module designed to establish robust multimodal sample relationships by leveraging content from the knowledge base. RetFormer establishes a robust relationship between image and text modalities by integrating information from external knowledge bases into the model's decision-making process, thus overcoming the limitations of traditional approaches on model size and datasets. Our experiments demonstrate the benefits of integrating large-scale image-text datasets into vision tasks and exemplify the importance of modeling the relationship between image and text modalities. We have evaluated our approach on the task of long-tailed recognition and learning with noisy labels and have shown that it achieves state-of-the-art accuracies.

Abstract:
Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.

Abstract:
Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

Abstract:
We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.

Abstract:
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

Abstract:
Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra \mathbb R ^_ 2,0,1 and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.

Abstract:
Large multimodal models (LMMs) have shown promising performance for various document recognition tasks. However, LMMs adopt implicit modeling, and the parameters lack interpretability. Inspired by recent advances in human memory and learning research, we propose an explicit multiscale prototype memory that augments document recognition models, explicitly modeling recurrent layout and stylistic patterns across different spatial resolutions. A Memory Retrieval Mechanism enables local regions to sparsely attend to a few prototypes (e.g., image borders, tilted text); the retrieved compositional factors are concatenated with visual features and passed to the decoder, providing explicit region-wise structural context. Prototype memory consolidation updates and stabilizes prototypes via attention-weighted exponential moving average (EMA) strategy, while sparsity and anti-collapse regularization promote selective activation. We further adopt hierarchical memory for multi-resolution encoding. The proposed DREAM module is a plug-and-play component, allowing seamless integration into various encoder-decoder architectures. We validate on two tasks including document recognition on public datasets and the self-built DreamDoc dataset, and handwriting recognition on the SCUT-HCCDoc and SCUT-EPT datasets. Experimental results show that the proposed method is effective.

Abstract:
Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL).Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures.To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario.To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase.The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator.Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.

Abstract:
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plucker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.

Abstract:
Parameter Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the fine-tuned parameters scale by only fine-tuning a few spectral coefficients. Its basic assumption is that the weight change \Delta W is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of weight change is not sparse, but instead distributed like power-uniform. This fact implies that fine-tuning only a few spectral coefficients is insufficient to accurately model the weight change \Delta W with uniform spectrum.To address this issue, we propose to seek an invertible transformation that can transform a latent spatial-domain matrix with sparse spectrum to the weight change, and then perform PEFT on such sparse spectrum domain with few spectral coefficients, called \text S ^2\text FT . To seek such transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by that sparse spectrum often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement operation on the pre-estimated weight change that smooth spatial structures while keep the structure information of neurons.Finally, we propose to solve the rearrangement search problem in a simple nearest neighbor search manner, thereby obtaining the invertible transformation. Extensive results show our \text S ^2\text FT achieves superior performance by only using 0.08% training parameters.

Abstract:
3D Referring Expression segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to concentrate on the target object under positional relationship guidance. Extensive experiments on two benchmark datasets, i.e., ScanRefer, and Multi3DRefer, validate the effectiveness of the proposed method Position3D.

Abstract:
Fast Adversarial Training (FAT) has proven effective in enhancing model robustness by encouraging networks to learn perturbation-invariant representations.However, FAT often suffers from catastrophic overfitting (CO), where the model overfits to the training attack and fails to generalize to unseen ones. Moreover, robustness-oriented optimization typically leads to notable performance degradation on clean inputs, and such degradation becomes increasingly severe as the perturbation budget grows.In this work, we conduct a comprehensive analysis of how guidance strength affects model performance by modulating perturbation and supervision levels across distinct confidence groups.The findings reveal that low-confidence samples are the primary contributors to CO and the robustness-accuracy trade-off. Building on this insight, we propose a Distribution-aware Dynamic Guidance (DDG) strategy that dynamically adjusts both the perturbation budget and supervision signal. Specifically, DDG scales the perturbation magnitude according to the sample confidence at the ground-truth class, thereby guiding samples toward consistent decision boundaries while mitigating the influence of learning spurious correlations. Simultaneously, it dynamically adjusts the supervision signal based on the prediction state of each sample, preventing overemphasis on incorrect signals. To alleviate potential gradient instability arising from dynamic guidance, we further design a weighted regularization constraint.Extensive experiments on standard benchmarks demonstrate that DDG effectively alleviates both CO and the robustness-accuracy trade-off.

Abstract:
Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement.2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

Abstract:
We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 611 mW of power in total. SpiderCam comprises a custom camera which simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.

Abstract:
Empowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure, while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character dataset with refined stroke contour annotations. To tackle the technical challenge, we propose Graph-based Handwriting Trajectory Reconstruction (G-HTR), a novel method using contour-aware graphs to jointly model stroke contour and character structure. We use a Graph Neural Network to capture structural relationships among nodes and introduce a multi-scale graph learning strategy to encode both fine-grained stroke details and global character structure. Extensive experiments verify the effectiveness of G-HTR, outperforming previous state-of-the-art methods on both our CHTR-110K and the widely-used CASIA-OLHWDB dataset. G-HTR further shows strong real-world results when deployed on robots, confirming its practical value. To support future research, we will release source code and dataset.

Abstract:
Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extreme compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset, surpassing the previous video compression methods.

Abstract:
We introduce MotionEdit, a novel dataset for motion-centric image editing--the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility.Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation.To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics.Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models.To address this gap, we propose MotionNFT (Motion-guided Negative-aware FineTuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations.Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.

Abstract:
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we propose the Verifier-Based Reasoning Reward Model (RRM), which breaks instructions into verifiable principles, evaluates the edited images against each principle, and aggregates fine-grained scores to reduce hallucinations and provide more interpretable criteria. To address this, we argue the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework to build a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and leverage it into the downstream editing task. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each, and aggregates these checks to provide an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

Abstract:
Long-horizon navigation in complex urban environments relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision-Language AI agent and a human operator collaborate on maneuvering the mobile machine, presents a promising solution to address these issues. However, existing shared autonomy methods often require humans and AI to operate within the same action space, leading to high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi-modal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder to align various human instructions with visual and spatial context. To facilitate training, we construct MM-CoS, a large-scale dataset comprising teleoperation and vision-language descriptions. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation. Moreover, under similar takeover conditions, our shared autonomy framework reduces the frequency of takeovers by more than 44%. Demo video and more detail are provided in the project page.

Abstract:
Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing.

Abstract:
The astonishing proficiency and unprecedented level of realism of diffusion models in creating and manipulating images have undoubtedly drawn concerns.Many methods have been proposed to detect generated images. Typically, they usually take RGB images as input, and use backbones like ResNet, CLIP visual encoder to extract features. Even though these backbones are capable to detect fake images, they are mainly designed to extract the high-level semantic information, rather than inherently designed for fake image detection. To this end, in this paper, we want to optimize the embedding space tailored for detecting fake images via representation learning. We notice that Neighboring Pixel Relationships (NPR) is capable to capture the intrinsic forgery clues, which means that NPR may be a good input to perform representation learning that aims at learning the embedding space tailored for detecting fake images.Therefore, we leverage features from both RGB modality and NPR modality to perform two proposed representation learning methods, Cross-Modal Contrastive Learning (CMCL) and Cross-Modal Mutual Distillation (CMMD), in order to learn the forgery-aware embedding space. The CMCL boosts the discrimination of features between real and fake images, while the CMMD simultaneously transfers the learned knowledge between two modalities, being able to learn compact features within the intra-class. CMCL and CMMD work collaboratively so that each modality learns a more comprehensive forgery-aware representation to distinguish real and fake images.Extensive experiments on GenImage, DRCT-2M, and Co-Spy-Bench datasets show that our method achieves state-of-the-art results.

Abstract:
LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. However, existing methods are either handcrafted for specific sensor configurations or rely on costly per-point manual labels, severely limiting their generalization and scalability. To overcome this, we introduce TerraSeg, the first self-supervised, domain-agnostic model for LiDAR ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes data from 12 major public benchmarks. Spanning almost 22 million raw scans across 15 distinct sensor models, OmniLiDAR provides unprecedented diversity for learning a highly generalizable ground model. To supervise training without human annotations, we propose PseudoLabeler, a novel module that generates high-quality ground and non-ground labels through self-supervised per-scan runtime optimization. Extensive evaluations demonstrate that, despite using no manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception while delivering real-time performance. Our code and model weights are publicly available.

Abstract:
Atmospheric turbulence causes spatially varying distortion and blur in long-range imaging, making restoration highly challenging in real-world applications. Supervised methods rely on synthetic training data, whose simulated degradation often cannot faithfully reflect real turbulence. Existing unsupervised methods usually estimate degradation parameters for each frame independently, without exploiting the shared correlations among frames from the same scene. We propose TMFS, an optimization-based and physics-grounded method for unsupervised turbulence restoration. TMFS is based on a physically motivated tilt-then-blur degradation model and represents frame degradations through a shared turbulence structure. By decomposing the distortion and blur of each frame into a scene-shared correlation function and per-frame noise maps, TMFS enables cross-frame information sharing and alleviates the ill-posedness of framewise estimation. Experiments on synthetic and real datasets demonstrate the effectiveness of TMFS and its strong generalization to real turbulence.

Abstract:
Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the environment's evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evidence to refine the latent spacetime representation. When queried for any time instant, whether before, at, or beyond the timestamp of the last update. A readout procedure predicts temporally conditioned point maps and camera parameters describing the scene geometry at the queried time. Unlike prior approaches for online dynamic scene reconstruction that estimate each frame's point map solely at the timestamp of the last observed frame, Point4Cast achieves coherent reconstruction across any queried time. Empirical evaluations show that Point4Cast achieves state-of-the-art performance on streaming dynamic scene reconstruction and forecasting benchmarks, across multiple challenging datasets, while providing scene flow estimation and forecasting for free. The code will be released publicly.

Abstract:
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers---including off-the-shelf VQA models and synthetic distortion simulators---via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment.

Abstract:
We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. Results are best viewed as videos in the Supplement.

Abstract:
Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrodinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With the Schrodinger Bridge (SB), IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer, avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3~10%, and yields improved high-precision localization, particularly for single-sided forgeries.

Abstract:
Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object's affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we also incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code will be made publicly available.

Abstract:
Current Video Moment Retrieval (VMR) models are trained on videos paired with captions, which are written by annotators after watching the videos. These captions are used as textual queries---which we term caption-based queries. This annotation process induces a visual bias, leading to overly descriptive and fine-grained queries, which significantly differ from the more general search queries that users are likely to employ in practice. In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets---i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search-queries, and (ii) a multi-moment gap, caused by the shift from single moment to multi-moment queries. We also identify a critical issue in these architectures---an active decoder-query collapse---as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries.

Abstract:
Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck---arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input---prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec (CADC) with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization (UGAQ) method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration (ADGIC) method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning (BFATC) method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost. Comprehensive experimental results show that our codec achieves state-of-the-art perceptual quality at ultra-low bitrates.

Abstract:
Diffusion policies have emerged as a powerful paradigm for robot learning, but their inherent multi-modality can lead to a diverse set of plausible--though not always optimal--actions from a single observation. We posit that for a given task, an optimal action exists within this distribution. Inspired by negative prompting in generative models, we introduce a novel method that leverages an error detector to identify out-of-distribution (OOD) execution histories and uses them to construct negative action prompts. This allows our policy to steer away from suboptimal behaviors and converge towards higher-performance actions. We present a comprehensive ablation study demonstrating the effectiveness of positive, and negative prompts, and validate our approach on a suite of simulated benchmarks and real-world robotic tasks. Our results show that the proposed Negative-Prompt-guided Diffusion Policy achieves significant improvement in task performance by effectively filtering undesirable action modes.

Abstract:
We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.

Abstract:
Temporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics by precisely pinpointing the manipulated segments.However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural-semantic perception and diffusion-guided refinement. The structural-semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation durations across varying temporal spans using a designed scale weight allocation network, and (2) semantic perception, which analyzes the semantic consistency within each modality through intra-modal distillation.In this way, it first suppresses low-quality forgery localization proposals, yielding a structurally and semantically reliable candidate set. Then a diffusion-based regression head further iteratively refines the candidates into precise and temporally coherent boundary trajectories.Extensive experiments on multiple TFL benchmarks demonstrate that our method achieves state-of-the-art performance.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) typically utilizes Class Activation Maps (CAMs) to provide the pixel-wise localization. However, CAMs tend to activate only the most discriminative regions, leading to suboptimal WSSS performance. Although existing CAM refinement methods leverage pair-wise relations in affinity to expand the activation regions, these affinities derived from Vision Transformer (ViTs) exhibit a smoothing property, neglecting crucial high-frequency relations and failing to accurately refine object boundaries. In this work, we propose the Dual Frequency-Aware framework (DFA) to address this limitation. Specifically, the Low-Frequency-Aware Alignment (LFAA) generates low-frequency-aware affinity that captures salient semantic relations to enhance object interior semantic consistency on CAMs, while the High-Frequency-Aware Rectification (HFAR) module produces high-frequency-aware affinity that models precise relations to preserve object boundary structure on CAMs. By effectively integrating these two complementary affinities, we design a novel Frequency-Guided (FG) CAM Generation based on Optimal Transport theory, which significantly omits the complex refinement process. Extensive experiments demonstrate that our DFA framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.

Abstract:
Clinical video diagnosis, in which physicians assess dynamic tissue responses across procedural stages, is critical for detecting diseases such as cervical and colorectal cancers. Recent spatiotemporal models map visual progressions directly to diagnostic outputs, yet overlook two hallmarks of expert reasoning: clinically grounded diagnostic principles and hypothesis-driven counterfactual thinking. This gap leads existing methods to conflate causal pathological cues with non-pathological variations, thereby limiting their reliability in data-scarce settings. Here we show that explicitly modeling counterfactual tissue evolution, guided by clinical rules that encode expert diagnostic principles, substantially improves diagnostic robustness-emulating the hypothesis-driven reasoning clinicians employ in practice. We introduce MEDVCR, comprising a Counterfactual Generator (CG) that synthesizes hypothetical tissue transitions via diffusion modeling, a Counterfactual Representation Learning (CRL) module that enforces temporal consistency, pathological separability, and counterfactual alignment, and a Dual Diagnostic Prediction (DDP) strategy that combines video-level context with frame-level counterfactual contrast. On two representative tasks, MEDVCR achieves 93.0% Recall@1 on colposcopy (+10.2%) and 94.8% Average Precision (AP) on colonoscopy (+2.6%), outperforming all state-of-the-art (SOTA) baselines. We anticipate that this counterfactual reasoning paradigm will open new avenues for vision-based clinical diagnostic tasks toward more transparent and interpretable decision support.

Abstract:
Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal to Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. we employ a overlapping sliding windows mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in a row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.

Abstract:
Three-dimensional molecular representation learning has shown great promise in modeling chemical structures and their properties. However, most existing approaches implicitly assume molecules are at or near equilibrium states. This assumption breaks down for non-equilibrium structures--ubiquitous in molecular dynamics (MD) trajectories--where standard coordinate denoising techniques fail because the direct equivalence between denoising scores and atomic forces no longer holds. To bridge this gap, we propose Node Denoising on non-Equilibrium Molecules (NDeM), a novel auxiliary task grounded in a second-order finite difference approximation of the potential energy surface. By explicitly accounting for the non-zero inherent forces in non-equilibrium states, NDeM provides a theoretically sound denoising objective applicable to arbitrary molecular conformations. Crucially, our method is designed as a lightweight, architecture-agnostic plugin that requires no pre-training and can be seamlessly integrated into various supervised learning pipelines. Extensive experiments across diverse benchmarks, including MD17, QM9, and the large-scale OC20 dataset, demonstrate that NDeM consistently improves baseline models, yielding highly competitive performance and validating its robustness across both equilibrium and non-equilibrium regimes.

Abstract:
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation.Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.

Abstract:
We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.

Abstract:
Models such as VGGT and \pi^3 have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and \pi^3 to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and \pi^3 and evaluate across standard pose and point-map benchmarks. Our method achieves substantial inference acceleration across different context lengths, yielding about 2xspeedup at 100 frames, 4--5xat 300 frames, and 8--10xat 800 frames, while matching or slightly improving the accuracy of the original models and remaining robust in extremely dense multi-view settings where prior sparse-attention baselines fail.

Abstract:
We propose a method to reconstruct high-fidelity human avatars from multi-view video that can run on mobile devices. Many works can model high-quality Gaussian-based full-body avatars from multi-view video. However, these methods require heavy computation to obtain pose-dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose-dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is an end-to-end training method without a pretrained model. To make it run on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high-quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.

Abstract:
Real-world visual recognition faces the fundamental challenge of long-tailed distributions. While state-of-the-art methods often employ multi-expert models to address different frequency categories, we find that the mutual knowledge distillation used in these models enhances collaboration at the cost of introducing two critical limitations: indiscriminate knowledge transfer leads to bias propagation, where a single expert's error can spread and contaminate others, and error consolidation, where mutual reinforcement of incorrect predictions solidifies erroneous consensus. To overcome these issues, we propose Trust-calibrated Collaborative Learning (TCL). Our framework introduces the trustworthy knowledge orchestration module, which enables reliable distillation and precise collaboration through a knowledge quality gate that blocks erroneous information and a tail-class compensation mechanism that alleviates knowledge scarcity for tail categories. Furthermore, we design a consensus error calibration module that suppresses consensus high-confidence negative classes to correct collective misjudgments and steer optimization in the right direction. Extensive experiments on five long-tailed benchmarks demonstrate that TCL achieves the best performance, raising Top-1 accuracy on CIFAR100-LT to 58.7%, a gain of 2.4% over previous SOTA methods.

Abstract:
The proliferation of low-power intelligent processors with integrated Neural Processing Units (NPUs), called mNPUs, has created new opportunities for on-device generative AI, benefitting end devices like smart wearables and small robots. However, deploying Vision-Language Models (VLMs) on mNPUs is severely hindered by stringent memory constraints and limited operator support. To bridge this critical gap, we propose mVLM, the first lightweight-oriented VLM architecture designed for mNPUs. It is comprised of our proposed OverMod encoder and AttSSM decoder. OverMod is a lightweight dynamic convolutional network inspired by biomimetic vision, incorporating our novel Global Spatial Modulation mechanism to enable adaptive, high-fidelity feature extraction using only NPU-friendly operators. AttSSM leverages a highly efficient State Space Model (SSM) core, augmented with multi-scale feature fusion and Global Context Dynamic Modulation mechanism, to perform robust sequential modeling. Furthermore, we introduce a coordinated full-parameter quantization strategy that preserves precision across the encoder-decoder boundary, alongside hand-optimized operators for unsupported modules like SSMs. mVLM achieves a competitive CIDEr score of 117.8 on the COCO Karpathy test split and, for the first time, demonstrates the feasibility of millisecond-level VLM inference on a mNPU platform.

Abstract:
Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.

Abstract:
Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.

Abstract:
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP is accomplished with up to 10xfaster inference speeds than previous method while increasing the total number of parameters by only 0.3%, and achieves an 1.79 FID score on ImageNet 256x256.

Abstract:
Cross-degradation generalization remains a critical challenge for RGB-infrared multimodal object detection, especially when training data covers limited degradation types. This paper presents a distribution alignment framework with a key insight: aligning fused features to the pretrained distribution where the frozen detector performs optimally, rather than adapting to training-specific degradations. By freezing the pretrained detector and training only a lightweight fusion module, our approach leverages complementary infrared information to reduce distribution shift while maintaining computational efficiency. The method achieves state-of-the-art results on three benchmarks with 4x faster training. Critically, we demonstrate that aligning to the pretrained distribution substantially outperforms aligning to training degradations when generalizing to unseen scenarios.

Abstract:
Diffusion-based video super-resolution (VSR) achieves remarkable fidelity but suffers from prohibitive sampling cost. While distribution matching distillation (DMD) accelerates diffusion models to one-step generation, directly applying it to VSR leads to training instability and degraded, insufficient supervision.To address these issues, we propose DUO-VSR, a three-stage framework centered on a DUal-Stream Distillation strategy that integrates distribution matching and adversarial supervision for One-step VSR.We first adopt a Progressive Guided Distillation Initialization to stabilize subsequent training through trajectory-preserving distillation.We then introduce a Dual-Stream Distillation Strategy to jointly optimize DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision using features from both real and fake score models.Finally, a Preference-Guided Refinement aligns the student with perceptual quality preferences.Comprehensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR methods.

Abstract:
In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HiFICL) to more faithfully model the ICL mechanism. HiFICL consists of three key components: 1) a set of "virtual key-value pairs" to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks.

Abstract:
Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.

Abstract:
Embedding invisible and temporally consistent watermarks into dynamic 4D Gaussian Splatting (4DGS) models poses unique challenges due to continuous spatio-temporal deformation of Gaussians and diverse motion dynamics. Existing 3DGS watermarking methods, which directly fine-tune parameters within Gaussian splats, fail to preserve geometric fidelity and temporal coherence when extended to 4D settings. In this regime, we propose Mark4D, a temporally consistent watermarking technique that achieves robust, imperceptible, and motion-aware watermark embedding for 4DGS. Mark4D comprises 1) a decoder for watermark recovery in the latent video-text space for robustness against pixel-level distortions, 2) trajectory-aligned offsets that embed watermark signals along Gaussian motion paths to preserve geometry, and 3) a motion-adaptive loss weighting strategy that balances supervision across frames with varying motion intensities. Extensive experiments on synthetic and real-world dynamic scene datasets demonstrate the effectiveness of our method, establishing a foundation for secure and reliable protection of dynamic 4D scene assets.

Abstract:
3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated spike cameras, a bio-inspired sensor with high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency. To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from unposed captures of bio-inspired high-temporal-resolution spike cameras. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its application to keyframing, effectively alleviating the instability caused by spike streams. Building upon this foundation, we devise a progressive optimization framework to facilitate swift 3D reconstruction. The experimental results demonstrate that our method achieves up to 7.4 dB higher PSNR and 40% lower Absolute Trajectory Error (ATE) compared to state-of-the-art methods under challenging high-speed scenarios while maintaining the fastest reconstruction speed among spike-based methods.

Abstract:
Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.

Abstract:
Deep neural networks have achieved strong performance in fundus image classification, yet their flattened feature representations are often highly redundant. Such redundancy can lead to poor generalization across imaging devices, reduced interpretability, and inefficient use of model capacity. To address this issue, this study proposes a post-training feature pruning framework, termed greedy feature pruning (GFP) , which removes weak or redundant dimensions from the flattened features of trained backbones. GFP employs a greedy build-up process guided by performance metrics on the training set, constrained by a minimum feature keeping ratio, to identify compact yet discriminative subsets of features. Experiments are conducted on five public fundus datasets covering multiple tasks, including diabetic retinopathy detection (DDR, Messidor-2), glaucoma detection (PAPILA), multi-label classification (ODIR) and multi-class retinal disease classification (RETINA), using EfficientNetV2, ViT, and CoAtNet as backbones. Results show that GFP consistently improves AUROC and AUPRC across datasets while reducing the number of flattened features by up to 96%. Feature visualizations and quantitative analyses confirm that GFP enhances the compactness and separability of latent features. Moreover, cross-dataset evaluation demonstrates that GFP improves transferability between datasets, indicating better domain robustness. Overall, the proposed GFP framework provides a simple yet effective approach for compressing feature representations and improving both discriminability and generalization in fundus image classification.

Abstract:
Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remains largely unknown. We conduct a large scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects, reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.

Abstract:
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.

Abstract:
Masked visual modeling is a self-supervised learning task that does not use visual annotations. It aims to learn discriminative representations via a mask-reconstruction task. A single mask ratio in reconstruction may fail to capture complex semantics, which motivates dynamic masking strategies. In this work, we propose Progressive Mask Distillation (PMD), which utilizes dynamic mask ratios to facilitate progressive semantic learning from easy to hard. PMD integrates three key components: a progressive student distiller, a difficulty-aware region enhancer, and a cross-layer feature aligner. First, to capture dynamic visual semantics, we design a progressive student distiller that trains multiple student models with progressively increasing mask ratios. The early-phase student (with a low mask ratio) learns easy, low-level semantics from more visible tokens. This learned knowledge then guides the next-phase student (with a higher mask ratio) to capture hard, high-level semantics from fewer visible tokens. This progressive distillation mechanism enhances detail reconstruction at a high mask ratio. Second, to alleviate insufficient learning of semantic regions, we design a difficulty-aware region enhancer. We first smooth the region reconstruction loss to reduce large fluctuations across training epochs. The smoothed loss is then used to learn region-level weights, prioritizing accurate learning of regions with large reconstruction losses. Third, to further bridge the semantic gap across network layers, we design cross-layer feature alignment. This module aligns features across shallow, middle, and deep encoder layers, ensuring that shallow-layer features incorporate semantic information from deeper layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Something-Something V2, Kinetics-400, UCF-101, and HMDB-51 datasets.

Abstract:
Reducing communication overhead in federated learning (FL) is challenging but crucial for large-scale distributed privacy-preserving machine learning. Unfortunately, directly compressing model updates often leads to sub-optimal convergence due to information loss, while increasing local computation can cause model divergence. Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We observe that the entire gradient information from local training can be effectively reconstructed from a compact, image-like representation. Based on this observation, we propose a novel approach, OS-Fed, which performs One-Shot Federated Learning by transmitting only a single, compact snapshot (comprising an image and a set of learnable labels) per round. To realize this approach, OS-Fed presents new snapshot synthesis techniques to (1) target the accumulated update of a trajectory segment to tackle gradient noise, (2) design a multi-grid snapshot that decouples conflicting gradient directions, and (3) incorporate error compensation to maintain training stability under extreme compression. Extensive experiments on CV and NLP benchmarks show that OS-Fed reduces communication costs by 1.5-16xcompared to state-of-the-art algorithms , resulting in 18-45% faster convergence.

Abstract:
Real-time video generation demands fast decoding as much as fast denoising, yet current latent video diffusion models rely on 3D convolutional decoders that are slow and memory-intensive at high resolutions or for long video. We introduce FlashDecoder, a fast, memory-efficient pure-Transformer video decoder that decodes latents to pixels frame by frame. At each step, the current frame attends only to a fixed-size window of past frames through a rolling KV cache. The fixed temporal window keeps decoding fast and memory bounded regardless of video length, enabling constant-latency streaming. Because frames are processed sequentially, temporal causality is enforced without explicit attention masks, enabling training at resolutions up to 1080p and matching the reconstruction quality of convolutional decoders. On the Wan2.1 and Wan2.2 latent spaces, FlashDecoder matches each convolutional decoder in reconstruction quality (e.g., 41.55 vs. 41.49 dB PSNR at 1080p) while decoding 3.6x-4.7x faster with up to 11x less memory on a single H100 GPU. With architecture-aware inference optimizations, the speedup widens to 12x.

Abstract:
Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field.From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analysis yet to be systematically explored.Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories. This work presents 8 in-depth analyses from multiple perspectives and identifying 4 novel findings that offer valuable insights for the field. We hope this work provides a solid foundation for advancing the field of AI-generated video detection.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual reasoning tasks, serving as the core perception engines for emerging AI agents like OpenClaw. While recent studies have introduced several effective image-based jailbreak methods, the vulnerabilities inherent in the video modality remain a largely unexplored frontier. As a pioneering effort to bridge this critical safety gap, we demonstrate that video-driven jailbreak attacks are significantly more effective and robust against pre-defined system prompts than their image-based counterparts. Specifically, we find that simply repeating a harmful image across multiple frames to construct a video can bypass the safety mechanisms of MLLMs. Our analysis reveals that unsafe videos are embedded more similarly to safe videos in the model's representation space than individual harmful images, making them harder to detect. Moreover, videos composed of identical frames are processed more like static images and are more likely to trigger safety defenses compared to videos with diverse frames. Motivated by these findings, we propose an algorithm that injects harmful content into typographic videos by interleaving it with diverse, safety-proximal frames, thereby evading MLLM safety alignment. Extensive experiments demonstrate that our approach achieves state-of-the-art jailbreak performance on several widely-used MLLMs (e.g., VideoLLaMA-2, Qwen2.5-VL, GPT-4.1, and Gemini-2.5) under 16 different safety policies.

Abstract:
Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, \methodname achieves substantial speedups -- up to 5x -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.

Abstract:
Large-scale floorplan generation is critical for virtual space planning and architectural simulation. Although existing methods have shown success in generating small-scale floorplans with simple room shapes, they struggle to handle complex room connections and irregular room shapes that arise in large-scale floorplans. In this paper, we propose CG-Floor, a centroid-guided hierarchical framework that explicitly decouples room position and shape generation to address these issues. We first introduce the size-aware semantic centroid heatmap, derived from predicted room centroids and sizes, which provides a structured representation to guide the effective generation of a coarse-to-fine floorplan generator while ensuring semantic alignment. Additionally, we train a vector quantized codebook of floorplans with complex room shapes to capture the diversity of room shapes and employ a latent diffusion transformer to generate large-scale floorplans featuring non-Manhattan room shapes. CG-Floor achieves state-of-the-art performance on the large-scale MSD dataset, and supports 3D floorplan conversion and editing, demonstrating the practicality of our approach. The code is available at https://cic.tju.edu.cn/faculty/likun/projects/CG-Floor.

Abstract:
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, small vs. large scales all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only 46% and 36% accuracy, trailing human performance by near 40% and 50% respectively. Surprisingly, neither explicit reasoning nor model scaling yield consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance.

Abstract:
The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful, cross-generator evaluation, we further introduce MMDF, a new multi-modal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by +13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection. The code and dataset will soon be available.

Abstract:
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing.

Abstract:
Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing supposed misalignment is unnecessary for best performance.

Abstract:
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context--a caption, or an action description--and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.

Abstract:
3D reconstruction is to recover 3D signals from the sampled discrete 2D pixels, with the goal to converge continuous 3D spaces.In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge.Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as the low pass filters to isolate the baseband spectrum.However, their unideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal's spectrum.To this end, we introduce Jinc kernel with an instantaneous drop to zero magnitude exactly at the cutoff frequency, which is corresponding to the ideal low pass filters.As Jinc kernel suffers from low decay speed in the spatial domain, we further propose modulated kernels to strick an effective balance, and achieves superior rendering performance by reconciling spatial efficiency and frequency-domain fidelity.Experimental results have demonstrated the effectiveness of our Jinc and modulated kernels.

Abstract:
Adversarial examples expose fundamental vulnerabilities within deep neural networks, and their transferability highlights shared weaknesses across diverse models. Existing mainstream attack methods often rely on iterative processes with various strategies to improve transferability, but the limited knowledge of the target model restricts the success of these approaches. In this paper, we reveal that the iterative optimization process tends to over-specialize adversarial perturbations to the local gradient characteristics of the surrogate model, thereby hindering their transferability to other models. To address this limitation, we propose a novel attack method called Local Perturbation Augmentation Attack (LPAA). The key innovation of our approach lies in constructing multiple augmented local subspaces during each iteration, which steers perturbation updates towards a more generalizable direction, reducing over-reliance on the surrogate model. Additionally, to improve the initial performance and overcome sensitivity to initial perturbation, we introduce a dedicated perturbation initialization strategy that ensures the optimization process starts from a direction with greater transferability. Compared with existing random neighborhood sampling strategies, LPAA serves as an effective approach that leverages the characteristics of perturbations to overcome their limitations. Extensive experiments on CNNs and ViTs demonstrate that LPAA consistently generates highly transferable adversarial examples, significantly surpassing the performance of state-of-the-art methods.

Abstract:
Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pretrained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into LLM backbones at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters.

Abstract:
We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF 360 while preserving 3DGS-level rendering speed.

Abstract:
Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

Abstract:
Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale l to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization.

Abstract:
State-of-the-art point trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose CoWTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current track estimate. Combined with a transformer that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.

Abstract:
As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers--optimized for bit resilience under perturbation and bitrate budgets--and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness-quality trade-off compared with prior methods. In addition, the decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.24%.

Abstract:
Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video flow model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.

Abstract:
Multi-view learning fuses complementary views to improve perception, but real-world deployments often suffer from Test-time Noisy Correspondence (TNC) -- cross-view misalignment caused by asynchronous sampling, transient network congestion, or other disturbances. Such misalignment introduces semantic inconsistency and significantly degrades performance. Existing remedies typically estimate view-specific reliability from clean, well-aligned training data and then extrapolate to noisy fusion at inference, resulting in a train-test task gap and reduced robustness against TNC. To bridge this gap, we propose \underline \textcolor red B ootstrapping \underline \textcolor red M ulti-view \underline \textcolor red L earning (BML) -- a plug-and-play framework that explicitly learns to fuse under TNC. Specifically, BML performs in-place TNC bootstrapping to construct a controllable noise-augmented training set that simulates realistic correspondence distortion, thereby eliminating the task gap without external data. Unlike prior uncertainty-based approaches that model reliability in an unsupervised manner, BML presents a reveal-supervised paradigm, wherein a lightweight estimator jointly models intra-view predictive uncertainty (view quality) and inter-view prediction discrepancy (correspondence consistency) to produce calibrated reliability weights guided by both task objectives and bootstrapped supervision. Once deployed, these reliability weights directly modulate fusion, suppressing corrupted views while preserving informative ones. Across 11 benchmarks spanning diverse noise ratios, BML consistently outperforms state-of-the-art baselines and maintains robustness against TNC.

Abstract:
While model merging has demonstrated remarkable success for large language models (LLMs), its application to vision-language models (VLMs) remains largely underexplored. Recent methods attempt to enhance VLM reasoning capabilities by integrating specialized LLM parameters through layer-wise merging. However, existing paradigms suffer from two critical limitations: (1) strict positional correspondence, enforcing rigid one-to-one layer alignment, and (2) uniform merging weights applied across all layers. These constraints fail to account for substantial functional disparities between VLM and LLM layers, potentially misaligning incompatible layers and leading to detrimental parameter combinations. To address these, we propose Chain-of-Merging (CoM), a framework that adaptively adjusts merging plans for different images and questions, comprising two key stages: (1) Adaptive Layer Matching, which identifies optimal layer pairings based on structural and semantic matching scores while filtering incompatible pairings, and (2) Dynamic Weight Merging, which determines layer-specific merging weights based on matching scores and employs spherical linear interpolation to minimize memory overhead. Extensive experiments demonstrate that CoM achieves substantial performance improvements, with Qwen2.5-VL-7B + Qwen2.5-Math-7B attaining a 4.4% average improvement on mathematical reasoning benchmarks while enhancing general visual understanding, significantly outperforming existing training-free methods.

Abstract:
Deep learning-based gaze estimation methods tend to suffer from substantial performance drop in real-world scenarios with varying users and environments. To tackle this issue, most recent approaches employ Unsupervised Domain Adaptation (UDA) to bridge the gap between source and target domains. However, this paradigm is misaligned with real-world scenarios, where the system typically needs to adapt to only a single new user. Therefore, this paper advocates a more practical paradigm: Unsupervised Personal Adaptation (UPA), which calibrates a pre-trained model using a few unlabeled images from a single new user. Conventional UDA methods do not guarantee improvements for every user and often yield lower average performance in this setting. To address this problem, we propose Render-to-Adapt (R2A), a self-supervised framework specifically designed for the UPA task. Given a pretrained gaze model, R2A utilizes a gaze-conditioned renderer to synthesize new images based on the model's gaze predictions, and enforces eye-region consistency as a label-free signal to enhance personalized gaze estimation. We evaluate R2A on a re-designed cross-dataset personal adaptation benchmark. Experimental results show that R2A consistently improves performance across all individuals and significantly outperforms existing SOTA methods.

Abstract:
Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization. The code for this work will be released at https//github.com/MPI-Lab/ViHOI.

Abstract:
Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Pade(TC-Pade) approximation, a feature prediction framework grounded in Pade approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Pade incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Pade. For instance, TC-Pade achieves 2.88xacceleration on FLUX.1-dev and 1.72xon Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.

Abstract:
While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.

Abstract:
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of Latent World Representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.

Abstract:
Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can supercharge geometrically-correct learning and optimisation pipelines. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voroni tesselation. Our code, pre-trained models, and pip-installable library will be made available upon publication.

Abstract:
Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101 and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

Abstract:
Federated Learning (FL) serves as a prominent distributed training paradigm, enabling devices to collaboratively train a shared model with local data. However, in practice, clients generally possess data from multiple distributional domains, posing significant challenges to efficiency and generalization. In this paper, we propose a domain-sensitive federated pruning framework, FedFIP, that preserves domain-invariant structures while retaining domain-specific representations. The Domain-Sensitive Fisher Pruning (DSFP) module estimates channel importance per domain via Fisher information, and uploads this signal to the server to obtain a globally shared pruning mask. Given local domain heterogeneity, each client reuses its Fisher information to selectively reactivate domain-specific channels, yielding personalized sparse models that remain structurally aligned yet adapt to local heterogeneity. To further enhance performance, we adopt a Domain-Sensitive Regularization (DSR) module: the server builds domain prototypes from uploaded importance signals and broadcasts them back. Guided by the domain prototypes, we adopt a structure-contrastive loss to strengthen intra-domain consistency and inter-domain discriminability. We propose a structure-aware aggregation algorithm that fuses heterogeneous personalized architectures into a domain-generalized global model. Experiments on multi-domain benchmarks demonstrate that FedFIP surpasses state-of-the-art FL baselines while substantially shrinking model size.

Abstract:
Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency.Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across what and where to share is the key to achieving efficient collaborative perception.

Abstract:
Multi-view contrastive clustering has emerged as a powerful paradigm for learning comprehensive representations from heterogeneous data sources. However, prevailing approaches typically overlook the intrinsic geometric and clustering structures, rendering them structure-agnostic. In this paper, we propose a novel framework that performs Multi-Hierarchical Contrastive Spectral Fusion (MCSF) to address these limitations. MCSF integrates deep spectral embedding into the encoder to preserve local manifold structure, guiding the learned representations to be clustering-friendly. To enhance cross-view consistency, MCSF introduces a multi-hierarchical contrastive loss jointly optimizing (1) view-specific structure preservation, (2) view-consensus alignment, and (3) consensus structure refinement. This mechanism enables the construction of an accurate and semantically consistent consensus representation, effectively fusing multi-view information and uncovering authentic cluster structures. Extensive experiments on benchmarks validate the effectiveness of multi-hierarchical contrastive spectral fusion in clustering accuracy and representation quality.

Abstract:
With diffusion models demonstrating superior capabilities in image editing, more users now rely on online servers for content manipulation via textual prompts rather than traditional offline tools. Despite servers attempting to prevent the proliferation of maliciously edited content via active defense like watermarking, this approach is not conducive to passive detection by third-party forensics. To address this limitation, we propose a plug-and-play controllable denoising termed Forensic-Friendly Image Manipulation (FFIM), which simultaneously satisfies user editing requirements while facilitating forensic analysis. Specifically, FFIM comprises three phases: Controllable Projection, Implicit Detection, and Explicit Guidance. Phase I enforces orthogonality between the variance of random noise and image features to ensure clear demarcation between the edited and unedited regions. Phase II implicitly evaluates whether this demarcation meets detection requirements; if not, Phase III explicitly introduces a surrogate detection model and adversarially adjusts the random noise to maximize the feature discrepancy between these regions. Experiments across four datasets demonstrate the superiority of FFIM over baseline methods, achieving up to +6.6% F1 in pixel-level localization and +27.3% AUC in image-level detection. Importantly, these forensic gains are attained without compromising visual quality, as evidenced by comparable manipulation in both subjective user studies and objective quality assessments. We envision that the proposed method will be widely adopted by generative AI service providers, enabling more comprehensive information authenticity from a passive defensive standpoint.

Abstract:
Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. Then it is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource-constrained environments to enable real-time reasoning segmentation.

Abstract:
Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA-Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.

Abstract:
Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (i.e., opening vs. closing bottle).To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence, that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer.

Abstract:
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.

Abstract:
We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. Our VGG-T^3 (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a 1k image collection in just 54 seconds, achieving a 11.6xspeed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction error outperforming other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.

Abstract:
3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction.

Abstract:
We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image.As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules.Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses.Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.

Abstract:
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which dynamically selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.

Abstract:
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action govern the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.

Abstract:
Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.

Abstract:
Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot-attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

Abstract:
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model will be released upon acceptance.

Abstract:
We propose Tea-Adapter, a plug-and-play adapter designed to efficiently integrate conditional knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face critical challenges: full fine-tuning of billion-parameter models is extremely expensive. At the same time, ControlNets introduce substantial parameter overhead and exhibit limited flexibility for novel multi-condition compositions. To overcome these issues, Tea-Adapter introduces a novel reverse distillation method that enables large video diffusion models to inherit precise control capabilities from smaller, efficiently tuned teacher diffusion models, eliminating the need for full fine-tuning. Moreover, recognizing the intrinsic relationships between different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer. This structure dynamically routes diverse conditional inputs within a unified architecture, supporting both single-condition control and multiple condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module to ensure efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Tea-Adapter enables high-fidelity, multi-condition video synthesis, making advanced, controllable video generation feasible on low-resource hardware.

Abstract:
Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference.In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.

Abstract:
Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object's landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D lifting that rivals or exceeds recent 3D-supervised approaches, revealing that landmark-based lifting remains a powerful and under-exploited paradigm for 3D understanding.

Abstract:
Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.

Abstract:
Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.

Abstract:
Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides , 3D meshes , LEGO sequences , and indoor layouts , suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

Abstract:
Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.

Abstract:
Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, Floorplan Markup Language Model (FMLM), capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) has shown promise for GUI automation, enabling agents to learn from binary task completion signals. However, when task difficulty exceeds model capacity, on-policy exploration fails to discover correct actions, creating zero-advantage traps that eliminate learning signals. While incorporating off-policy expert demonstrations seems intuitive, it causes persistent high-entropy states due to distribution mismatch, disrupting effective learning. We propose GUI-SAGE, a self-explanation framework that generates in-distribution reasoning trajectories for GUI automation. By conditioning on ground-truth actions, our method produces in-distribution guidance that avoids the confusion caused by out-of-distribution expert demonstrations. We further introduce Entropy-Modulated Credit Assignment, which recalibrates learning weights by jointly considering prediction confidence and reward signals, enabling amplified updates for confident correct actions and attenuated updates for uncertain explorations. Extensive experiments on AndroidControl and GUI-Odyssey demonstrate that GUI-SAGE-3B achieves competitive performance with 81.1% success rate, substantially outperforming existing methods. Our analysis validates that self-explanations maintain stable learning dynamics while expert demonstrations cause entropy collapse, and that entropy modulation provides the largest improvements on in-distribution samples.

Abstract:
Recovering 3D human motion from monocular videos in-the-wild remains challenging due to occlusions, rapid movements, and viewpoint variations. To address these challenges, we introduce Recover-Anyone Module (RAM), a unified framework for real-time and accurate 3D human motion reconstruction. RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, demonstrate that RAM substantially outperforms previous state-of-the-art in both Zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.

Abstract:
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.

Abstract:
The inherent semantic ambiguity of "many-to-many", where one video matches multiple texts and vice versa, aggravates the difficulty in text-video retrieval. The dominant deterministic embeddings only struggle to capture the mean semantics, while existing probabilistic methods fail to distinguish hard negatives for their imposing rigid uncertainty priors or ignoring the interaction between similarity and uncertainty. To this end, we propose a novel physics-inspired framework (GraviAlign) that decomposes the alignment of cross-modal semantic distributions into two orthogonal factors inspired by the Gravitational Force: (1) Semantic Attraction measuring gravitational alignment between distribution centers via uncertainty-derived "semantic mass" and "semantic distance"; (2) Geometric Overlap quantifying distribution intersection. Each factor has independent veto power to reject those matches with misalignment or poor overlap. Additionally, GraviAlign offers an efficient (O(D)), theoretically grounded alternative to intractable joint integrals. Extensive experiments on DiDeMo, MSR-VTT, and ActivityNet demonstrate our effectiveness and superiority, and solid ablation studies confirm the indispensability of two novel components.

Abstract:
Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose a communication-efficient early collaborative perception framework that incorporates LiDAR completion to restore scene completeness under sparse transmission, dubbed as CoLC. Specifically, the CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings.

Abstract:
We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding.Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization.Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.

Abstract:
Modern multi-label image classification models suffer from a critical reliance on spurious correlations, failing to learn the underlying causal mechanisms. Many causality-inspired methods are impractical, demanding box-level supervision that is rarely available in real-world datasets. Others rely on static confounder dictionaries, which are inherently inflexible and fail to capture complex biases or adapt to feature space changes during training. To address this, we present prototype-based causal intervention (ProCI), a novel framework that approximates the backdoor adjustment using only image-level supervision. It models confounders as learnable contextual prototypes engineered to represent class-wise co-occurring bias. These prototypes are learned dynamically within a stable memory and leveraged to construct sample-specific bias vectors for an adaptive feature adjustment, effectively counteracting spurious correlations. Experiments on MS-COCO, Pascal VOC, COCO-Stuff, and the challenging Sewer-ML dataset validate our approach. ProCI achieves competitive performance on standard benchmarks while setting a new state-of-the-art on the highly-confounded Sewer-ML. It outperforms the previous best model by a remarkable +5.44 points on the F2-CIW metric. These results demonstrate the effectiveness of our approach in mitigating complex real-world biases using only image-level supervision.

Abstract:
The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.

Abstract:
Vision-language pre-training (VLP) has achieved remarkable performance across diverse multimodal learning (MML) tasks. Recently, many efforts have focused on reconstructing missing modalities to improve the adaptability of VLP models in incomplete MML scenarios. However, these approaches overlook the learning imbalance under severe missing-modality conditions, i.e., the optimization process is dominated by reconstructed samples, thereby weakening complete-sample representations. In this paper, we propose a novel ANchor-guided Gradient Alignment (ANGA) framework to address this issue. Specifically, we first retrieve similar instances to reconstruct the missing modalities, thereby alleviating information deficiency. We then introduce an entropy-driven curriculum that progressively incorporates reliable reconstructed samples together with complete ones to form an optimization anchor, which guides gradient alignment to mitigate learning imbalance. Furthermore, we design a semantic-enhanced adapter that leverages the retrieved instances to generate dynamic prompts, further enhancing the robustness of the VLP model. Extensive experiments on widely used datasets demonstrate the superiority of ANGA over state-of-the-art (SOTA) baselines across various missing-modality scenarios. The code is available at this repository.

Abstract:
Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content-- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis.

Abstract:
All-in-one image restoration (AiOIR) methods have made remarkable progress in handling diverse degradations. However, their performance often deteriorates when the test distribution deviates from the training distribution. Exploring test-time adaptation for AiOIR is therefore crucial. To adapt a pre-trained AiOIR model to unseen degradation distributions without access to source data or retraining, two key challenges must be addressed: designing reliable pseudo-supervision and stabilizing adaptation. Observing that multiple degraded versions of the same scene should map to a consistent clean image, we propose Degradation-Consistent Test-Time Adaptation (DCTTA). DCTTA comprises three core components: (1) test-time redegradation generation, which leverages a diffusion-based generator to construct pseudo degraded-clean pairs for distribution alignment; (2) degradation-guided image restoration, which enforces domain adaptation via self-supervised consistency loss; and (3) test-time important parameter selection, which selectively updates degradation-sensitive parameters to ensure stable adaptation while preserving pre-trained knowledge. Extensive experiments across multiple tasks and challenging domain shifts demonstrate that DCTTA consistently outperforms state-of-the-art AiOIR baselines, achieving up to +4.57 dB PSNR improvement on the Rain100H dataset.

Abstract:
Test-Time Adaptive Segmentation (TTA-Seg) aims to adapt a trained segmentation model to test data under distribution shift in an unsupervised manner. Existing approaches typically utilize class-wise prototypes to capture and transfer the source distribution, but inevitably neglect the diversity within source samples. In this paper, we propose a new test-time adaptation paradigm based on the mixture-of-experts (MoE), where domain experts are designed to 1) better capture the source distribution, and 2) dynamically adjust their contribution in test case prediction. Specifically, during source training, prototypes are derived as the class-wise average for source pixel features. We then generate multiple experts through clustering these prototypes, providing each class with several experts with enhanced representativeness. At test time, each instance prediction is drawn from all experts' knowledge in an adaptive manner, i.e., a gating network assigns weights according to instance-expert correlations. To optimize the system, we devise a min-max entropy optimization scheme for the gating network but keeping the rest frozen, minimizing the entropy of model prediction but maximizing the entropy in expert selection. Consequently, the model is urged to derive confident predictions with effective utilization of domain experts, hence promoting the adaptation.Experiments on two scenarios, Test-time Adaptation (TTA) and the more challenging continual TTA, demonstrate that our approach achieves the new state-of-the-art performance.

Abstract:
Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios.In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts.We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed within its value range.Our method requires no additional training stages and produces Gaussian hierarchies within training time comparable to classical 3DGS. Experiments on multiple datasets show that our approach achieves performance comparable to or surpassing state-of-the-art methods in both LoD rendering and model pruning.

Abstract:
Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures.While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images.To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker's visual representation and their utterances without introducing trainable parameters or architectural changes.We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results.Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning.

Abstract:
YOLO series lead object detection with superior accuracy and speed. However, both convolutional and self-attention based architectures suffer from parameter redundancy and insufficient computational efficiency. Existing lightweight methods excessively pursue speed while ignoring the loss of important information during feature extraction and spatial transformation across different stages. Thus, effective lightweighting is crucial for detection performance. We propose YOLO-ULM, an ultra-lightweight real-time detector that achieves accelerated inference while preserving high accuracy. We innovatively design a variety of dual efficiency- and accuracy-driven modules, including efficient feature aggregation and parallel downsampling modules, as well as a more focused complete-IoU loss function. To validate our approach, we train it from scratch on COCO dataset without pretrained weights. By refining backbone parameters, we extend it to YOLO-ULM-Turbo for accelerated inference. YOLO-ULM surpasses state-of-the-art real-time detectors like YOLOv11/YOLOv12/YOLOv13 and RT-DETR. On a T4 GPU, YOLO-ULM-N achieves 41.6% mAP with an inference latency of 1.52 ms, outperforming YOLOv11-N (2.2%\uparrow) and YOLOv12-N (1.0%\uparrow). YOLO-ULM-S exceeds RT-DETR-R18 by 1.6% mAP with 64.7% fewer FLOPs and 63% fewer parameters. YOLO-ULM-L / X surpass YOLOv13-L / X by 0.7% and 0.8% respectively in mAP. YOLO-ULM-Turbo matches YOLOv12-Turbo's performance but uses less computation, with Turbo-N variant achieving 0.3% higher mAP and 16% fewer parameters than YOLOv12-Turbo-N.

Abstract:
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M^3KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M^3KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M^3KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.

Abstract:
Despite great progress, existing multimodal large language models (MLLMs) are still inferior in visual-spatial reasoning, which greatly impedes their trustworthy applications in scenarios such as Embodied AI. To facilitate the research, we propose a new MLLM task in this paper, called Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to improve the visual-spatial reasoning capabilities of MLLMs via recognizing and grounding the relevant visual cues step by step, which are also supported by step-vise grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a benchmark called multimodal grounded chain-of-thought (MM-GCoT). Besides, a comprehensive consistency evaluation system is also introduced, including the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a bunch of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs performs poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance; iii. a larger and stronger MLLM is not less affected by this issue.

Abstract:
Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabling flexible part completion and fine geometric details generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both the exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments conducted on various benchmarks demonstrate that EI-Part efficiently yields semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level generation compared to existing methods.

Abstract:
It is extremely hard for an underwater agent to know where it is. Satellite signals disappear within centimeters of the surface; acoustic baselines require heavy infrastructure to instrument small regions. The polarization of the sky, visible underwater, reveals the elevation of the sun. The pattern of elevation over the day reveals location to an agent with a clock. However, recovering elevation from polarization images is very difficult. State-of-the-art (SOTA) geolocalization methods can localize well for locations where they have seen data, but accuracy collapses when the data comes from a new location. Our physics-guided synthesis pipeline expands a huge library of polarization images from a small set of sites to 2.8 million solar-elevation-matched training sequences spanning latitudes, seasons, and water types. A compact two-stage transformer reconstructs the solar-elevation curve and predicts geolocation. Under leave-one-site-out tests, the site averaged median geodesic error is 500km--about an eightfold improvement over previous deep-learning baselines; with limited target-site data, the median error contracts to single-digit kilometers.

Abstract:
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (ie, understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (ie, world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi-0.5) and real-world scenarios(improved by +11 48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

Abstract:
LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10xhigher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.

Abstract:
We present Scalable Pixel-anchored End-to-end Diffusion (SpeeDiff), a latent diffusion method that jointly trains the VAE and the diffusion model from scratch. In principle, joint training allows the diffusion loss gradient to directly guide the VAE encoder, encouraging the formation of a generation-friendly latent space and potentially yielding faster convergence than the conventional two-stage approach with a pretrained frozen VAE. However, a naive end-to-end implementation severely degrades performance, as unrestricted backpropagation of the diffusion loss leads to latent space collapse. Our main technical contribution is a simple yet effective Tweedie Pixel Reconstruction (TPR) loss, which provides additional pixel-level feedback by decoding a predicted clean latent from an intermediate noisy state using Tweedie's formula, thereby alleviating collapse. Furthermore, our method enables jointly scaling a fully transformer-based architecture and enhances representation alignment within the end-to-end framework. Our SpeeDiff-XL model achieves over 140x and 61x faster training compared to Vanilla SiT and REPA, respectively, while attaining an FID of 1.50 without guidance on ImageNet 256x256 generation. With a more efficient 32x compressed VAE, our model further reaches an FID of 1.53 without guidance on ImageNet 512x512 generation.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings.

Abstract:
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency. We will open-source all models.

Abstract:
While neural video codecs (NVCs) have recently demonstrated superior performance over traditional codecs through end-to-end learning, existing approaches primarily focus on architectural enhancements and coding module design, with limited exploration into optimizing hierarchical structures--specifically, quality and reference configurations. Current hierarchical structure optimization methods face two major limitations: (1) insufficient content-adaptive optimization, and (2) disjointed handling of quality and reference structures. To overcome these challenges, we propose a novel NVC framework that introduces content-adaptive hierarchical structure optimization through a hierarchical hyperprior derived from the current frame. Our NVC integrates two key components: (1) a hierarchical hyperprior extracted from the original frame to enable content-aware adaptation of the hierarchical structure; and (2) an adaptor within the hierarchical hyperprior codec combined with a dual-reference scheme, guided by the hyperprior, to jointly optimize quality and reference structures. By leveraging this content-adaptive hierarchical structure, our NVC achieves state-of-the-art rate-distortion performance, outperforming the previous leading NVC method DCVC-FM with BD-rate reductions of 15.51% and 12.20% relative to VTM-23.4 low-delay B (LDB) under intra-period settings of -1 and 32, respectively.

Abstract:
Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in \texttt [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the \texttt [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components (q, k, v), unlike prior work that uses only key features or the \texttt [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

Abstract:
With their high dynamic range and temporal resolution, event cameras are well-suited for object detection, especially under motion blur and extreme illumination. Recent state-of-the-art works for event-based object detection primarily focus on the high-level design of backbones. However, developing effective event representations is equally crucial, as it bridges asynchronous event streams with the dense tensors required by detection networks. Most existing aggregation strategies for event representation continuously accumulate all events within sampled intervals without selective filtering, inevitably introducing uninformative events that degrade detection accuracy. To address this limitation, we introduce a novel perspective, termed Discrete Aggregation, which adaptively and discretely selects informative events for differentiable aggregation. We realize this through the Spiking Discrete Aggregation (SDA) module, which is inspired by the threshold-based spike firing mechanism in Spiking Neural Networks (SNNs) and implemented using gated recurrent spiking neurons. Additionally, we introduce the Multi-Timescale Fusion (MTF) method which leverages coarse-grained temporal features from continuous event streams to further enhance the representation capability of SDA. Experimental results on neuromorphic datasets demonstrate that our method achieves state-of-the-art performance among all fully spiking architectures while using fewer parameters, reaching 43.4% mAP\textsubscript 50:95 on Gen1 (+ 4.5% over prior art). Moreover, our method exhibits superior robustness under noisy conditions and shows strong compatibility with non-spiking models.

Abstract:
Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training-inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.

Abstract:
Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.

Abstract:
WiFi sensing offers passive and privacy-preserving perception that complements vision-based sensing, but its performance degrades sharply under domain shifts caused by changes in environment, subjects, or hardware. This challenge is exacerbated in real-world deployments where source data are unavailable, motivating test-time adaptation (TTA) as a practical solution for self-calibration using only unlabeled target samples. We introduce WiTTA-Bench, the first comprehensive benchmark for WiFi TTA, covering 20 representative methods, two adaptation protocols (OTTA and TTDA), and three major physics-induced shifts in WiFi: cross-environment, cross-subject, and cross-device. Furthermore, we contribute a new dataset featuring paired recordings from heterogeneous devices to bridge the cross-device gap. Extensive experiments reveal three key insights unique to WiFi sensing: (i) WiFi domain shifts exhibit a physics-induced hierarchy; (ii) OTTA and TTDA are complementary; (iii) OTTA is generally more robust to hyperparameters, while TTDA is more sensitive due to recursive self-training. WiTTA-Bench establishes the first systematic foundation for adaptive, robust, and deployable WiFi sensing under realistic wireless conditions.

Abstract:
Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.

Abstract:
Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal alignment and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

Abstract:
Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.

Abstract:
Cross-view geo-localization (CVGL) aims to estimate an image's geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and diversity of drone and ground perspectives, enabling more realistic and flexible evaluations of CVGL. Additionally, we propose Cross-Attention-based Matching Enhancement (CAME), a unified framework for CVGL. By dynamically aggregating contextual information from top-ranked candidates, CAME refines feature representations and enhances cross-view matching robustness. Experimental results show (1) The Proposed UniGeoRS benchmark is necessary for training and evaluating the CVGL model across all three platforms. (2) UniGeoRS improves model generalization across diverse conditions. (3) CAME consistently boosts performance across state-of-the-art CVGL approaches.

Abstract:
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they directtheir attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.

Abstract:
This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.

Abstract:
Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensure that the retained tokens capture holistic video content while exhibit strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.

Abstract:
Motion transfer from the driving to the source portrait remains a key challenge in the portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation.

Abstract:
Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top\text - k matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.

Abstract:
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.

Abstract:
Sketch colorization models have been widely studied to automate and assist in the creation of animation frames and digital illustrations. However, current methods are still not satisfactory for industrial standard applications in high-resolution synthesis and precise controllability of details. To further enhance the synthesis quality and controllability, we propose an image-referenced sketch colorization method based on the powerful SDXL backbone and leverage sketches as spatial guidance and RGB images as color references. A split cross-attention mechanism is coupled with spatial masks to separately colorize the foreground and background regions to avoid spatial entanglement. A tagger network trained on a massive anime-style image dataset is employed to extract attribution-level information from reference images and integrated into the pipeline to provide precise control signals for synthesis. However, the increased resolution and number of attention layers in the SDXL backbone and precise reference information from the tagger network cause severe entanglement during colorization. We consequently combine a foreground encoder and a background encoder for disentanglement and better synthesis quality. Furthermore, a high-quality annotated and paired sketch colorization dataset is collected for fine-tuning. The proposed method is the first to achieve high resolution high quality sketch colorization with precise control, and obvious outperforms existing methods in quantitative and qualitative validations, as well as user studies in both quality and controllability. Ablation study reveals the influence of each component. Code and dataset will be made publicly available upon paper acceptance.

Abstract:
Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.

Abstract:
Multi-taskdensepredictionaims tosimultaneously addressmultiplepixel-level tasksthroughaunifiednetworkforvisualsceneunderstanding.However,adverse environmental conditions limit thegeneralizationand practicality of such tasks. Toaddress this, we proposeALLNet, anovel framework that effectively explores degradationpatterns and integratesmulti-task collaborative information. Specifically, we designa MoE-basedMixtureofAdaptiveExperts(MaE)restorationcomponentnetworkthatenhancesdegradationfeaturesthroughdynamicroutingandguidestask-specific featureextraction. Furthermore,weformulateaTaskawareCollaborativeRefinement (TCR)moduletocaptureglobalsemanticcorrelationsandcross-taskdependencies,facilitatingbidirectionalcollaborationbetween restorationandtask-specific featuresondegradedimages. Tothebestofourknowledge, thisrepresentsthe firstattemptatmulti-taskdensepredictionunderimage degradation.ExperimentalresultsondegradedNYUDv2andPASCAL-Contextbenchmarksdemonstratethat ourarchitecturesignificantlyoutperformsexistingmethodsindegradedscenarios.

Abstract:
The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color sequence. This results in simultaneous multi-view capture of the scene, in which high-speed temporal information is encoded in the images' color. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.

Abstract:
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT).They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions.However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models.In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals.We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.

Abstract:
Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on PhotoTourism and a new 20-scene MegaScenes benchmark demonstrate state-of-the-art feed-forward reconstruction quality, achieving real-time inference without test-time optimization.

Abstract:
Color transfer aims to match the color distribution of a content image (source) to that of a style image (target) while preserving structure and perceptual realism. Yet modulation-based flow models such as ModFlows often produce trajectory misalignment and artifacts because they rely on strictly linear transport paths. We propose NCT, a nonlinear color transfer framework that replaces linear paths with Bezier trajectories, enabling smooth, nonlinear, and perceptually coherent color transfer. This parameterization lets the transport bend toward plausible intermediate color regimes, improving content-style alignment and reducing chromatic distortion. We further incorporate a Mixture of Experts (MoE) module in the encoder to select trajectory experts for different chromatic regimes, improving generalization to heterogeneous data with complex illumination and materials. Experiments show that NCT reduces artifacts and achieves more stable color transfer than prior flow-based methods, especially on 3D-rendered or highly textured images. The code is provided in supplementary materials.

Abstract:
Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps--Awareness, Interpretation, and Imagination--each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities.

Abstract:
We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.

Abstract:
This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos.Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks.Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations.This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods.To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

Abstract:
Pre-trained text-to-image (T2I) generative priors have shown success in depth and normal prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we repurpose the editor's discarded region for a cost-free joint estimation of depth and normals, which improves the inference efficiency. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100xdata.

Abstract:
Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moire patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.

Abstract:
Conventional physically-based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework RenderFlow built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.

Abstract:
Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

Abstract:
Disentanglement-based methods for learning shared representations are widely used in multimodal sentiment analysis. However, existing methods often overlook potential emotional conflicts across modalities within the same sample. They typically adopt intra-modal reconstruction and rely on similarity losses to align shared representations, which may distort the shared semantics under such conflicts. To address these, we propose a Conflict-aware Adaptive Cross-Reconstruction approach (CACR). First, we formally define emotional conflict and design a conflict-aware weighting strategy. This strategy calculates conflict scores for each modality and maps them to the corresponding cross-reconstruction weights. Second, we construct a cross-reconstruction module. It reconstructs each target modality using its own specific features and the shared features of other modalities, achieving implicit alignment of shared representations. Coupled with the aforementioned weights, this module effectively suppresses conflicting modalities and mitigates semantic ambiguity. Furthermore, we develop a fine-grained sentiment refinement module that captures fine-grained cues from visual- and audio-specific features to supplement textual semantics. Extensive experiments on three datasets show that CACR outperforms existing state-of-the-art methods, demonstrating its effectiveness in handling emotional conflict.

Abstract:
Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific intelligence. This task is inherently difficult because physical knowledge is low-dimensional and structured, whereas raw video observations are high-dimensional and redundant, with most pixels carrying little or no physical meaning. Extracting concise, physically relevant variables from such noisy data remains a key obstacle. To address this, we propose Pixel2Phys, a collaborative multi-agent framework adaptable to any Multimodal Large Language Model (MLLM). It emulates human scientific reasoning by employing a structured workflow to extract formalized physical knowledge through iterative hypothesis generation, validation, and refinement. By repeatedly formulating, and refining candidate equations on high-dimensional data, it identifies the most concise representations that best capture the underlying physical evolution. This automated exploration mimics the iterative workflow of human scientists, enabling AI to reveal interpretable governing equations directly from raw observations. Across diverse simulated and real-world physics videos, Pixel2Phys discovers accurate, interpretable governing equations and maintaining stable long-term extrapolation where baselines rapidly diverge.

Abstract:
We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency--a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (i) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (ii) Group Relative Policy Optimization, which we empirically validate, yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves SOTA performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1 performance leap on spatio-temporal reasoning tasks.

Abstract:
The Mean Flow Matching algorithm is the state-of-the-art for one-step generative models. Building on this idea, we propose the Stable Mean Flow algorithm and introduce a Lyapunov-inspired stability regularizer that enforces local non-expansivity of the single-step transport map. This design guarantees uniqueness of characteristics and bounds trajectory drift. We conduct experiments that show improved output quality and convergence speed over Mean Flow. Moreover, we establish explicit upper bounds on error growth for both one-step and multi-step generation.

Abstract:
Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as sparse representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2xspeedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

Abstract:
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence.

Abstract:
Estimating the 3D motion of scene points from 2D observations, typically parameterized by optical flow and motion in depth, is a fundamental problem in computer vision. Existing learning-based methods usually rely on supervised regression from densely labeled data, but their dependence on annotations and limited use of geometric constraints restricts generalization, motivating unsupervised solutions. Unsupervised 3D motion estimation is challenging because motion along the viewing direction is unobservable, and optical flow and motion in depth are geometrically coupled, making their separation ambiguous. Event cameras capture per-pixel brightness changes asynchronously with microsecond latency, providing high temporal resolution and motion continuity. Projecting event streams along different axes reveals spatiotemporal expansion and contraction patterns that encode depth variation and geometric structure, offering rich cues for unsupervised estimation. Leveraging these properties, we propose an unsupervised event-based 3D motion estimation framework that jointly models optical flow and motion in depth. We first derive an analytical relationship to infer initial motion in depth from estimated flow and further refine it using a directional expansion modulation module that captures horizontal and vertical expansion-contraction patterns in event projections. Finally, motion in depth is incorporated into optical flow warping under a contrast maximization objective. Experiments on the CarlaEvent3D dataset show that our method achieves competitive accuracy and strong generalization, advancing unsupervised 3D motion estimation in the event domain.

Abstract:
We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision, and by leveraging an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.

Abstract:
Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face 2 inevitable challenges:(1) decrease of clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training of all three advanced attacks.To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy.In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from cover and previous encoded images. We also propose a quality-aware early-stop to further guarantee the lower bound of visual quality.Extensive experiments demonstrate AdvMark outperforms with the highest image quality and comprehensive robustness, i.e. up to 29%, 33% and 46% accuracy improvement for distortion, regeneration and adversarial attacks, respectively.

Abstract:
Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.

Abstract:
Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models.However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities--such as aesthetics, composition, and visual harmony.In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes.Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that use the pre-trained diffusion model as a powerful evaluative backbone.A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways.First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in GRPO algorithm, leading to more stable and effective alignment.Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes.Extensive experiments confirm that our approach significantly enhances the final quality of generated images.

Abstract:
In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence.

Abstract:
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability--minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%.

Abstract:
Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

Abstract:
Rearranging disarrayed objects to their intended goal states requires the agent to comprehend the changes that have occurred in the scene and to reason about the process of these changes. To address this, we propose a novel perspective on the visual rearrangement task, drawing inspiration from the diffusion processes in molecular thermodynamics. We model the room shuffle and unshuffle stages as the forward and reverse processes of diffusion. In contrast to conventional methods that rely on scene modeling and differential comparisons, our approach provides insight into the intrinsic evolution process between the goal and initial states of the scene, which allows for a more reasonable rearrangement of objects through fine-grained and progressive denoising steps with high confidence. By analyzing the task objectives, we represent the scene via spatial distributions of objects and model the visual rearrangement process using a diffusion bridge model. Building upon this, we introduce the Diffusion Rearrangement model, which takes point cloud data as input, fits it into Gaussian mixture distributions to represent the states of objects, and predicts the rearrangement target through an iterative denoising transformer. Experimental results on the RoomR dataset demonstrate the effectiveness of our approach.

Abstract:
Spatial transcriptomics provides both spatial coordinates and gene expression profiles, enabling the study of tissue organization and cellular heterogeneity. Despite recent progress, current spatial clustering methods still face two major limitations. First, representations learned from spatial and expression views often differ due to view-specific noise and incomplete structural information. Without enforcing sample-level cross-view consistency, embeddings from the two views may not correspond to the same biological identity, reducing discriminative capability. Second, existing approaches lack effective semantic-level supervision. Although node embeddings capture local neighborhood patterns, they do not explicitly reflect high-level semantic structures. Prototype-based modeling can provide such semantic abstraction, yet current methods seldom align prototypes with node representations, leading to weak semantic consistency. To overcome these issues, we propose a Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics (MHAL). At the sample level, MHAL introduce positive sample alignment to enforce consistency between spatial and expression embeddings. At the semantic level, MHAL design prototype level contrastive learning, where prototypes act as semantic anchors and guide the formation of coherent cluster structures. Together, these two alignment mechanisms progressively ensure both local consistency and global semantic discrimination. Extensive experimental results demonstrate that the proposed hierarchical contrastive multi-view clustering method achieves competitive performance in spatial domain identification compared to other state-of-the-art methods.

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual reasoning across diverse tasks. These tasks place different demands on visual representations: some prioritize high-level global context, while others emphasize fine-grained local details. However, most existing methods operate on visual representations primarily in the spatial domain, lacking an explicit mechanism for distinguishing between high-frequency local details and low-frequency global context. This limitation hinders fine-grained control of visual representations and complicates their hierarchical alignment with language. To address this issue, we introduce Language-guided Frequency Modulation (LFM), a plug-and-play approach that adaptively refines visual signals in the frequency domain under linguistic guidance. By selectively enhancing critical regions and details, LFM enables more structured and precise visual processing. Crucially, it adds no extra training parameters, relying solely on a lightweight learnable projector to refine visual tokens before integration into the LLM, thereby ensuring minimal computational overhead. Extensive experiments across diverse vision-language benchmarks highlight LFM's scalability, effectiveness, and broad applicability to LVLMs.

Abstract:
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross-attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme in a self-supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization.

Abstract:
This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete object geometry from partial RGBD observations and grounds affordances on the full shape including unobserved regions. Our approach introduces sparse voxel fusion of multi-view features for constant-complexity generative reconstruction, a flow-based formulation that captures the inherent ambiguity of affordance distributions, and an active view selection strategy guided by predicted affordances. Affostruction outperforms existing methods by large margins on challenging benchmarks, achieving 19.1 aIoU on affordance grounding and 32.67 IoU for 3D reconstruction.

Abstract:
Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using a Low-Rank Adaptation (LoRA) with token-specific up matrices that reduces the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.

Abstract:
Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss that calibrates the teacher's group-level predictions and (2) a reweighted within-group loss that ensures equal contribution from all groups. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby showing its ability to distill balanced knowledge from a biased teacher for real-world applications.

Abstract:
Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

Abstract:
The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives--translation, rotation, and gripper control--providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.

Abstract:
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

Abstract:
Composed Video Retrieval aims to retrieve a target video given a reference video and a textual modification describing the desired change. The core challenge lies in modeling compositional multimodal transformations, i.e., how entities, actions, and scenes evolve across video and language modalities in response to fine-grained textual edits. Existing methods address this issue by training on large-scale video-text-video triplets or by generating dense textual descriptions to capture subtle visual differences. However, these supervised approaches often rely on noisy web-scale data and dataset-specific correspondences, leading to overfitting and limited generalization in diverse or fine-grained scenarios, while also failing to effectively model compositional and temporal transformations. We propose Multi-objective Reasoning (MoRe), a zero-shot framework based on MLLMs for multi-objective candidate selection and fine-grained transformation reasoning. Our method decomposes the compositional transformation into three complementary reasoning dimensions, i.e., entity, action, and scene, and performs pairwise candidate reasoning to explicitly capture semantic evolution over time. Furthermore, we introduce a recall-oriented multi-objective candidate selection module that identifies high-quality retrieval targets by jointly balancing visual, textual, and multimodal similarities before transformation reasoning. Experiments on EgoCVR and WebVid-CoVR demonstrate the effectiveness of our method over state-of-the-art approaches under the zero-shot setting, with R@1 improvements of +5.8 and +10.8, respectively.

Abstract:
As an important research task of human cultural heritage, the restoration of artworks and calligraphy is of great significance. Seldom existing works have taken the multi-source (i.e., fragments are not ensured to be from the same piece of artworks) fragment-oriented restoration task into account. We propose ShreddingNet, a coarse-to-fine two-stage pipeline for multi-source manuscript restoration that operates without restrictive conditions. The proposed coarse stage compares the features of each fragment, selecting top-K candidates and clustering fragments by source. This design leverages the key insight that erroneous matches rarely cross source boundaries, enabling high-precision clustering. The proposed fine-grained stage evaluates candidates, yielding matching scores and filters out erroneous matching pairs from the candidate set; producing more precise final matching pairs for global assembly. Experiments conducted on more than 4,000 images from two datasets demonstrate the average reconstruction F1-score achieves 98.37%, which is 5.72% higher than the current state-of-the-art method, confirming the method's effectiveness and robustness. Source code is available at github.com/tqychy/shreddingnet.

Abstract:
Pedestrian trajectory prediction is crucial for applications such as autonomous driving and social robots. Recently, language model (LM)-based trajectory prediction has offered both prediction accuracy and interpretability. However, the L2 loss commonly used in trajectory prediction cannot be directly applied to LM optimization, resulting in degraded prediction performance. Moreover, current LM-based trajectory prediction methods lack explicit expressions of social interactions, and their scene descriptions are overly simplistic, making it challenging to impose practical scene constraints. To address these issues, we propose Write-to-Walk (W2W). First, we convert observed trajectories and interaction cues (companion/following/obstacle) into parsable textual prompts, so that interaction semantics are expressed more explicitly in the model input. Afterward, a T5-Small backbone is trained in a two-stage manner: (1) Full-parameter supervised fine-tuning with cross-entropy loss for language learning, enabling formatted question answering; (2) Reinforcement Learning (RL) to optimize W2W, where a reward function combining ADE error and off-road penalties strengthens scene constraints, producing future trajectories consistent with the scene context and further improving prediction accuracy. Experiments on benchmark datasets (ETH/UCY and SDD) demonstrate that W2W remains competitive with recent LM-based prediction methods and strong trajectory prediction baselines on ADE/FDE. The interpretability of LMs further supports W2W's deployment in safety-critical domains such as autonomous driving.

Abstract:
Visual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging--particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline, called SLAM-MER, which is implemented from scratch in C++ explicitly leveraging the spatio-temporal structure of the SLAM problem for improved localization, and has modular design to easily integrate off-the-shelf components. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. We complement this by incorporating a spatial representation based on a 3D cell-based scene model, enabling efficient retrieval of relevant 3D points from previously reconstructed regions. Leveraging recent feed-forward geometry estimators, our hybrid design combines sparse keypoint-based localization with a semi-dense anchor-point-driven spatial representation. This integration allows us to achieve real-time performance (exceeding 80 FPS) and a substantial efficiency improvement compared to existing uncalibrated monocular SLAM pipelines, while maintaining or improving localization accuracy.

Abstract:
Unsupervised deraining has attracted attention for its ability to learn the real-world distribution of rain without paired supervision. However, the lack of strong constraints makes it difficult for the network to converge, especially with the complex diversity of rain degradation. A key motivation is that high-quality deraining results occasionally emerge during training, which can be leveraged to guide the optimization process. To overcome these challenges, we introduce RGSUD (Reward-Guided Self-Reinforcement Unsupervised Image Deraining), comprising two key stages: reward recycling and self-reinforcement (SR) training. In the reward recycling stage, we propose an Image Quality Assessment (IQA)-based dynamic reward recycling mechanism that selects optimal derained outputs during training and continuously collects high-quality deraining images. In the SR training stage, we incorporate these rewards into the model's optimization process, constraining the optimization space and improving alignment between derained outputs and clean images. By leveraging IQA-based self reinforced loss and dynamically updated rewards, we enhance the quality of synthesized pseudo-paired data and stabilize the optimization. Extensive experiments demonstrate that our method achieves SOTA performance across multiple datasets, including paired synthetic, paired real, and unpaired real images, outperforming existing unsupervised deraining approaches in both subjective and objective IQA metrics. Additionally, we show that the Self Reinforcement Strategy is adaptable to other unsupervised deraining methods.

Abstract:
Synthesizing realistic human motion from natural language holds transformative potential for animation, robotics, and virtual reality. Recent methods handle single-action sequences and simple textual instructions, yet multi-action compositions and precise editing remain elusive due to limited data diversity, inadequate representations, and fragmented pipelines. Critically, most existing methods train motion generation models from scratch, failing to exploithe rich action semantics and long-horizon reasoning already encoded in pretrained Multimodal Large Language Models (MLLMs). Here we show that finetuning a pretrained MLLM with large-scale motion data yields strong zero-shot generalization across diverse text-guided motion generation and editing tasks. We present MotionMaster, a unified framework built on three components: MotionGB, a 10,000-hour dataset expanded from 400 hours of verified motion capture via spatial-temporal augmentation; an FSQ-based tokenizer that preserves both local joint accuracy and global trajectory coherence; and a finetuned MLLM with motion and language tokens in a shared embedding space. MotionMaster outperforms prior methods by 41.6% in multi-action semantic consistency and 20.8% in body-part composition. These results demonstrate that pretraining knowledge from MLLMs transfers effectively to motion understanding, opening a viable path toward general-purpose motion intelligence.

Abstract:
We consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres) is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points on the critical surface. By exploiting the geometry of the so-called "homaloidal net of conics", we are able to design a simple and very practical test that requires the linear estimation of a quadratic transformation from image correspondences. Our test does not require a fundamental matrix in advance and turns out to be more stable than its closest competitor, as shown in our experiments on both synthetic and real-world degenerate configurations.

Abstract:
Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves 7.9-49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.

Abstract:
Brachytherapy is a complex radiation oncology problem that requires the simultaneous prediction of radiation dose, which is used for treatment planning, and a set of machine parameters, known as dwell times, used for treatment delivery. We propose Direct Dwell Time Transformer (D2T2), the first deep learning architecture that directly predicts dwell times during dose prediction. D2T2 is a two stage model, where the first stage predicts a vector of dwell times and the second implements the physical model of radiation delivery, namely a linear combination of radiation dose kernels. Besides the automatic prediction of dwell times, this has the benefit of constraining the model to make physically plausible dose predictions, when trained end-to-end. To enhance this training, we also propose a new loss function, denoted as the gamma loss, based on the prediction of the gamma index, which is the gold standard of dose comparisons. This is implemented by training a model to predict the latter using a synthetic dataset of groundtruth and predicted dose pairs. We train D2T2 on a large dataset of ~5,000 clinical brachytherapy plans---the largest such dataset to date---spanning gynecological, breast, and other treatment sites. Results demonstrate that D2T2 outperforms existing methods in both accuracy and speed. Notably, D2T2 produces deliverable plans and physically valid dose distributions in a single forward pass, for any application of brachytherapy, hours faster than manual planning and minutes faster than more recent automated methods.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derive a perception checklist---a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.

Abstract:
Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction.However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences.Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation.To achieve this without annotations, we propose a three-stage framework.In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes.Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions.Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives.Experiments on R2R-CE and RxR-CE show substantial gains in success, efficiency, and interpretability, demonstrating semantic progress provides a more consistent and generalizable representation of navigation advancement.

Abstract:
Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

Abstract:
Micro-expression Recognition (MER) is challenging due to the subtle motion. Existing methods heavily rely on the onset/apex pair to capture the most discriminative motion clues. This paradigm struggles with labor-intensive apex annotation and effective utilization of data. In this paper, we propose a novel paradigm for MER that eliminates the need for expensive apex annotations while effectively capturing subtle motion dynamics. Our key insight is that frames within the sequence exhibit spatially consistent and intensity varied motion cues relative to the onset frame. Motivated by this, our method treats each sequence as a set of multiple onset/near-median motion instances. To fully exploit weaker motion information conveyed by these varied instances, our framework introduces an Instance Region Consistency (IRC) module that enforces visual attention consistency on similar facial activation regions across different instances within the same set. Furthermore, we present a Multi-Region Discovery (MRD) module with self-supervised learning to expand attention on more subtle activation regions which are typically neglected. Extensive experiments on four public micro-expression datasets demonstrate that our proposed approach surpasses state-of-the-art methods without using any apex frame annotations.

Abstract:
In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just around 100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.

Abstract:
We present EvoID, a novel framework that reformulates Identity-Preserving Video Generation as a self-evolving process through Reinforcement Learning. Moving beyond the static paradigm of imitation learning, EvoID enables a generative model to actively learn and optimize the complex trade-offs between identity fidelity, motion naturalness, and temporal coherence. At the heart of our EvoID is a dynamic, dual-path reward mechanism, which acts as an intrinsic critic by adaptively combining objective metric indicators and MLLM-based holistic quality assessment. This allows the model to "evolve" its generation strategy, focusing on different aspects of quality at different stages of training. To ensure stable and coherent evolution, we anchor the exploring Student model with a frozen Teacher, preserving robust world priors while allowing for creative refinement when generating videos. Extensive experiments demonstrate the superiority of our proposal, and EvoID achieves the total score of 0.687 on the Human-Domain of OpenS2V-Eval dataset, surpassing 0.658 of the open-source VACE and 0.653 of the commercial Hailuo. Moreover, EvoID also obtains a new record of 0.718 on our newly minted MLLM-based metric, prioritizing human perception and more comprehensively reflecting video quality.

Abstract:
Handling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency. Our code will be made publicly available upon paper acceptance.

Abstract:
Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3D transform signal for conditioning the world model at inference time. The persistent scene memory integrates predictions over time, enabling online updates and consistency under uncertainty. By combining learned inference with explicit geometric structure, P3Sim balances data-driven flexibility with built-in inductive bias. This design yields a flexible perceptual simulator that generalizes across diverse 3D transformation tasks, such as novel view synthesis, object manipulation, and dynamic scene prediction, advancing toward general purpose 3D scene understanding and transformation.

Abstract:
Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantic-ambiguous actions--interactions whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) a hardness-driven exploration that guides data collection toward ambiguous yet informative interactions, and (2) an alignment-guided refinement that iteratively validates and repairs instruction-execution consistency. The two modules operate in a closed-loop manner--exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.

Abstract:
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

Abstract:
We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state-of-the-art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.

Abstract:
We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering.We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

Abstract:
Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable Latent Rewards. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors, offering a data-efficient and generalizable solution to bridging the gap between comprehension and synthesis.

Abstract:
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

Abstract:
Although recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (i.e., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose LatentWorld, a framework for generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both ego and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents through time. Our framework produces physically consistent and temporally coherent 4D scenes, supporting controllable and realistic generation.

Abstract:
Visual Speaker Authentication (VSA) verifies identity by analyzing lip dynamics during prompted speech, offering enhanced privacy compared to full-face methods while maintaining discriminability for high-security applications. However, recent advances in talking face generation (TFG) have enabled realistic forgeries that closely mimic lip dynamics in sync with speech, posing severe threats to VSA systems. Prevailing defenses rely heavily on supervised classifiers trained on known forgeries via empirical risk minimization, resulting in poor generalization to unseen attacks, dependency on continuously updated fake data, and complete failure in the absence of effective forgery priors. In this paper, we revisit the design of forgery detectors and argue that over-reliance on fake priors hinders the exploitation of rich authenticity signals inherently present in real videos. We propose a novel detector trained exclusively on authentic data, learning forgery-aware representations through three key components: (1) lightweight modules that capture forgery-indicative statistics from real videos; (2) an asymmetric contrastive objective that compacts real samples while repelling potential forgeries in representation space; and (3) a theoretically grounded regularizer that shapes real representations into a tractable, isotropic Gaussian. To support rigorous evaluation, we introduce a benchmark suite spanning diverse TFG forgeries. Across eight modern forgery attacks and ten state-of-the-art (SOTA) detectors, our method achieves over a 10% reduction in error rates while preserving identity-verification capability with minimal overhead, and demonstrates consistent gains on datasets that better emulate real-world scenarios.

Abstract:
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.

Abstract:
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

Abstract:
While the most fundamental pretraining paradigm typically trains modality-specific models on their respective datasets, the Platonic Representation Hypothesis that representations eventually align across modalities as data and model scale suggests an intriguing possibility: large language models (LLMs) could be pretrained on visual corpora to reach parity with text-pretrained models, thereby expanding data sources to break the text-scaling bottlenecks, and leveraging richer visual cues for more comprehensive corpus understanding. This paper makes the first attempt to demonstrate the feasibility of this implication by introducing Masked Autoregressive Pretraining for Learning language intelligencE (MAPLE), a novel visual pretraining paradigm for LLMs that leverages raw document images to improve language intelligence. MAPLE is universal to integrate masked auto-regressive models with various LLM backbones, where the LLMs are incentivized to generate latent hypotheses for the masked regions based on the unmasked regions. We verify MAPLE in the domain of math reasoning with multiple LLM backbones and show that MAPLE consistently surpasses text-only pretraining relatively by at most 40.2% on average accuracy across four math reasoning benchmarks. Further analyses show that visually pretrained LLMs learn a shared latent space that aligns document visuals with text and exploits layout and structural cues, supporting visual pretraining as a feasible and scalable route to stronger language models.

Abstract:
One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed.In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime -- including within the ablations several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3x bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images as fast as 230ms, and decodes them in 150ms -- faster than most top ML-based codecs run on a V100 GPU.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated promising advancements in augmenting the capabilities of LLMs to comprehend visual input. However, modality misalignment between vision and text remains a key challenge in MLLM, which can be attributed to two aspects: misalignment of modality-specific representations and depletion of modality-specific details. To address the issue of modality misalignment, we propose DeepAlign, a novel multimodal alignment framework to mitigate modality conflict, which employs representation intervention and structure-induced knowledge distillation to prevent the misalignment and depletion of modality-specific information. Extensive experiments demonstrate that DeepAlign significantly mitigates modality conflicts, leading to substantial performance improvements compared to backbone models across multiple vision-language tasks. It also stimulates some emergent abilities in MLLMs, such as multimodal in-context learning on interleaved text-image sequences.

Abstract:
We introduce VisualToolAgent (VisTA), a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and compose tools from a diverse library based on empirical performance. Existing methods for tool-augmented visual reasoning either rely on training-free prompting or large-scale supervised fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, guided solely by task outcomes. Leveraging reinforcement learning with verifiable rewards (RLVR), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, MathVerse, and BlindTest benchmarks demonstrate that VisTA achieves significant performance gains over training-free and fine-tuning baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.

Abstract:
The efficiency of hyperparameter optimization (HPO) is critical for deep learning, yet state-of-the-art methods share a fundamental flaw: they are difficulty-agnostic, treating all hyperparameter configurations homogeneously. This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. We first model optimization difficulty hierarchically. Then, departing from hand-crafted priors, we train a conditional diffusion model on 120,000 real learning curves, generating synthetic data with 2.3xhigher fidelity. This data trains our difficulty-aware surrogate model and acquisition function to dynamically adapt the search strategy. Across 75 tasks, DABO reduces regret by 11-18% compared to the leading difficulty-agnostic method, ifBO. Our work establishes a new paradigm for HPO, shifting the focus from configuration-centric to difficulty-aware resource allocation to enable more robust and efficient optimization.

Abstract:
Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of 3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.

Abstract:
The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an Linfinity distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details, as shown in Fig. 1.

Abstract:
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization.We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold.On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model.For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images.Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems.

Abstract:
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multimodal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a multimodal Mamba framework for efficient text-to-video generation. Specifically, a MultiModal diffusion Mamba (MM-DiM) block is designed within the framework to enable seamless integration of multimodal information and spatiotemporal modeling. In detail, we introduce a novel multimodal token re-composition design, which employs a bidirectional scheme for multimodal information integration through simple token arrangement, along with visual registers to enhance spatial-temporal consistency. As a result, the MM-DiM blocks in M4V reduce FLOPs by 45% compared with the attention-based alternative when generating videos at 768x1280 resolution. Additionally, several training strategies are explored in this work to provide a better understanding of training text-to-video models using only publicly available datasets. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code will be made publicly available.

Abstract:
We present IntrinsicWeather, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. IntrisicWeather outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.

Abstract:
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

Abstract:
Learned image compression (LIC) outperforms traditional codecs but suffers from excessive peak memory usage when handling high-resolution images. Consequently, block-based LIC has been studied to reduce peak memory and peak computational cost, but it often introduces blocking artifacts that degrade visual quality. To mitigate this, the JPEG-AI standard introduced a patch-based scheme in which overlapped blocks are coded independently using empirically determined overlap sizes. However, experimentally searching for optimal overlaps is time-consuming and does not guarantee blocking-free reconstruction. In this paper, we propose an analytic framework that models overlap propagation through convolution and transposed convolution layers to precisely determine the minimal overlaps required for blocking-free reconstruction. Based on the calculated minimum overlaps, we provide a block-based implementation methodology applicable to most CNN-based LIC models. Applied to four CNN-based LIC models on 4K images partitioned into various block sizes (256 x 256, 512 x 512), our method achieves rate-distortion performance identical to full-image coding while reducing average peak memory usage to 13.94% (encoder) and 13.33% (decoder), and average peak computational cost to 2.6% and 1.24%, respectively. Notably, the proposed block-based framework does not require any retraining of the original model. Furthermore, it can also be applied to most CNN-based image-processing neural networks without performance degradation.

Abstract:
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as a new language, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities.

Abstract:
Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.

Abstract:
Human sketches provide a compact and expressive form of visual communication, but their sparse structural cues, while capturing essential object structures, introduce ambiguity because a single sketch can correspond to multiple plausible images, making cross-domain alignment uncertain and unstable. Such ambiguity fundamentally limits sketch-based vision tasks that rely on precise sketch--image correspondence. To address this challenge, we introduce AmbiScore, a metric that quantifies the ambiguity of sketch-image pairs, and use Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) as a testbed to reveal how ambiguous supervision leads to performance collapse in existing methods. We further propose DisAmb (Disentangling Ambiguity), a framework that explicitly models and mitigates ambiguity through two components: (1) Elastic Matching, which adaptively adjusts supervision strength using AmbiScore, and (2) Purified Matching, which employs ambiguity-agnostic masks to disentangle structure and appearance via shape jigsaw and texture swapping. DisAmb establishes new benchmarks under high ambiguity and provides a robust, transferable supervisory signal for downstream sketch-guided tasks.

Abstract:
Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.

Abstract:
Current dehazing methods face two intertwined challenges: (1) the misidentification of haze-related features due to domain-specific interference in both single-domain and empirically integrated multi-domain approaches, and (2) severe chromatic distortion caused by haze-induced color entanglement. To overcome these limitations, we propose a unified framework centered on a Cross-domain Invariant Manifold (CIM), which aligns multi-domain features into a unified latent space through shared scattering semantics. The manifold is optimized via consensus-density-driven contrastive learning, effectively enhancing cross-domain consistency while eliminating domain-specific biases. Building upon this structured foundation, we further introduce a disentanglement-wise architecture, i.e., Physics-Guided HSV Decomposition Network, that explicitly separates entangled color components to ensure robust color fidelity. Comprehensive experiments demonstrate that our CIM-D framework achieves state-of-the-art performance, effectively eliminating haze-induced color shifts and restoring natural scene appearance.

Abstract:
Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: https://unico-completion.github.io.

Abstract:
Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single static radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections--traditionally treated as noise--encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle of Arrival and Angle of Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality hierarchical diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to single static radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer distance by 60% (down to 16 cm) compared to the state of the art in mmWave layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar. Our website and code are available at https://rise-cvpr.github.io.

Abstract:
The rise of AI agents powered by large language models (LLMs) has transformed intelligent systems by enabling autonomous tool utilizing, reasoning, and action across diverse tasks. Despite this rapid progress, existing video summarization approaches primarily focus on feature extraction or frame-level importance regression but lack the autonomous reasoning, self-correction, and decision-making capabilities that define true agent-based intelligence. To bridge this gap, we propose AgenticVS--the first agentic workflow for video summarization that leverages multimodal large language models (MLLMs) to complete the summarization-verify-reflection loop in a fully autonomous manner. Rather than designing new architectures for feature extraction or regression, we exploit the understanding and reflective reasoning abilities of MLLMs to build an adaptive summarization framework with a self-reflecting workflow. Experiments on SumMe and TVSum demonstrate that our agentic workflow outperforms state-of-the-art methods, enhancing interpretability, adaptability, and paving the way for agent-based multimodal video understanding.

Abstract:
Diffusion Transformer (DiT)-based models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.

Abstract:
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Code and data will be released upon acceptance.

Abstract:
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics for a systematic analysis of LVLM SAEs. This uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) user-desired concepts are often absent in the SAE, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE)--a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks.

Abstract:
Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that combines 2D opaque surfels with semi-transparent boundaries to represent coarse-scale geometry and appearance, and 3D Gaussians surrounding the surfels to supplement fine-scale details. We employ Depth Peeling to achieve accurate per-pixel ordering for surfel rendering, which enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior novel view synthesis quality and compares favorably against state-of-the-art techniques across a wide range of scenes.

Abstract:
Automated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraint, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance in spatial-relational reasoning and physical metrics. Moreover, our method significantly outperforms baselines in scene consistency and stability during pre- and post-physics simulation, proving its capability to generate simulation-ready environments for embodied AI agents.

Abstract:
Federated Prompt Learning (FPL) has recently attracted increasing attention for its ability to leverage large-scale vision-language models such as CLIP within federated learning frameworks. While existing studies have advanced FPL through personalization strategies to enhance client-specific performance, personalized models often suffer severe degradation when deployed across unseen domains due to distribution shifts.In this paper, we take the first step toward exploring Test-Time FPL (TTFPL), aiming to bridge the cross-domain performance gap with minimal effort, requiring only unlabeled target-domain data. We propose COTE, a tri-prompt controllable TTFPL framework that dynamically balances three complementary prompts: the global prompt from standard FPL, the local prompt from personalized FPL, and the frozen CLIP prompt.Specifically, we introduce a novel confidence-guided Model-Data Alignment (MoDA) metric in COTE that quantifies alignment at both macro and micro levels, capturing the consistency between model predictions and data distributions. By integrating MoDA with model confidence, COTE adaptively adjusts the contribution of each prompt at test time, enabling robust generalization across heterogeneous clients and unseen domains without requiring labeled data.Extensive experiments on multiple benchmark datasets demonstrate that our method consistently improves target-domain performance, setting a new direction for adaptive FPL.

Abstract:
Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations, thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature(FCF) translation. The key idea is to explicitly identify the representation consistency through codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both OPV2V and DAIR-V2X dataset. Code will be released.

Abstract:
Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content--from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.

Abstract:
Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our code will be released after review.

Abstract:
Point cloud completion is crucial for robust 3D perception but remains challenging. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to `dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generate compact tactile latent representations conditioned on coarse points and sampled touch poses. These generated touches are then used to optimize the coarse geometry. Extensive experiments show that our TouchDream model achieves the state-of-the-art performance.

Abstract:
Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image SR model that operates on mosaicked RAW images and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.

Abstract:
Robots deployed in unstructured environments must coordinate whole-body motion---simultaneously moving a mobile base and arm---to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over 80x faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for State-of-the-Art (SOTA) methods to reach 80% success, confirming that data scarcity--not algorithmic limitations--has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.

Abstract:
Image protection against unauthorized diffusion-based editing has achieved encouraging progress. However, existing methods face two critical limitations: (1) They only disturb the denoising direction at local step, resulting in generated images still retaining original or edited semantics. (2) Their optimization rely heavily on model-specific gradient, limiting transferable protection across different models and tasks. To address these challenges, we propose a Universal Defense (UniDef) framework for protection against unauthorized image manipulation. Specifically, we first discover that different variants of diffusion models tend to pursue a consistent distribution objective during complete denoising process. Based on this discovery, we design Consistent Distribution Deviation strategy to perturb the diffusion direction at the global denoising, thereby disrupting the overall image semantics. Furthermore, to mitigate model dependency, we devise a Finite Difference-based Jacobian Estimation module to approximate the global gradient in a model-agnostic manner, thus ensuring more transferable protection. Benefiting from the above designs, our method yields generated images no longer preserve the image semantic while possessing excellent generalization. Extensive experiments demonstrate that our UniDef not only outperforms existing methods, but also exhibits universal protection across diverse models and tasks.

Abstract:
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

Abstract:
Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP's cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which learns both the LoRA parameters and automatically search the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority against state-of-the-art methods on multiple real-world dehazing benchmarks.

Abstract:
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While the recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex, external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight, efficient, and yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

Abstract:
Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks.In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations.However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability.Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation.In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction.MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks.Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.

Abstract:
Semantic segmentation in unstructured environments presents unique challenges due to irregular terrain, occlusions, and complex spatial layouts. While structured settings (e.g., urban scenes) have been widely studied, segmentation in unstructured settings remains relatively underexplored, both in terms of standardized benchmarking and architectural design. In this work, we propose a encoder-decoder based semantic segmentation architecture that integrates a Reduced Masked Autoencoder (RMAE) as the encoder, a Feature-to-Pyramid (F2P) neck, and a novel decoder called ProGRess. The ProGRess decoder introduces Progressive Leapwise Fusion (PLF) for top-down multi-scale fusion of non-contiguous feature maps, a Lightweight Channel Attention gate with Residuals (LCAR) module, and a Bottleneck Feature Fusion (BFF) block for compact refinement. We establish comprehensive baselines by benchmarking state-of-the-art CNN and transformer-based models on challenging unstructured environment datasets viz. RELLIS-3D, it's coarse-grained variant, and RUGD. Our architecture achieves the state-of-the-art performance with 57.41% mIoU on RELLIS-3D, 45.63% mIoU on RUGD, 78.95% mIoU on RELLIS-3DC datasets while maintaining competitive parameter-count and vRAM usage.

Abstract:
Single Image Super-Resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input, a task that becomes increasingly challenging under real-world computational constraints. However, most efficient SISR methods adopt lightweight, spatially uniform strategies that allocate equal computation and attention across all regions, ignoring the uneven distribution of visual complexity. From an information-theoretic perspective, textures and edges inherently carry more critical information, resulting in reconstruction errors that are disproportionately concentrated in these regions. This motivates the allocation of greater computational resources and attention to these informative areas. In this paper, we propose IAFMNet, an Information-Aware Feature Modulation network for efficient SR. At its core lies the Information Density Map (IDM), which is estimated in an unsupervised manner by minimizing the Information Entropy Loss, thereby highlighting informative regions with high estimated encoding costs. Guided by the IDM, IAFMNet adopts a synergistic dual-branch design: (1) a sparse convolution branch that dynamically allocates computation to informative areas while bypassing low-information regions, and (2) an implicit modulation branch that adaptively emphasizes complex regions through information-aware affine transformations. Extensive experiments demonstrate that IAFMNet effectively identifies informative regions and achieves superior visual fidelity with reduced computational overhead.

Abstract:
Robust perception and reasoning require consistency across sensory modalities. Yet, current multimodal models often violate this principle, yielding contradictory predictions for visual versus textual representations of the same input. Rather than masking these failures with standard voting mechanisms--which amplify systematic biases--we demonstrate that cross-modal inconsistency provides a rich, natural signal for learning. We introduce R-C2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switches modalities, and reliably reconstruct the answer via forward inference, we establish a dense, label-free reward. This cyclic constraint forces the model to autonomously align its representations. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not just from scaling data, but from enforcing a structurally consistent understanding of the world.

Abstract:
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model VideoMaMa that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video MA-V dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MAV to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos.These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

Abstract:
Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce FREUD, a FRame-wise Encoder and United Decoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty through ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling.

Abstract:
Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150m long-range settings, while maintaining highly competitive computation and communication costs.

Abstract:
Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

Abstract:
This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, it often overfits and memorizes training data, limiting their ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination, further highlighting its superiority.

Abstract:
Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a modality-routing mixture-of-experts to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

Abstract:
Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance(CHD), a training-free distribution metric in token space, and Code Mixture Model Score(CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 11 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.

Abstract:
Stochastic Gradient Descent (SGD) and its momentum variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain to dynamically refine gradient estimation by minimizing the mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach could extend to other optimizers, showcasing its broad applicability to optimization frameworks. Extensive experiments across diverse architectures and benchmarks demonstrate SGDF surpasses conventional momentum methods and achieves performance on par with or surpassing state-of-the-art optimizers.

Abstract:
While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP^2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence. Our code will be available at this url.

Abstract:
Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications. To address this problem, in this work, we present a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time. Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point. Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head. The key innovation lies in the dual-branch embedding design, where the global embedding branch integrates geometry-grounded features of the entire video into global query features to optimize track information across the entire sequence and the frame-level branch combines geometry-grounded features of each respective frame into frame-level query features to refine fine-grained track coordinate predictions. Furthermore, to facilitate collaboration between the global branch and the frame-level branch, we introduce an interaction module which enables unidirectional or bidirectional information exchange between the global query features and frame-level query features. Extensive experiments on various point tracking benchmark datasets show that our approach achieves significantly fast spatial tracking speed compared with state-of-the-art methods, while maintaining comparable tracking accuracy.

Abstract:
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.

Abstract:
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness.

Abstract:
Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at https://cass-flowmatching.github.io.

Abstract:
Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application.This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model.Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.

Abstract:
State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose unique challenges in Continual Learning (CL). Without access to the full distribution of previous tasks, updates to the state-space dynamics become unconstrained, leading to catastrophic forgetting. To address this, we propose Inf-SSM, a geometry-aware regularization framework for CL in SSMs. It constrains state evolution via the infinite-dimensional Grassmannian of SSM observability subspaces, without requiring any exemplars from past tasks. Unlike classical CL methods that restrict weight updates, Inf-SSM directly regularizes the infinite-horizon state evolution encoded by the extended observability subspace of the SSM. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs \mathcal O (n^3) complexity. Thus, we develop a \mathcal O (n^2) solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks of ImageNet-R, CIFAR-100, and Caltech-256 demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.

Abstract:
Video-based commonsense captioning aims to generate captions for the video content while providing multiple commonsense about the underlying events. Existing approaches rely on constructing a "video -> content caption -> commonsense" reasoning chain, which generates visually ungrounded commonsense and neglects inter-category commonsense correlations. Firstly, the existing reasoning chain induces the model's excessive reliance on content caption when generating commonsense, resulting in generic outputs with limited visual relevance. Secondly, the reasoning chain adopts multiple isolated decoders for commonsense generation, which fails to leverage the correlations between different categories of commonsense. To address these limitations, we introduce a novel self-critical distillation network (SCD-Net), which optimizes the reasoning chain by enhancing visual reasoning and establishing inter-category commonsense correlations. Specifically, on the one hand, we introduce self-critical learning and design a reward function to allow the model to refine its output. This mechanism incentivizes the model to maximize the utilization of visual information, thus improving the model's capacity for visual comprehension. On the other hand, we propose a joint reasoning distillation framework that fosters mutual inference among diverse commonsense categories. In this framework, we incorporate the cascaded decoder and knowledge distillation strategy to facilitate inter-category commonsense knowledge transfer while maintaining the fairness of the testing. Our experiments on the large-scale Video-to-Commonsense dataset demonstrate that our approach performs favorably against state-of-the-art methods. The code will be released soon.

Abstract:
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

Abstract:
Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation.

Abstract:
Panoramic imagery offers a full 360^\circ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.

Abstract:
Recent T2I diffusion models have evolved to control multiple conditions, including structure, appearance, and text prompt. Despite this progress, training-based methods demand heavy computation, whereas training-free methods often 're-imagine' the subject to satisfy given structure, thereby compromising identity preservation and attenuating fine textures.We propose SeAl (Semantic Alignment for Pose-Invariant Identity Preserving Diffusion), a novel training-free framework that addresses the 're-imagining' problem from the perspective of 'infusion'. SeAl integrates structure, appearance, and text prompt with three modules: AnchorAlign pre-aligns spatial discrepancies, Reference-guided Appearance Infusion injects identity via semantic matching, and Delta-Bridge leverages the guidance delta to mediate text-appearance conflicts. We demonstrate that our method faithfully reflects all three control factors and dramatically reduces the identity leakage endemic to prior methods. Notably, SeAl excels on challenging datasets where identity preservation typically fails (e.g., distinctive animal features or complex human attire), establishing a novel paradigm for training-free identity preservation in diffusion models.

Abstract:
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of inter- mediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a bench- mark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatio-temporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and when to look yields substantial improvements in performance.

Abstract:
We present CUBE (Control-based Unified B-Splinie Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.

Abstract:
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 66.4% of comparisons.

Abstract:
Weighted sampling--sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function--is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves 1.2xto 4.7xspeedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.

Abstract:
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled paradigm that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.

Abstract:
High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, struggling to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The predicted seams and panels align precisely with the multi-view images, yielding structured 2D-3D garment representations suitable for 3D perception, high-fidelity physical simulation, and robotic manipulation.To enable effective training, we construct a large-scale dataset GCD-TS, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency. Code and data will be available at the project sii-liming.github.io/ReWeaver.

Abstract:
In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data.To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts.Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency.We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions.To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation.Finally, we fine-tune multiple models onMICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills.Notably, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation.Our dataset and benchmark will be valuable resources for advancing MICo research.

Abstract:
Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.

Abstract:
How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.

Abstract:
Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic- or G-buffer-based pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that full intrinsic decomposition is unnecessary for accurate relighting. Instead, sparse but physically meaningful cues--indicating where illumination should change and how materials should respond--are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates minimal physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies illumination-sensitive regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that aligns predicted cues with perceptually preferred relighting behavior. We further present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion- and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.

Abstract:
Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view-invariant inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency containing an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts under varying viewpoints. Extensive experiments on three large-scale AGPReID benchmarks including AG-ReID.v2, CARGO and LAGPeR demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code will be available.

Abstract:
Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation.

Abstract:
The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce \texttt SAGA (\underline S ource \underline A ttribution of \underline G enerative \underline A I videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, \texttt SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling \texttt SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (\texttt T-Sig ), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that \texttt SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

Abstract:
Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps.Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3 4.9xmemory reduction and up to 2.8x real hardware speedup over FP32.Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.

Abstract:
Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.

Abstract:
We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world, both leads to consistency in models trained on individual lives, and provides a powerful signal for visual representation learning.

Abstract:
Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from imbalanced allocation between perception and reasoning processes. Building upon recent interpretability findings suggesting a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling, a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average 4.2 percentage-point gain, with less than 1% additional computation and only 9% baseline latency. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.

Abstract:
Complex video reasoning, actually, relies excessively on fine-grained perception rather than on expert (e.g., Ph.D, Science)-level reasoning.Through extensive empirical observation, we have recognized the critical impact of perception.In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only 0.7% performance improvement. Conversely, even minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating enhancing perception, rather than reasoning, is more critical to improve performance.Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile.To achieve this goal, we specially propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve model's fine-grained perception.The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens).Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate APPO consistently outperforms GRPO and DAPO (0.5% 4%). We hope our work provides a promising approach to effectively enhance model's perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.

Abstract:
Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

Abstract:
Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.

Abstract:
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations.We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection. Our code, models, and datasets will be made publicly available.

Abstract:
Existing monocular 3D detectors typically tame the pronounced nonlinear regression of 3D bounding box through decoupled prediction paradigm, which employs multiple branches to estimate geometric center, depth, dimensions, and rotation angle separately.Although this decoupling strategy simplifies the learning process, it inherently ignores the geometric collaborative constraints between different attributes, resulting in the lack of geometric consistency prior, thereby leading to suboptimal performance. To address this issue, we propose novel Spatial-Projection Alignment (SPAN) with two pivotal components: (i). Spatial Point Alignment enforces an explicit global spatial constraint between the predicted and ground-truth 3D bounding boxes, thereby rectifying spatial drift caused by decoupled attribute regression. (ii). 3D-2D Projection Alignment ensures that the projected 3D box is aligned tightly within its corresponding 2D detection bounding box on the image plane, mitigating projection misalignment overlooked in previous works. To ensure training stability, we further introduce a Hierarchical Task Learning strategy that progressively incorporates spatial-projection alignment as 3D attribute predictions refine, preventing early stage error propagation across attributes. Extensive experiments demonstrate that the proposed method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.

Abstract:
We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB--a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson-Gaussian noise model applied to well-exposed raw images.To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes 42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.

Abstract:
This paper aims to present a robust AI-generated image detection framework designed to address performance degradation caused by image compression in online social networks. The key challenges are twofold: 1) compression destroys fragile artifacts that are crucial to existing methods, and 2) it introduces new compression artifacts that interfere with detection. Existing methods typically enhance the compression robustness by collecting original-compression pairs and compression labels. However, the collection and annotation process is highly resource-intensive. To address these issues, we propose a Compression-Robust Phase-Harmonized Transformer, motivated by the observation that phase spectrum remains stable under compression. The framework consists of a phase-harmonized cross-modal interaction module that leverages phase spectrum information for feature fusion, enhancing compression robustness, and a multi-domain modulation adapter that further refines fused features while enabling parameter-efficient fine-tuning. In particular, the framework operates without requiring compression-original data pairs and compression labels. When limited compression labels are available, we introduce a difficulty-aware consistency loss to maximize their utility by prioritizing hard compressed samples during training, further boosting robustness. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches, exhibiting superior robustness against image compression.

Abstract:
Visual-Language-Action (VLA) models report impressive success rates exceeding 95% on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. Current simulation-based robustness evaluations suffer from narrow perturbation coverage, manual design constraints, and coarse-grained analysis that fails to reveal when and how models fail. To address this gap, we propose LIBERO-Plus, a comprehensive, automatic, and fine-grained evaluation framework with controlled perturbations across seven dimensions: object layouts, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. Our systematic analysis of ten state-of-the-art models reveals consistent brittleness beneath apparent competence, with performance dropping from 95% to below 30% under modest perturbations. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

Abstract:
Current text-to-image (T2I) generation models often struggle with prompts that require complex reasoning or specialized knowledge, failing to accurately interpret implicit user intent. To bridge this gap, we introduce T2I-Reason, a large-scale dataset designed to empower text-to-image generation in unified multimodal models (UMMs) with reasoning and knowledge. The dataset contains 120k pairs of text triplet and image. The text triplet consists of (1) an implicit prompt, which requires reasoning or knowledge to decipher its underlying meaning; (2) a reasoning chain, which provides a step-by-step analysis to resolve the implicit prompt's meaning; and (3) an explicit prompt, a clear and straightforward visual description prepared for T2I generation. T2I-Reason is meticulously constructed: 65k samples are dedicated to reasoning, specifically targeting arithmetic reasoning, spatial-attribute relationship reasoning, deductive reasoning (cause to effect), and abductive reasoning (effect to cause). While 55k samples necessitate specialized knowledge, which covers multiple disciplines, spatial-temporal concepts, and entity knowledge. To validate the effectiveness of our dataset, we train a unified multimodal model, Bagel, on our dataset. Results across multiple benchmarks that evaluate the reasoning capabilities of T2I generation demonstrate that our model achieves significant and consistent improvements on both composition and reasoning, confirming that explicit training on intermediate reasoning chains is a pivotal step towards more intelligent unified generative models.

Abstract:
Modern generative models can propose striking 3D vehicle shapes from text and images, but turning these sketches into aerodynamically efficient, regulation compliant designs still requires weeks of high-fidelity computational fluid dynamics (CFD) and manual iteration. As a result, fast 3D generation without trustworthy physics in the loop does little to reduce end-to-end design time. We study how an AI agent can close this loop under a strict CFD budget. We introduce AeroAgent, a vision-physics-decision framework built around a single 3D, editable surface representation for vehicle shapes. A vision module turns text and 2D references into diverse, standardized 3D candidates and supports image-level edits. A physics module, AeroFormer is a geometry-guided Transformer surrogate trained on a large-scale vehicle aerodynamics dataset of roughly 50k CFD simulations; three task-specific heads predict drag Cd, surface pressure, and velocity fields on shared 3D grids. A decision module encodes regulatory size limits and aesthetic constraints as feasibility tests, uses prototype priors and surrogate sensitivities to guide free-form deformation edits, and runs a budget-aware propose evaluate refine loop in which only the final top K shapes are confirmed by high-fidelity CFD. In extensive experiments across five common vehicle classes, running only five propose evaluate refine iterations per vehicle reduces drag by an average of 2-12% and cuts high-fidelity CFD calls by 50-80% compared to baseline workflows, while preserving or improving styling quality.

Abstract:
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and \pi^3 have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than 20x faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

Abstract:
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning.To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning.Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.

Abstract:
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. The datasets and source code will be publicly released.

Abstract:
We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

Abstract:
Test-time adaptation for video understanding, which enables vision-language models (VLMs) to generalize to downstream tasks such as action recognition, has demonstrated substantial value in real-world applications. Existing memory-based methods typically build a visual cache from high-confidence test videos and perform inference via a two-step modality mapping chain, i.e., vision-vision and vision-text. However, due to the asymmetry of the two mappings, the chain exhibits non-transitivity, hindering the generalization of VLMs. To this end, we propose a novel training-free Condensed Dynamic Adapter ConDA for action recognition, which leverages vision-text alignment to guide vision-vision alignment. It first selects semantic patches based on the semantic activation probability obtained from the vision-text alignment (Probability-based Semantic Patch Selection, PSPS), and then adaptively constructs spatial-temporal video tubes based on patch-level visual similarity (Adaptive Tube Construction, ATC). We conduct extensive experiments on seven benchmarks with different backbones and baselines. The quantitative results demonstrate that ConDA is compatible with arbitrary VLM and generalizes well across complex scenarios, such as long-term and egocentric scenarios. In addition, qualitative analyses showcase the interpretability of ConDA in terms of capturing semantic cues.

Abstract:
Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.

Abstract:
Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it "sees", effectively eliciting the visual concept represented by each feature. Results show that Steering offers an scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.

Abstract:
Extending person re-identification (ReID) to a federated scenario has recently drawn attention due to privacy concerns of individuals, but existing methods mostly assume sufficient diversity in pose variations even within a decentralized client. We focus on a more realistic federated-by-camera scenario, where each client corresponds to a single camera and thus captures only a sparse set of poses. To enrich pose diversity, we propose a pose-guided enriched feature learning method that explicitly augments pose-diverse samples. Specifically, a Pose-Extraction Module (PEM) disentangles pose-relevant and pose-irrelevant feature components, where pose-relationship knowledge distillation method helps identify the correct pose and semantic consistency maintenance method preserves semantics even with pose changes. In addition, a compatibility regularization method ensures the PEM to be compatible with the feature space of the global model. By recombining pose-relevant and -irrelevant components across identities via PEM, our method synthesizes pose-swapped features, thereby substantially enhances contrastive learning of ReID models. Extensive experiments on Market1501 and MSMT17 under the federated-by-camera setting demonstrate that the proposed method consistently outperforms federated ReID baselines and their conjunctions with the existing feature augmentation methods; thus achieving state-of-the-art federated ReID performance.

Abstract:
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

Abstract:
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace.In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients--along with a lightweight rescaling step--while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.

Abstract:
3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts. We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sharp structures. Then appearance is coupled with geometry through cycle-consistency depth filtering, which identifies reliable regions to synthesize virtual views that propagate geometric correctness into appearance optimization. Experiments on LLFF, DTU, and Blender show ICO-GS consistently outperforms existing sparse-view baselines, particularly in challenging weakly-textured regions.

Abstract:
Recent unified models have made remarkable strides in generating high-quality images, yet they consistently fail on reasoning-intensive tasks, i.e., solving mazes, assembling tangrams. Intriguingly, we find that vision-language models (VLMs) and large language models (LLMs) can accurately solve these tasks, but cannot generate the corresponding images because they lack a structured visual output interface. This reveals that the core bottleneck is not reasoning capacity, but the lack of a structured interface to translate high-level reasoning into precise visual output. To bridge this gap, we propose using code-structured visual hints (i.e., SVG/HTML) overlays that explicitly encode reasoning steps directly on the image plane. Accordingly, we develop an automatic data construction pipeline that can generate high-quality code-structured hints for existing datasets and train a unified model called Hint2Gen based on FLUX.1 Kontext to condition its generation on such hints. Furthermore, to comprehensively evaluate the effectiveness of our approach, we introduce Reason2Gen, a benchmark comprising 4,000 samples spanning 20 categories across 7 core dimensions, including path connectivity, spatial assembly, etc. Extensive experiments demonstrate that even simply providing such hints as extra inputs--without any retraining--boosts their performance. And our model significantly outperforms all leading open-source/closed-source methods on reasoning-aware generation and editing across all the dimensions.

Abstract:
Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi)--a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.

Abstract:
We propose VIAFormer, a Voxel-Image Alignment transFormer model designed for Multi-view Conditioned Voxel Refinement--the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in large-model, big-data wave.

Abstract:
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action framework for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.

Abstract:
Although progress has been made in LLM-based text-driven motion generation, it still has the limitations of generating fine-grained and semantically consistent motions. These limitations stem from: 1) fine-grained motion quantization errors; 2) mismatches between causal reasoning language and non-causal motion representation; and 3) lack of human preference alignment. To solve them, this paper proposes MoTiGA, a multi-level causal LLM-based text-to-motion generation framework with human alignment. Firstly, MoTiGA employs Causal RVQ-VAE for multi-level causal fine-grained motion representation, then explores iterative residual quantization and causal convolutions to reduce fine-grained motion quantization errors, while preserving the causality as language presentation. Furthermore, the framework incorporates a time-lagged causal prediction strategy, enabling parallel prediction across motion token levels while maintaining temporal dependencies. Finally, to enhance human alignment, we propose Multi-level Hybrid-weighted Preference Optimization (MHPO), which dynamically adjusts semantic similarity weighting and continuous similarity scores. For MHPO, we also release the HumanML3D-R dataset, the first large-scale preference dataset for motion generation, with 101,490 human preference pairs. Evaluations show MoTiGA's superior performance, with an 82.3% FID improvement on HumanML3D and a 64.7% improvement on KIT-ML over other LLM-based methods.

Abstract:
Understanding objects in dynamic 4D environments via natural language is crucial yet underexplored. While existing methods focus on static 3D referring segmentation or open-vocabulary 4D querying, they struggle to ground complex spatio-temporal referring expressions in explicit 4D reconstructions. We introduce Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS), a novel task aiming to jointly identify and segment a target instance across space and time given a referring expression. To tackle this, we propose ST4R-Splat, the first framework for STRS-4DGS. Specifically, our framework incorporates an Instance-Aware 4D Gaussian Referring Field that assigns time-invariant embeddings for robust spatial grounding, and an Instance-Level Temporal State Mapping module that enables view-independent temporal localization directly in feature space. To provide rich spatio-temporal semantic supervision, we develop an automatic, MLLM-based captioning pipeline that generates decoupled spatial and temporal textual descriptions. Evaluated on our newly constructed STRS-4DGS benchmark, our method significantly outperforms adapted state-of-the-art baselines across both time-agnostic and time-sensitive metrics, establishing a strong foundation for language-driven 4D scene understanding.

Abstract:
Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency tokenization and detokenization. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional tokenizers, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.14 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.

Abstract:
Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.

Abstract:
We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks: video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation, and observe strong performances in all settings.

Abstract:
All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DINO-ViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space. Code is available at https://tinyurl.com/github-SING.

Abstract:
Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.

Abstract:
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets.We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search.At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing.During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query.To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation.Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available.

Abstract:
Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames.Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning.Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning.EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding.To train such agents, we design a simple yet effective three-stage learning pipeline--comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO)--that bridges supervised imitation and reinforcement learning.We further construct high-quality datasets for each stage, supporting stable and reproducible training.We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6--12% over general MLLM baselines and a further 1--3% gain over prior adaptive agent methods.

Abstract:
Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.

Abstract:
Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image---a panorama centered at the object---using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.

Abstract:
Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22xspeedup over prior diffusion-based portrait animation models. The code will be publicly available.

Abstract:
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks.

Abstract:
Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific task, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category label. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.

Abstract:
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure.In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level,and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks.We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27 xand DiT training by 6.09 xon 256 x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently.

Abstract:
Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider the federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose our FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces an Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.

Abstract:
Spiking Neural Networks (SNNs) have gained significant attention due to their event-driven computational paradigm, making them promising for neuromorphic computing. In recent years, the integration of SNNs and Transformer architectures has made remarkable progress in various tasks. However, existing spiking self-attention mechanisms predominantly focus on spatial information while neglecting explicit temporal modeling, leading to suboptimal performance. In this paper, we introduce the Temporal Interaction Coefficient (TIC) to analyze temporal dependency patterns in these spatial-only attention mechanisms, revealing their limited temporal interactions and restricted pattern diversity. To overcome this issue, we propose the Multi-Delay Mixer (MD-Mixer), drawing inspiration from time delay mechanisms in the nervous system. Specifically, MD-Mixer introduces multiple temporal delays to perform effective time mixing and facilitate temporally enriched spatial attention. In addition, it can be integrated seamlessly into existing Spiking Transformers as a drop-in replacement while maintaining computational efficiency. Extensive evaluations on static and neuromorphic benchmarks demonstrate that MD-Mixer substantially improves the performance of Spiking Transformers, outperforming existing state-of-the-art (SOTA) methods. This work establishes MD-Mixer as an effective and general solution for temporal modeling in event-driven architectures.

Abstract:
We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Diffusion-based motion generators offer strong realism but often rely on costly guidance for multi-entity control and degrade under strong conditioning. Sketch2Colab instead learns a sketch-conditioned diffusion prior and distills it into a rectified-flow student in latent space for fast, stable sampling. To make motion follow storyboards closely, we guide the student with differentiable objectives that enforce keyframes, paths, contacts, and physical consistency. Collaborative motion naturally involves discrete changes in interaction, such as converging, forming contact, cooperative transport, or disengaging, and a continuous flow alone struggles to sequence these shifts cleanly. We address this with a lightweight continuous-time Markov chain (CTMC) planner that tracks the active interaction regime and modulates the flow to produce clearer, synchronized coordination in human-object-human motion. Experiments on CORE4D and InterHuman show that Sketch2Colab outperforms baselines in constraint adherence and perceptual quality while sampling substantially faster than diffusion-only alternatives.

Abstract:
A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.

Abstract:
Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models and correspondingly, the training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.

Abstract:
While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.

Abstract:
Deep learning models for medical image analysis often act as "black boxes," seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports with each claim explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) Grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods; we will release code and grounding annotations to support future research in trustworthy medical vision-language models.

Abstract:
Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slows inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence.Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement.PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.

Abstract:
The transition of Retrieval-Augmented Generation (RAG) from offline video analysis to online, streaming scenarios presents a set of critical, unexplored challenges. These include the need for on-the-fly semantic segmentation of continuous video, the inherent tension between low-latency processing and high-quality knowledge extraction, and the demand for query-specific temporal reasoning. We propose StreamRAG, a novel framework designed to overcome these hurdles. StreamRAG is built upon three core technical pillars: (1) a Stream Event Segmentation (SES) module that performs real-time boundary detection to chunk the stream into meaningful units; (2) a Token-Reusing Accelerator that drastically cuts down captioning latency by leveraging computational overlap between consecutive frames; and (3) a Dynamic Retrieval Gate that modulates the retrieval scope and strategy based on the query's temporal sensitivity and contextual similarity. Empirical evaluation confirms that StreamRAG establishes a new state-of-the-art, delivering superior accuracy with minimal latency in streaming video comprehension.

Abstract:
Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released.

Abstract:
We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model to integrate scene, object, and instruction information and employ a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. We will release our code publicly.

Abstract:
Mechanistic interpretability seeks circuits that are both sufficient on their own and necessary to a model's behavior under intervention, yet many circuit-discovery pipelines still rely on manually fixed circuit budgets. We propose a procedure that scores edges with sign-preserving integrated gradients and then chooses a single global cutoff by a performance-retention criterion for circuit selection. This makes circuit size a consequence of retained behavior rather than a manually selected budget. On the Mechanistic Interpretability Benchmark, this retention-calibrated thresholding yields circuits that are competitive on CPR/CMD across multiple tasks and models. On our GPT-2 IOI probability-space diagnostic, it improves sufficiency relative to the EAP-IG baseline while also improving the reported necessity diagnostics. Threshold ablations and stability checks further show that the selected operating point yields the best overall balance between self-sufficiency and necessity-oriented diagnostics.

Abstract:
Diffusion transformer models deliver state-of-the-art image synthesis quality but suffer from prohibitively slow iterative sampling. Although reducing sampling steps accelerates inference, aggressive step schedules often distort intermediate features and degrade sample fidelity. We present GeoRK2, a training-free framework that bridges numerical analysis and information geometry. GeoRK2 couples a second-order Runge-Kutta (RK2) integrator with a curvature-aware geometric flow derived from the model's noise predictions, promoting stable feature evolution under manifold-aware integration. By leveraging an empirical feature covariance-induced metric estimated from activation covariances to capture intrinsic feature geometry, GeoRK2 constrains error propagation under large-step integration while preserving structural fidelity. As a fully plug-and-play method, GeoRK2 requires no retraining and is compatible with mainstream pre-trained diffusion transformers. Comprehensive experiments on image generation and text-to-video generation tasks across representative diffusion backbones (e.g., DiT-XL, HunyuanVideo, and FLUX.1-dev) demonstrate that GeoRK2 achieves 4-5x faster inference than baseline frameworks with only marginal perceptual differences (DFID 0.81), confirming its effectiveness and generality.

Abstract:
Although artificial neural networks are often described as brain-inspired, their representations typically rely on continuous activations, such as the continuous latent variables in variational autoencoders (VAEs), which limits their biological plausibility compared to the discrete spike-based signaling in real neurons. Extensions like the Poisson VAE introduce discrete count-based latents, but their equal mean-variance assumption fails to capture overdispersion in neural spikes, leading to less expressive and informative representations. To address this, we propose NegBio-VAE, a negative-binomial latent-variable model with a dispersion parameter for flexible spike count modeling. NegBio-VAE preserves interpretability while improving representation quality and training feasibility via novel KL estimation and reparameterization. Experiments on four datasets demonstrate that NegBio-VAE consistently achieves superior reconstruction and generation performance, and yields robust, informative latent representations for downstream tasks. Extensive ablation studies are performed to verify the model's robustness w.r.t. various components.

Abstract:
Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation---reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36deg to 18.27deg--while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.

Abstract:
Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.

Abstract:
Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to 160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation.

Abstract:
AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.

Abstract:
The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and instance-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available.

Abstract:
Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

Abstract:
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters---including object pose and lighting---together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow contour to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling.

Abstract:
We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network, which leads to distribution shifts when encoding additional PBR channels and hinders subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance. Our project page is available at https://matlat-proj.github.io.

Abstract:
Generating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies--such as open surfaces and intricate internal structures--while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)--a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to 2048^3-a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.

Abstract:
Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

Abstract:
The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model's parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

Abstract:
Dataset Distillation (DD) seeks to create small, synthetic datasets for efficient model training. While diffusion models are powerful generators, their use in DD is hampered by distribution shifts between synthetic and ideal distilled data, leading to suboptimal performance. We identify two critical shifts. First, considering the small capacity of the synthetic data, an optimal synthetic distribution for DD should be a simplification of the real data distribution, rather than replicating the original data's complexity. Second, there is a hazardous empirical deviation in the synthetic dataset from this learned distribution due to the data sampling process. To address these, we introduce a two-stage approach. During diffusion training time, we mitigate the distribution shift by employing an L1 sparsity regularizer, compelling the diffusion model to learn a compact and semantically sparse manifold. Then, during sampling time, we abandon the flawed sequential sampling paradigm and instead synchronously denoises the entire synthetic dataset with distribution regularizers. This framework systematically mitigates both identified distribution shifts. Experiments show our method achieves state-of-the-art performance with superior computational efficiency.

Abstract:
Recent works on vision-language-action (VLA) models have made great progress in exploring action tokenizers that convert continuous control signals into discrete tokens to align with LLM/VLM training paradigms.These approaches typically train a single tokenizer over entire manipulation trajectories, which often comprise multiple distinct skills and thus pose a challenging optimization trade-off.To address this issue, we introduce MoEActok, a novel action tokenizer that employs a mixture-of-experts (MoE) quantizer to produce skill-aware discrete representations for VLA models. MoEActok utilizes a clustering-driven MoE VQ-VAE mechanism in which each expert specializes in a particular skill.The key components are: (a) an action-skill decoupling strategy that uses k-means clustering to group action chunks, aligning clusters having similar skills; (b) a skill-aware training paradigm that augments VLA models with skill-conditioned context, improving skill grounding; and (c) an adapter that projects shared encoder representations into skill-specific latent spaces for specialized quantization, and subsequently harmonize the heterogeneous quantized representations back into a unified space for coherent reconstruction by the shared decoder.We evaluate MoEActok-based VLA models against multiple prior action tokenizer baselines in the RoboTwin and Simpler-Env simulators, and further assess zero-shot transfer on three real-world tasks. Across both simulated and real-world settings, MoEActok-based VLA substantially outperforms existing discrete tokenization methods.

Abstract:
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the "memory", leading to suboptimal consistency.Recognizing that character video generation inherently resembles an "outside-looking-in" scenario. In this work, we propose represent the character's visual attributes through a compact set of anchor frames.This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors.Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.

Abstract:
Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a new framework that mitigates these limitations by harnessing the inherent tracking ability of the video diffusion model, therefore requiring only a single arbitrary frame mask and no structural guidance. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code and dataset to facilitate further research.

Abstract:
Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are entirely human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.

Abstract:
Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge.We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

Abstract:
E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.

Abstract:
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical.

Abstract:
Video Large Language Models (VLLMs) encounter significant computational challenges due to the large volume of visual tokens generated from multiple frames. Existing visual token pruning methods fail to account for the uneven spatiotemporal information density, thus squandering scarce token budgets on regions with low information density. In this paper, we propose a training-free Metadata-guided Token Merging framework (MeToM) that leverages intrinsic video metadata to adaptively allocate budgets and merge visual tokens based on content complexity. Specifically, MeToM exploits residual data from codec metadata as spatial information density cues. It merges less informative regions during tokenization, avoiding redundant encoding and improving the efficiency of the visual encoder. Additionally, MeToM captures temporal variations in information density by utilizing the average Group of Pictures (GoP) packet size to represent scene complexity. This mechanism enables dynamic per-frame token allocation across time, assigning more tokens to content-complex frames and fewer tokens to information-sparse ones. Finally, we merge low-contribution visual tokens via multi-layer attention to reduce prefill FLOPs and the visual KV-cache footprint inside the LLM. Extensive experimental results demonstrate that MeToM outperforms prior state-of-the-art methods. Notably, it achieves a 2.65xinference speedup over the baseline VLLM without sacrificing accuracy.

Abstract:
We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques.

Abstract:
Point cloud anomaly detection is crucial in automated manufacturing, with reconstruction-based diffusion methods emerging as a mainstream solution. However, these approaches still face two major challenges: (1) geometry violation, where random noise perturbations deviate from local surface normals, causing structural distortion; and (2) undistinguished reference regions, where uniformly applied coarse anomaly embeddings during denoising blur normal details and impede accurate anomaly recovery. To address these issues, we propose AARD, a geometry-aligned and anomaly-aware diffusion reconstruction framework. We argue that high-fidelity anomaly detection requires a principled reformulation of the diffusion process: noise should align with geometry to preserve structures, and reconstruction can be better guided by anomaly-aware references to discriminatively recover normal details while correcting defects. AARD progressively aligns noise directions with vertex normals while maintaining vertex-graph consistency, and employs an adaptive transformer that assigns normal references to anomalous regions and input references to normal areas. Experiments on Anomaly-ShapeNet and Real3D-AD show that AARD consistently outperforms state-of-the-art approaches, achieving superior geometric fidelity and robust anomaly localization.

Abstract:
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called Agile Deliberation that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

Abstract:
We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data.

Abstract:
Multimodal large language models (MLLMs) have expanded from vision-language systems to include audio, unlocking new capabilities in cross-modal reasoning and interaction. To address the limitation that existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework, we propose Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench). It systematically evaluates the audio-related capabilities of MLLMs along a three-level cognitive hierarchy: Perception, Reasoning, and Interaction, utilizing 2,451 curated samples and manually annotated multi-turn interaction-level tasks. Experiments using this unified framework reveal significant gaps in existing models at the reasoning and interaction levels, with speech-driven visual question answering (VQA) performance significantly lagging behind the text-image setting. These findings underscore the urgency of enhancing models' handling of long and complex audio and facilitating the transfer of reasoning capabilities from the vision-text to the audio-visual domain.

Abstract:
Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI.

Abstract:
Pixel-level annotations for medical image segmentation are costly and labor-intensive, often requiring expert knowledge. Bounding box labels provide a more scalable alternative but introduce strong box-shaped bias that hampers segmentation quality. We propose WeakMed, a general-purpose weakly supervised segmentation framework that removes the dependence on pixel-level masks while overcoming the structural limitations of box supervision. WeakMed introduces two lightweight, plug-and-play training components: (1) a Mask-to-Box transformation that aligns predicted masks with box annotations to reduce label mismatch and box-induced bias, and (2) a Scale Consistency loss that enforces multi-scale self-supervision to address the ambiguity and instability of weak labels. Both modules are used only during training and impose no inference overhead. Across 9 segmentation tasks, 9 datasets, and 6 imaging modalities, WeakMed consistently surpasses existing weakly supervised methods and achieves performance competitive with fully supervised baselines. These results demonstrate its practicality as a low-cost yet high-quality solution for medical image segmentation.

Abstract:
We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. %CVA integrates complementary data-centric and architectural innovations to enhance contextual robustness. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the false negative caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.

Abstract:
Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse marine organisms and man-made objects for open-vocabulary evaluation. Furthermore, we establish the first underwater open-vocabulary segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive cross-domain evaluation. Alongside, we present Earth2Ocean, a training-free open-vocabulary segmentation framework that transfers terrestrial vision-language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment(CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves over 6+ mIoU improvement on average while maintaining efficient inference.

Abstract:
Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7xspeedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768x1280) videos at 25\,FPS on a single A100 GPU.

Abstract:
Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding--an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLMs with accurate 2D part-level point grounding, offering a more direct alternative to conventional grounding representations. Our method leverages the attention mechanisms inherently present in MLLMs. By synthesizing text-conditioned, grounding-aware queries within intermediate layers via the proposed Q-Synth Module, we capture target-relevant attention patterns and refine them with a lightweight Attention-to-Point Decoder, which converts these patterns into a point-centric heatmap for final prediction. Notably, all original MLLM parameters are frozen, ensuring full preservation of their pre-trained capabilities. Experiments show that our design consistently improves part-level grounding accuracy across datasets and can be seamlessly integrated into any open-source MLLMs.

Abstract:
The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries, struggling with novel manipulations as editing techniques evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, then a Task-Adaptive Prior Injection (TAPI) module converts this coarse prediction into guiding prompts that steer the MAE decoder to amplify reconstruction failures in suspicious regions, enabling precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.

Abstract:
Tracking targets with high-speed and nonlinear motion under occlusion remains challenging due to spatial appearance deprivation and temporal trajectory fragmentation caused by missing visual cues. Existing methods typically either dynamically update templates to maintain appearance similarity or employ autoregressive models to predict targets from historical trajectories. However, these methods are ineffective under severe occlusion owing to template contamination and limited frame rates for complex motion. In this work, we observe that occlusion inherently degrades the spatial matching mechanism, highlighting the importance of temporal cues. Meanwhile, event cameras with microsecond-level temporal resolution provide transient dynamic cues that facilitate modeling nonlinear motion. In light of this, we propose EvoTrack, an occlusion-robust tracking framework via event-derived transient evolution, which comprises event-based motion autoregression and target-aware appearance matching. Specifically, for motion autoregression, the fine-grained timestamps of events naturally encode the target's direction and speed, motivating a bidirectional motion consistency that constrains inter-frame displacement prediction under nonlinear motion. For appearance matching, we adopt a Gaussian masking strategy to simulate occlusion degradation, guiding the model to focus on target regions and learn invariant representations. Furthermore, we build a pixel-aligned Frame-Event tracking dataset with higher spatial resolution and explicit occlusion labels. Extensive experiments demonstrate the effectiveness of EvoTrack in challenging occlusion scenes.

Abstract:
Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.

Abstract:
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference.Moreover, most training paradigms often degrade the perceptual representations of the Vision-Language backbone, resulting in overfitting and poor generalization to downstream tasks.In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture.We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM.Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO.In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods.We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.

Abstract:
We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam.Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage.To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics.These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.

Abstract:
Generative data augmentation has created new opportunities for improving image recognition models. Despite these advances, most existing augmentation methods process images holistically, without accounting for the uneven distribution of discriminative information in classification tasks, where foreground subjects typically contain richer category-relevant signals than background context. Ignoring this imbalance can introduce unintended semantic shifts in generated samples, thereby weakening the model's ability to capture the intrinsic ontology of the target object. In fact, human perception operates in a similar manner: subject identity remains stable, while contextual variations are naturally tolerated as long as overall coherence is preserved. Inspired by this observation, we propose OntoAug, an ontology-oriented data augmentation framework that explicitly distinguishes between subject and environment. OntoAug separates foreground and background through structured layout control and guides diffusion models to generate samples with consistent subjects and diverse contextual variations. By explicitly introducing foreground masks and diverse background prompts during the generation process, the framework enhances both semantic fidelity and diversity in the synthesized data. Extensive experiments show that OntoAug achieves strong performance across multiple tasks, including image classification, few-shot learning, weakly supervised object localization (WSOL), and large vision-language model reasoning.

Abstract:
Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely on extensive bimanual datasets or fail to effectively leverage pre-trained unimanual policies. To address this limitation, we propose EnergyAction, a novel framework that compositionally transfers unimanual manipulation policies to bimanual tasks through the Energy-Based Models (EBMs). Specifically, our method incorporates three key innovations. First, we model individual unimanual policies as EBMs and leverage their compositional properties to compose left and right arm actions, enabling the fusion of unimanual policies into a bimanual policy. Second, we introduce an energy-based temporal-spatial coordination mechanism through energy constraints, ensuring the generated bimanual actions are both temporal coherence and spatial feasibility. Third, we propose two different energy-aware denoising strategies that dynamically adapt denoising steps based on action quality assessment. These strategies ensure the generation of high-quality actions while maintaining superior computational efficiency compared to fixed-step denoising approaches. Experimental results demonstrate that EnergyAction effectively transfers unimanual knowledge to bimanual tasks, achieving superior performance on both simulated and real-world tasks with minimal bimanual data.

Abstract:
Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.

Abstract:
In industrial settings, classification of 3D CAD models are critical for efficient manufacturing. However, the limited availability of annotated CAD models presents an obstacle to achieving rapid adaptation in few-shot part classification scenarios. In this paper, we propose a hybrid graph representation and a pre-training and graph prompt framework for B-rep few-shot classification. Specifically, hybrid graph representation captures comprehensive and multi-level structural information of B-rep models by constructing local topology graph, global parallel graph and regional association hypergraph. A hierarchical graph network then fuses component-level structures with topological details in the hybrid graph. Reinforcement-augmented contrastive pre-training produces robust universal representations while in-place perturbation reduces training time. Structure-aware graph prompts finally produce node-specific cues, enabling few-shot B-rep part classification without heavy fine-tuning. Experiments on the TraceParts-11and FabWave-31 datasets show that our method outperforms existing general-purpose approaches. This work provides an efficient and state-of-the-art solution for few-shot B-rep part classification.

Abstract:
The latest progress in text-to-3D generative models makes it possible to generate high-quality 3D content. Recent text-to-3D large model have achieved remarkable breakthroughs in multi-view consistency. However, their effectiveness is often affected by inherent biases, resulting in sensitivity to design settings such as prompt format, leading to difficulty understanding complex prompts. To help text-to3D generative models understand more diverse prompts, we propose a framework to localize and mitigate the bias in the current text-to-3D large model. Specifically, we first use the existing model to generate 3D content and use the quality evaluation model to identify the cross-modality bias. Then, we use the predicted quality score to quantify the contribution of the prompt text to the bias. Finally, in order to reduce these biases, we construct diverse pairwise examples to help the current text-to-3D large model construct unbiased visual-text connections. The experiment shows that our method has achieved competitive results and can provide higher quality, more diverse 3D content compared to existing methods.

Abstract:
Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.

Abstract:
Histopathology whole slide images (WSIs) are gigapixel images that present significant challenges in generating effective representations that capture both local histological features and their global spatial organization. Current pathology foundation models focus primarily on local patch-level features while neglecting the complex spatial relationships that pathologists rely on for diagnosis and prognosis. We introduce TopoSlide, a novel self-supervised representation learning framework that leverages persistent homology from topological data analysis to capture the global spatial organization of tissue architecture in WSIs. Our method decomposes slides into histologically meaningful clusters using patch-level embeddings, then characterizes their spatial arrangement through topological descriptors. We train a vision transformer to predict cluster topology from slide-level embeddings using a conditional multi-task objective that integrates local patch features with their topological attributes. Evaluated across lung adenocarcinoma and breast cancer cohorts, TopoSlide achieves superior performance improving histologic pattern retrieval by up to 15% in majority voting macro F1 score, and competitive survival and gene mutation predictions, while training on only hundreds of slides compared to hundreds of thousands for foundation models. Our results demonstrate that topology-aware learning provides a powerful inductive bias for pathology representation learning, enabling both improved performance and novel topology-based conditional retrieval capabilities for clinical applications. Our code and models are publicly available.

Abstract:
We introduce two new benchmarks REST and REST+ (Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.

Abstract:
RewardFlow is a zero-shot, training-free framework for text-guided image editing and generation based on reward-guided Langevin dynamics. We steer pretrained diffusion and flow-matching models at inference time using a diverse set of differentiable rewards, and control their influence with a prompt-aware adaptive policy that parses the text instruction, infers edit intent, and dynamically adjusts update steps. Our design includes a differentiable VQA-based reward for fine-grained semantic supervision and a SAM-guided reward for precise, localized edits with minimal leakage. Across standard image editing and compositional generation benchmarks, RewardFlow achieves state-of-the-art zero-shot edit fidelity and compositional alignment.

Abstract:
Crime anticipation enables proactive public safety interventions, yet existing video security systems remain largely reactive, unable to detect precursors of crime. While current visual language models (VLM)-based video understanding methods show promise in high-level reasoning, they are not designed to explicitly model the spatio-temporal causal relationships essential for anticipating crimes. We address this limitation by two causal-driven components. First, we develop the Spatio-Temporal Causal Reasoning Crime (STCRC) dataset, a hierarchical dataset comprising 73K samples across five progressive causal reasoning tasks, facilitating criminal precursors learning. Second, we propose the Spatio-Temporal Causal Hypergraph (STCH), a streaming module that transforms implicit entity dynamics into explicit causal structures to enhance causal reasoning for crime in VLMs. By combining these two components, our framework advances real-time crime anticipation, achieving improvements in anticipatory tasks: a 70.7% relative improvement in crime classification, a 10.1% in crime detection, and a 3.7% reduction in temporal prediction error.

Abstract:
Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts.

Abstract:
Recently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistent issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase. (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress hallucinations and enhance cross-view consistency. To further promote consistency across views, we propose a Cross-view Semantic Appearance Alignment strategy that enhances multi-view consistency by establishing dynamic geometric associations between the same features from different viewpoints. Extensive experiments demonstrate that Thoughtful3D significantly improves the quality and consistency of generated 3D assets.

Abstract:
Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image).In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes.Notably, our proposed Hilbert-Geo is also applicable to plane geometry.To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3% in SolidFGeo2k and 84.1% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.

Abstract:
Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a "drainage node" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 10% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.

Abstract:
The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

Abstract:
The object re-identification (ReID) task aims to recognize the same individual object across diverse viewpoints and sensing conditions.Although person and vehicle ReID have achieved remarkable success, most existing methods are built on the assumption that training and testing data come from the same object category.This constraint requires separate models for each category, which limits scalability and generalization.To address this limitation, we introduce Object-Generalized Re-Identification (OG-ReID), a new paradigm that learns unified identity representations transferable across different object categories.Unlike conventional domain generalization that focuses on appearance variations within a single category, OG-ReID deals with category shifts caused by intrinsic structural differences in identity cues. To achieve this goal, we introduce the Meta-Generalized Object Re-Identification (MGOR) framework, which treats meta-learning as semantic distributional regularization, exposing the model to controlled category shifts so that invariance emerges as an equilibrium between semantic diversity and identity discrimination.Extensive evaluations on more than 100 unseen object categories from multiple domains show that MGOR outperforms existing ReID approaches without any target-domain adaptation, advancing toward universal identity perception beyond domain and category boundaries.

Abstract:
As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. Although several studies have acknowledged the presence of noisy pairs, little work has explored how to endow neural networks with robustness against such noise. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we respectively estimate the reliability of each training pair by assigning a weight to each pair based on the values of the loss. Further, we respectively design a new self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present an enhanced triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially in high noise rates.

Abstract:
We present ShapeAR, a novel autoregressive latent diffusion framework that decomposes raster images into editable, artist-like vector shape layers. Unlike conventional raster-to-SVG methods that rely on boundary tracing or joint path optimisation, ShapeAR generates non-overlapping RGBA shape layers directly in latent space via flow-matching diffusion. To scale generation to complex scenes with many shapes, we formulate the process auto-regressively, conditioning each step on both the input image (global context) and partial composition of previously generated layers (local context). In addition, we propose geometry-aware evaluation metrics that quantify aesthetic and structural quality of the generated shapes.

Abstract:
High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.

Abstract:
Causality--referring to temporal, uni-directional cause-effect relationships between components--underlies many complex generative processes, including videos, language, and robot trajectories.Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context.In this paper, we show that the causal computation in these models is separable from the multi-step denoising process.Through systematic probing of autoregressive video diffusers, we uncover two key regularities:(1) early blocks produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and(2) deeper blocks exhibit sparse cross-frame attention and primarily perform intra-frame rendering.Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder.Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that CSD significantly improves throughput and latency while matching or surpassing the generation quality of strong causal diffusion baselines.

Abstract:
Spiking Neural Networks (SNNs) have attracted increasing attention for their biologically inspired temporal dynamics. As their applications expand, understanding their robustness has become an important research focus. However, little is known about how the intrinsic temporal properties of SNNs affect robustness. In this work, we revisit SNN robustness from an information-theoretic perspective and reveal the pivotal role of temporal dynamics. We establish a theoretical link between robustness error and the mutual information (MI) between inputs and latent representations along the temporal dimension, grounded in the information bottleneck principle. Through an analysis of spike-based information transmission, we show that temporal dynamics inherently compress MI, thereby tightening the robustness error bound. Building on this insight, we propose a Temporal Mutual Information (TMI) regularizer that explicitly exploits temporal characteristics to enhance robustness. Extensive experiments on CIFAR-10, CIFAR-100, DVS-CIFAR10, Tiny-ImageNet, and ImageNet demonstrate that our method consistently improves SNN robustness across various architectures and attack settings.

Abstract:
Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side--either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination--thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

Abstract:
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

Abstract:
Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.

Abstract:
Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: github.com/Jun-Wei-Zeng/ET-Turb.

Abstract:
Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip transitions to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of \tfrac 1 n or \tfrac 2 n+1 , depending on whether the conservative or aggressive mode is used, where n denotes the number of devices. Empirically, DRiffusion attains 1.4x-3.7x speedup across multiple diffusion models while incur minimal degradation in generation quality: on MS-COCO dataset, both FID and CLIP remain largely on par with those of the original model, while PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43, respectively. These results verify that DRiffusion delivers substantial acceleration and preserves perceptual quality.

Abstract:
Open-vocabulary object detection (OVD) aims to detect objects described by arbitrary text, but most existing methods operate at a coarse category level and struggle with fine-grained, attribute-sensitive queries. We address this from both model and data perspectives. We propose a Semantic-Retrieval-Augmented Detector (SRA-Det) that uses an attention-based module to retrieve multiple semantic facets from token-level text features, and a soft-min matching rule that behaves like a differentiable logical AND over these facets, ensuring that all key attributes are satisfied. In parallel, we introduce an automatic attribute-augmented data pipeline that uses an LLM to generate category-specific visual attributes and a dual CLIP-based similarity check to verify them at the instance level. With a Swin-T backbone, our approach achieves 54.9 mAP in the zero-shot setting on FG-OVD and 40.4 AP on LVIS, establishing strong fine-grained and general OVD performance.

Abstract:
Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.

Abstract:
Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable, modular, and evolvable framework that reframes multimodal misinformation detection (MMD) as structured, concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

Abstract:
The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems of the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.

Abstract:
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.

Abstract:
Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because "geometric reconstruction" paradigms mistake deliberate artistic intent--such as strategic zero-plane shifts for "pop-out" effects and local depth sculpting--for data "noise" or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.

Abstract:
The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, considering it a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity capability of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.

Abstract:
The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29x training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45x acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-6x training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping.

Abstract:
High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.

Abstract:
Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking AI-generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems.

Abstract:
Retrieval-augmented generation (RAG) has emerged as a critical paradigm for grounding Multimodal Large Language Models (MLLMs) in external knowledge. Recent GraphRAG methods introduce structured entity-relation graphs to improve retrieval and reasoning. However, they remain limited by treating knowledge graphs as static data structures built offline and queried in a single pass. This static paradigm misaligns with the interactive, iterative nature of knowledge-intensive reasoning, creating three bottlenecks: (i) text-centric fragmentation that impedes cross-modal reasoning, (ii) frozen structures unable to incorporate new evidence or correct errors, and (iii) rigid single-pass retrieval without adaptive refinement. To overcome these limitations, we introduce EvoGraph-R1, a self-evolving GraphRAG framework that reconceptualizes knowledge graphs as dynamic environments shaped through agent interactions.We formulate retrieval as a Markov Decision Process (MDP) where the agent observes the graph state and executes actions to query (GraphRetrieve), expand (WebSearch), refine (GraphEdit), or terminate (Answer) the reasoning. These actions reshape the hypergraph structure and generate feedback signals that guide subsequent evolution.Through this closed loop, the hypergraph evolves by integrating new evidence, correcting errors, and refining structure to support multi-hop reasoning. Experiments on multimodal VQA and text QA benchmarks demonstrate substantial improvements over existing RAG baselines in accuracy, coverage, and traceability, establishing self-evolving knowledge graphs as a fundamental paradigm across modalities.

Abstract:
Vision Transformers (ViTs) have achieved remarkable progress in visual and multimodal tasks, yet their deployment remains costly. Token-adaptive methods reduce FLOPs through dynamic depth computation, but they face two limitations: (1) Global attention overemphasizes highly similar foreground regions, causing token-adaptive modules to assign the deepest computation to semantically weak foreground tokens while prematurely exiting edge tokens rich in structural cues (as shown in Fig. 1); (2) Although token-adaption lowers FLOPs, it still relies on large parameter sets, and deep-layer weights remain underutilized due to early token exit. Parameter sharing could address redundancy but is difficult to apply in ViTs, where hierarchical abstraction typically requires diverse transformations. To address these issues, we propose Edge-RecViT, an Edge-Adaptive Dynamic Recursive Vision Transformer that integrates an edge-aware token-adaptive ranker with a recursive transformer using fully shared parameters in its hidden layers. Edge-RecViT dynamically allocates computation based on semantic richness: structurally informative edge tokens receive deeper refinement, whereas redundant low-information tokens exit early. Extensive experiments show that Edge-RecViT provides an excellent trade-off among accuracy, FLOPs, and parameter efficiency. On ImageNet-1K, it matches DeiT within 0.3% Top-1 accuracy while reducing FLOPs by 30.5% (35.1 to 24.39 GFLOPs). At the Base level, parameter drops from 86M to 23.21M with higher accuracy than ViT-Base; compared with ViT-Large, parameters are reduced by 93% while maintaining superior accuracy.

Abstract:
Large Language Model (LLM) agentic systems solve complex tasks through coordinated workflows, but designing them remains labor-intensive. The Agentic Supernet paradigm automates this by optimizing a probabilistic architecture space, yet suffers from critical evaluation instabilities: absolute performance scores entangle architectural merit with query difficulty, while single-execution protocols capture execution randomness rather than true capability. These instabilities lead to unreliable search dynamics where simple queries inflate weak designs and challenging queries suppress strong ones.We introduce RAAS (Robust Architecture Adaptive Search), which establishes more stable and fair evaluation through two synergistic mechanisms. Contextual Architecture Orchestration (CAO) disentangles quality from task difficulty by evaluating cohorts of candidate architectures on identical queries, deriving context-aware merit signals through peer-group comparison. Multi-Trial Assessment Synthesis (MTAS) reduces execution variance by aggregating performance across multiple independent trials, producing statistically robust capability estimates.Together, these mechanisms provide more reliable signals for architecture discovery. Experiments on six benchmarks spanning mathematical reasoning, code generation, and one multi-step tool-use benchmark show that RAAS consistently improves over strong baselines, improving HumanEval pass@1 from 92.23% to 96.31% and MATH accuracy from 52.08% to 60.87%, while maintaining practical efficiency. These results suggest that robust evaluation is a useful ingredient for agentic architecture search in the studied settings.

Abstract:
Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.

Abstract:
With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among flow models' accelerating methods, ReFlow has been somehow overlooked although it has theoretical consistency with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method unlocks the potential of ReFlow-based distillation by guiding the student along teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during the training and propose Online Trajectory Alignment(OTA) to resolve it. Then, we introduce a adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely-used \texttt FlowMatchEulerDiscreteScheduler that largely degrades few-step inference quality. Our experiment result on SD3 demonstrates our method's efficacy.

Abstract:
Text-guided motion generation in 3D scenes has advanced the synthesis of human-scene interactions, contributing to embodied AI, scene understanding, and virtual agent simulation. While recent studies have begun exploring multi-agent scenarios, achieving temporally synchronised interactions among multiple agents remains an open challenge. Existing methods are often limited in flexibility and scalability when handling diverse interaction contexts.We present a method that enables synchronised multi-agent interaction using a single-agent motion synthesis model through two key components: a text-guided dependency-aware story planner and a temporal synchronisation module. The story planner interprets natural language instructions into structured event sequences with temporal dependencies. Our synchronisation module, built upon time-warping control and diffusion posterior sampling, aligns interaction timing across agents without retraining.Experimental results demonstrate that the proposed framework effectively models temporal dependencies and causal order between events. Evaluations across diverse interaction types show improved temporal alignment and coherent multi-agent motion generation consistent with textual instructions.

Abstract:
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

Abstract:
Intricate correlations among atomic actions and inherent visual confounders in long-term action recognition (LTAR) contribute to the persistent challenges in this domain. While methods based on vision-language models that employ label text for supervision offer potential for handling visual confounders, their reliance on statistical correlations rather than causal mechanisms introduces two vulnerabilities: (1) spurious alignments with non-causal co-occurring visual features during cross-modal interaction, and (2) misinterpretation of codependencies among actions. To address these limitations, this paper introduces Progressive Cross-Modal Causal Intervention (PCMCI). PCMCI first mitigates co-occurrence hallucination via causal intervention grounded in optimal transport theory. Subsequently, an action relation-aware mechanism counters the backdoor path induced by codependency illusion, enabling the derivation of deconfounded text embeddings. Finally, these deconfounded embeddings serve as mediator to implement front-door adjustment to remove visual confounders. This progressive causal intervention framework facilitates learning robust representations for LTAR. Experiments on three long-term action benchmarks demonstrate the effectiveness of the proposed model.

Abstract:
Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.

Abstract:
We propose a probabilistic discrepancy learning approach for roadside LiDAR scene completion (PDL). Conventional methods focus on object-level completion and scene completion from ego-vehicle viewpoint. These methods struggle to cope with long-term or severe occlusions caused by roadside sensors with fixed viewpoints. To address this issue, we compensate for occlusion roadside point clouds by introducing external visual information. Specifically, the PDL is mainly divided into probabilistic pose discrepancy minimization and scene discrepancy learning. We employ probabilistic pose discrepancy minimization to correct noisy poses from vision-based detectors, while utilizing a diffusion model within scene discrepancy learning for robust full-scene completion. Furthermore, we design regional and global sampling discrepancy learning losses to achieve robust and efficient training. We conducted extensive experiments on the V2X-Seq and TUMTraf-V2X roadside datasets. Results demonstrate that PDL achieves state-of-the-art performance, with average reductions of 14.5% in chamfer distance (CD) and 6% in 3D Jensen Shannon divergence (JSD) compared to existing methods.

Abstract:
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.

Abstract:
The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectifies residual distortions from the coarse result and translates it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world images.

Abstract:
Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

Abstract:
The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called \texttt SPDMark (pronounced `SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of \texttt SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.

Abstract:
Multimodal large language models (MLLMs) have demonstrated significant progress in semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on more complex tasks involving mathematics and logic. However, their proficiency in tasks requiring both fine-grained visual understanding and spatial reasoning remains underexplored. To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities. ReasonMap encompasses high-resolution transit maps from 30 cities and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Our comprehensive evaluation of 16 popular MLLMs reveals a counterintuitive pattern: among open-source models, base variants outperform their reasoning-tuned counterparts, whereas the opposite trend is observed in closed-source models. Further analysis under the visual-masking setting confirms that strong performance necessitates direct visual grounding, rather than relying solely on language priors. We further establish a training baseline with reinforcement fine-tuning, providing a reference for future exploration. We hope this benchmark study offers new insights into visual reasoning and helps investigate the gap between open- and closed-source models.

Abstract:
Humans naturally allocate more time before acting when handling complex tasks in the physical world. This paradigm has recently led to remarkable advances in boosting Large Language Models (LLMs) on complex tasks in digital domains. However, the potential of test-time computing remains largely unexplored for robotic foundation models that interact with the physical world. In this work, we propose FM-Steer, a test-time computing framework that augments flow-based Vision-Language-Action (VLA) generalist policies with value-guided sampling and cascaded action denoising, enabling stronger control performance and real-time action rates for dexterous robot manipulation. FM-Steer first introduces an intermediate flow verifier to estimate state-action values for candidate actions. At test time, the policy iteratively samples multiple noisy action proposals and retains the one with the highest predicted value, yielding value-aligned, high-quality actions without retraining. To satisfy the stringent frequency demands of robot control, FM-Steer further introduces cascaded action denoising, decoupling expensive value-guided sampling from fast action refinement. A lightweight Lite-Flow denoiser asynchronously takes the selected high-value noisy action and rapidly denoises it to produce the final control signal, enabling fluid, high-rate execution. During deployment, the intermediate verifier operates at a low frequency to provide value-guided sampling, while the Lite-Flow denoiser continually processes selected candidates to maintain real-time control. Extensive experiments demonstrate that FM-Steer scales flow-based VLA models effectively at test time and achieves state-of-the-art performance across diverse simulation benchmarks and real-world dexterous robotic tasks. The source code is available on the project page: https://hume-vla.github.io.

Abstract:
Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels.Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2\pm0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7\pm0.4 AP, surpassing both LayoutLMv3 with standard semi-supervised learning (89.1\pm0.4 AP, p=0.02) and matching UDOP [??] (89.8 AP) which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text heuristics.Total system cost includes \12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.

Abstract:
Despite Multimodal Large Language Models (MLLMs) having shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations, thus addressing the issue of hallucinations. It imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, thus ensuring correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.

Abstract:
In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity.To address these issues, we propose Task-Aware Image Signal Processor (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.

Abstract:
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Abstract:
Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability. The source code will be available.

Abstract:
The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.

Abstract:
Continuous-time neural networks provide adaptive dynamics, but rely on a single hidden state to encode both fast input fluctuations and longer-term context. This shared representation forces rapidly changing inputs to overwrite slower contextual signals, causing the model to lose past information as new observations arrive. In contrast, biological perceptual systems maintain stable behaviour under evolving sensory input by integrating ongoing signals with stored associative patterns rather than relying on a single evolving state.Motivated by this distinction, we study a simple coupling of Liquid Time-Constant Networks (LTCs) with a Modern Hopfield Network (MHN) that serves as a content-addressable memory. At each time step, the liquid state is projected into a query, the MHN retrieves a memory vector, and the two representations are concatenated before a readout layer. We analyse this coupling under standard norm and Lipschitz assumptions and show that the combined representation remains bounded. We further show that the retrieval map contracts gradients for parameters upstream of the memory query, which provides a mechanism for reducing curvature in the loss landscape.On public time-series benchmarks, the coupled LTC-MHN model improves mean accuracy by 2.3% over competitive recurrent and continuous-time baselines and reduces the estimated Hessian trace by about an order of magnitude relative to a standalone LTC encoder, with the largest gains on classification tasks and competitive performance on a regression task. Qualitative analyses of training curves, loss landscapes, and latent embeddings support the interpretation that Hopfield retrieval smooths optimization and encourages more compact, linearly separable class manifolds. Code will be released upon publication.

Abstract:
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.Unlike existing methods that rely on chunk-by-chunk or auto-regressive model with diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events.We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with a bi-directional attention instead of casual attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) utilize a continues time-varying way to introduce text conditioning.With these improvements, we demonstrate in the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available.

Abstract:
Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties, yet dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.

Abstract:
Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79x speedup on FLUX.1 and 4.67x speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.

Abstract:
3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refiningunder-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians arerepeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yieldssuboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scenegeometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimatethe rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian withwell-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove theneed for densification. Unlike methods that rely on sparse keypoints, our dense initialization ensures uniform detailacross the scene, even in high-frequency regions where other methods struggle. Moreover, since all splats are initializedin parallel at the start of optimization, we remove the need to wait for densification to adjust new Gaussians.EDGS reaches LPIPS and SSIM performance of standard 3DGS significantly faster than existing efficiency-focused approaches. When trained further, it exceeds the reconstruction quality of state-of-the-art models aimed at maximizing fidelity. Our method is fully compatible with other acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.

Abstract:
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, action recognition has long been a quintessential task for video models. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,087 distinct actions from 38 domains. VLMs struggle immensely on VideoNet, with Gemini 2.5 Pro performing only 15.8 percentage points better than random chance. To improve model performance we provide in-context demonstrations, but only see a 3% boost in VLM performance compared to a 13% increase in non-expert human accuracy, suggesting that VLMs are poor few-shot learners. At last, we collect a large-scale training dataset containing nearly 500k video question-answer pairs. Fine-tuning an open-weight 4B model on our data, we surpass all Gemini models on the VideoNet benchmark. We release all of our data, inviting the community to explore new techniques to improve domain-specific action recognition capabilities and few-shot learning in video models.

Abstract:
Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.

Abstract:
Low-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed local mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies in the factor space. Moreover, we introduce an implicit sparse representation to impose sparsity constraint while avoiding additional optimization problems. As a result, the proposed SADTR can achieve a more accurate low-rank representation. In theory, we provide a detailed analysis to demonstrate the recoverability of SADTR. To validate the effectiveness of SADTR, we apply it to three representative high-order data recovery tasks. Experimental results demonstrate that SADTR consistently outperforms existing state-of-the-art LRTR methods.

Abstract:
With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding i.e. image-to-video transfer learning has become a dominant paradigm. To achieve superior performance, it raises as an effective strategy among recent advances to employ Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams.To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling.To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts.Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization.

Abstract:
Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.

Abstract:
Feature matching is a critical task in geometric perception, yet existing methods often struggle under domain shifts and illumination changes due to reliance on visual-only learning and expensive 3D supervision. In this paper, we present TextFM, the first language-guided feature matching framework that incorporates domain-invariant semantic information from vision-language models (VLMs). Built upon a detector-free architecture, TextFM leverages textual embeddings as instance-level queries to provide global semantic context during coarse-level matching, enhancing robustness in challenging scenarios such as textureless surfaces and cross-domain shifts. Additionally, we integrate illumination-invariant physical priors and apply Low-Rank Adaptation (LoRA) to efficiently fine-tune Vision Foundation Models (VFMs) for more robust visual feature extraction. Extensive experiments on outdoor and indoor datasets show that our method outperforms other state-of-the-art methods. In addition, we contribute a synthetic day-night matching benchmark for rigorous evaluation under extreme lighting conditions. Together, our method and dataset establish a strong foundation for robust and generalizable feature matching under real-world constraints.

Abstract:
3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present Gau-Occ, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing. To enhance geometric completeness, a learned LiDAR Completion Diffuser (LCD) trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors. To further integrate multi-view image semantics, we introduce Gaussian Anchor Fusion (GAF), a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By constructing locally aggregated Gaussian descriptors that capture spatial consistency and semantic discriminability, GAF facilitates accurate feature association across modalities. Through anchor-driven refinement of Gaussian attributes, Occ-GS supports detailed 3D occupancy prediction. Extensive experiments across challenging benchmarks demonstrate that Occ-GS achieves state-of-the-art performance.

Abstract:
Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery.Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks, consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction. Code is available on our project page.

Abstract:
Retrieval-augmented diffusion models (RAG-DMs) have been increasingly deployed across applications, reflecting a broader trend of adopting retrieval-augmented pipelines in AI agent systems and the emerging OpenClaw framework. Despite the success, their trustworthiness remains underexplored. Existing backdoor attacks focus on either manipulating the generation phase or the retrieval phase under the white-box setting, which suffer from knowledge conflicts between retrieved images and user prompts. To bridge this gap, we propose a novel red-teaming approach JOB, which is the first jointly optimized backdoor attack tailored to black-box RAG-DMs. Specifically, JOB poisons the knowledge base with a small number of target class images and learns a trigger through multi-objective optimization, steering retrieval toward poisoned images and aligning the generated outputs with the target class, while preserving benign performance. Experiments show that JOB effectively attacks black-box RAG-DMs, achieving high success rates and outperforming state-of-the-art baselines.

Abstract:
First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforming existing academic and commercial models by receiving about 0.2 PickScore and 0.3 VLM score improvement against these competitors. The project is at ffp-300k.github.io.

Abstract:
Space-time video super-resolution (STVSR) is a task aimed at simultaneously upsampling a video in both spatial and temporal dimensions. Previous studies on STVSR have primarily focused on task-specific architectures and modeling paradigms, while effective pretraining strategies remain underexplored. In this paper, we propose a pseudo-temporal space-time reconstruction pretraining framework for STVSR networks that enables effective use of image datasets, which naturally provide strong spatial cues. Each training sample is constructed by duplicating a single image into a pseudo-temporal video and independently zero-filling random pixel regions across its frames. Instead of designing a separate pretraining module, we pretrain the STVSR network on a task aligned with its core objectives of spatial restoration and cross-frame aggregation. The model learns to reconstruct clean, higher-spatio-temporal-resolution outputs from degraded, pseudo-temporal inputs, with a modulation factor encouraging greater focus on difficult regions. Extensive experiments show that our simple pretraining significantly improves STVSR performance and outperforms existing video representation learning approaches. We note our method is effective even when pretraining and finetuning with a limited quantity of data.

Abstract:
We present BioCoach, a biomechanics-grounded vision-language framework for fitness coaching from streaming video. BioCoach fuses two signals, visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision-biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.

Abstract:
Dynamic multimodal fusion methods lack robust theoretical guidance for handling modal conflicts and inconsistent data quality. While recent theoretical work has associated modal weights with scalar quantities such as loss or confidence, this paradigm struggles to comprehensively address the risks caused by inconsistent predictive distributions across modalities. In this paper, we propose a Conflict-driven Risk Minimization (CoRiM) dynamic fusion paradigm. Specifically, we redefine dynamic fusion as a principled, per-sample, direct risk minimization task. To this end, we first design a novel, differentiable Modality Conflict Risk (MCR) function, \mathcal R (w), which quantifies risk by directly modeling fused uncertainty and inter-modal consistency. Second, we identify that minimizing \mathcal R (w) is fundamentally a non-convex constrained optimization problem over the probabilistic simplex. To efficiently solve this specific challenge, we innovatively introduce the projection free Frank-Wolfe (FW) algorithm, as it is perfectly suited for optimization on the simplex.We prove that our designed \mathcal R (w) possesses L-smoothness, which provides theoretical guarantees for the convergence of the FW algorithm on our non-convex objective. Extensive experiments on multiple benchmark datasets demonstrate that CoRiM outperforms current state-of-the-art methods in high-conflict and noisy environments, validating the robustness of our method.

Abstract:
Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis, charting a path toward efficient and scalable 4D generation.

Abstract:
We propose StreamVLO, a streaming visual-LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly used autonomous driving datasets, reducing errors by 19% t_rel and 22% r_rel on KITTI, and by 18% ATE and 16% RPE on Argoverse, while remaining suitable for real-time deployment.

Abstract:
Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches fall into two categories: full-sequence diffusion models with bidirectional generation, which limit temporal causality and real-time applicability, and autoregressive models, which are vulnerable to instability and error accumulation. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.

Abstract:
Bridging the sim-to-real gap is important for applying low-cost simulation data to real-world robotic systems. However, previous methods are severely limited by treating each transfer as an isolated endeavor, demanding repeated, costly tuning and wasting prior transfer experience.To move beyond isolated sim-to-real, we build a continual cross-task sim-to-real transfer paradigm centered on knowledge accumulation across iterative transfers, thereby enabling effective and efficient adaptation to novel tasks. Thus, we propose GeCo-SRT, a geometry-aware continual adaptation method. It utilizes domain-invariant and task-invariant knowledge from local geometric features as a transferable foundation to accelerate adaptation during subsequent sim-to-real transfers. This method starts with a geometry-aware mixture-of-experts module, which dynamically activates experts to specialize in distinct geometric knowledge to bridge observation sim-to-real gap. Further, the geometry-expert-guided prioritized experience replay module preferentially samples from underutilized experts, refreshing specialized knowledge to combat forgetting and maintain robust cross-task performance.Leveraging knowledge accumulated during iterative transfer, GeCo-SRT method not only achieves 52% average performance improvement over the baseline, but also demonstrates significant data efficiency for new task adaptation with only 1/6 data.We hope this work inspires approaches for efficient, low-cost cross-task sim-to-real transfer.

Abstract:
In edge-cloud environments, efficiency of speculative decoding is heavily constrained by uplink transmission and cloud-side verification. In this work, we identify a phenomenon we term credit inertia, where acceptance rates of adjacent token windows exhibit strong temporal consistency. Tokens following recently well-performing windows are likely to pass verification, whereas tokens following poorly performing windows are likely to fail. Motivated by this observation, we propose E^2-SCI, an elastic edge-cloud speculative decoding framework that dynamically adjusts draft token verification thresholds based on recent historical performance. This adaptive mechanism allows system to be more permissive for windows with strong historical performance and stricter for windows with weak performance, effectively leveraging temporal consistency to reduce overall latency. We further introduce Progressive Lookahead Concurrency (PLC), which pipelines draft generation and verification asynchronously to hide latency. Experiments across multiple benchmarks show that E^2-SCI achieves over 9.4 tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), delivering an 88.5% speed improvement over FSD baseline while maintaining accuracy. Notably, E^2-SCI integrates seamlessly with existing frameworks, demonstrating broad applicability and superior efficiency-quality trade-offs.

Abstract:
Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (\underline R easoning-preserving \underline M LLM \underline U nlearning via \underline S ubspace guidance and Adaptive St \underline E ering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.

Abstract:
We propose a novel learning-based task named fractured object recovery. Unlike the previous fractured object reassembly task that only aligns existing parts with overlaps, our task aims to recover the complete shape by not only reassembling irrelevant parts but also predicting missing parts. Our task coincides with practical experiences, where the prior knowledge of similar shapes can be leveraged such that even non-overlapping parts can be reasoned into adequate locations. We also present the first learning model for the proposed task by correlating features of both existing and missing parts using a transformer, where the latter is naturally represented as missing tokens. To facilitate the task, we introduce a new dataset based on the existing fractured object benchmark by imposing different configurations of missing parts. We perform extensive evaluations to demonstrate the performance of the proposed model over baselines. The results show that joint part reassembly and prediction can be made possible and also have mutual benefits, which we believe can inspire future research and favor real applications.

Abstract:
Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as condition signal contamination -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: \mathcal P ^ \mathrm i (pure identity), \mathcal P ^ \mathrm s (pure shape), and \mathcal P ^ \mathrm i+s (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers \mathcal P ^ \mathrm i+s toward optimal balance: \mathcal E _ \mathrm shape ensures sketch fidelity through layout and semantic alignment, while \mathcal E _ \mathrm id employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.

Abstract:
Scaling vision foundation models is constrained by the quadratic complexity of self-attention. Although subquadratic attention alternatives like linear attention variants and state-space models successfully reduce the model complexity, they typically serialize images into 1D token sequences, compromising spatial coherence and efficiency. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative that propagates context directly on the 2D grid via line-scan propagation and removes positional embeddings, yet the original design hits GPU-scaling limits: growing batch/channels saturate SM concurrency, serializing scans, and spiking latency. We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10x. We further improve efficiency with lightweight projections and fused CUDA kernels. To enable large-scale pretraining, we adopta two-stage cross-operator distillation strategy that combines layer-wise supervision with end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN achieves up to 2x speedup, maintains competitive zero-shot accuracy, and improves segmentation by +2.1%. Extensive experiments and ablations show that the proposed compression and two-stage distillation are criticalfor strong transfer while substantially reducing compute, enabling the first extension of a subquadratic operator to foundation-scale (CLIP-style) vision pretraining.

Abstract:
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark(MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.

Abstract:
Graphical User Interface (GUI) agents require effective utilization of historical context to perform sequential navigation tasks. While incorporating past actions and observations can significantly improve decision-making, naively using full history leads to excessive computational overhead and potential distraction from irrelevant information. In this work, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for effective and efficient utilization of historical information. HCPO explicitly optimizes history usage in both sampling and policy updates by integrating two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable-length histories during sampling, enabling adaptive use of the most relevant historical context to improve sequential decision quality; (2) Anchor-guided History Compression (AHC) refines the policy update phase via a dual-branch optimization strategy, where the compressed branch drops history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage, achieving efficiency with minimal performance degradation. Extensive experiments on mainstream GUI navigation benchmarks demonstrate the strong performance of our model. Despite its smaller size, HiconAgent-3B outperforms GUI-R1-7B by +8.46% grounding and +11.32% step successful rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW, with up to 2.47x computational speedup and 60% FLOPs reduction.

Abstract:
We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.

Abstract:
Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.

Abstract:
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

Abstract:
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation.

Abstract:
Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods typically design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and / or filter out low-confidence unlabeled samples to reduce mistakes in local client (internal). But, the former is hard to precisely fit the ideal global distribution via direct weights, and the latter results in fewer data participation into FL training. To this end, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. I.e., we consider the learnable weights of classifier as proxy to simulate the category distribution both locally and globally. For external, we explicitly optimize global proxy against outliers instead of direct weights; for internal, we re-include the discarded samples into training by a positive-negative proxy pool to mitigate the impact of potentially-incorrect pseudo-labels. Insight experiments & theoretical analysis show our significant performance and convergence in FSSL.

Abstract:
Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10xfaster unlearning compared to existing approaches on parametric models.

Abstract:
For video-text retrieval, the use of CLIP has been a de facto standard. However, as CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-1k, +1.9% on MSRVTT-3k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in light of the SumR metric.

Abstract:
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo--solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge--preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across nine multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning.

Abstract:
Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs, but real deployment is often limited roadside-only data due to hardware and privacy constraints. The observation that even human experts struggle to produce accurate labels without vehicle-side data reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle--roadside frames. We refer to such cases as inherently ambiguous samples. In this work, we develop an active learning framework for roadside monocular 3D object detection and propose a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method significantly outperforms uncertainty-based baselines, which suggests that learnability, not uncertainty, matters for roadside 3D perception.

Abstract:
Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.

Abstract:
Understanding complex surgical scenes requires recognizing multiple interdependent entities--such as instruments, actions, and targets--and maintaining their relational consistency across time. Existing surgical triplet recognition methods struggle to jointly model intra-frame label dependencies and inter-frame temporal semantics in a unified manner. To address these limitations, we propose a unified framework that integrates spatial, relational, and temporal cues for robust surgical triplet recognition. Specifically, class-specific spatial priors are first extracted through a multi-scale encoder. Then, these priors are refined by a Label Correlation Modeling module with multi-scale class activation map-guided relational extraction (MS-CAMRE), enabling the model to capture both static co-occurrence and dynamic contextual dependencies among triplet components. Furthermore, a Bidirectional Temporal-Relational Fusion Attention (BTRFA) module harmonizes temporal and relational representations to achieve coherent temporal reasoning. We also introduce a new evaluation metric, the Triplet Consistency Error Rate (TCER), which quantitatively measures the model's capability to preserve causal and semantic consistency across triplets. Extensive experiments on the CholecT45 and ProStaTD datasets show that our method achieves state-of-the-art (SOTA) performance, improving AP_ IVT by 5.1% and 7.8%, respectively. Moreover, on the TCER metric, our approach yields over 36% and 25% relative reductions on the two datasets, respectively, underscoring the effectiveness of our framework in temporal-relational co-reasoning.

Abstract:
Low-Light Image Enhancement (LLIE) is a challenging task, as severe information loss means a single input can correspond to multiple plausible restorations. This inherent ambiguity causes conventional regression-based models to produce overly-smooth results that lack detail. While recent generative models can create richer details, their common unidirectional design often compromises content fidelity by distorting original structures. We introduce Bi-Bridge, a unified framework that models both enhancement and its inverse degradation within a single symmetric diffusion bridge. By compelling the network to preserve essential content structures across both transformations, this bidirectional learning acts as a powerful constraint, leading to significantly more faithful and realistic restorations. Extensive experiments show that Bi-Bridge outperforms state-of-the-art (SOTA) methods across multiple benchmarks, establishing a new standard for fidelity and perceptual quality.

Abstract:
Defect synthesis, as a core technology for addressing the problem of few-shot defect classification, has been widely adopted in industrial scenarios. It helps alleviate the problem of insufficient model generalization capability owing to data scarcity by establishing a data augmentation pipeline. Recently, remarkable progress has been achieved in both explicit defect image generation and implicit defect feature synthesis approaches. However, existing methods are always conducted in Euclidean space. Constrained by the flatness of Euclidean space, it is difficult to synthesize defect data containing complex structures. In this paper, we attempt to explore defect generation in hyperbolic space and propose a hyperbolic defect feature synthesis (HypDFS) method. By modeling the potential defect distribution via a small number of hyperbolic defect prototypes and further optimizing the synthetic defect features with the hierarchical defect contrastive loss in hyperbolic space, our HypDFS method can obtain a better generalized defect representation that is more conducive to downstream few-shot defect classification task. Extensive experiments on the MVTec-FS benchmark and standard MTD dataset under the few-shot settings demonstrate that the proposed HypDFS surpasses the Euclidean baseline by a large margin, showing promising prospects for defect synthesis in hyperbolic space.

Abstract:
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.

Abstract:
Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design prompts suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.

Abstract:
Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and view point captured between two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird's-eye-view (BEV) projection for estimating full 6 Degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with extreme wide base-lines and limited overlap. We evaluate our method on challenging MatrixCity, ACC-NVS1 and ULTRRA ground-aerial pairs, demonstrating that optimizing with learned geometric priors can further improve the camera pose estimation across diverse altitudes and environment.

Abstract:
Depth priors provide salient geometric structure that benefits camouflaged object detection (COD), but directly using Monocular Depth Estimation (MDE) causes a task misalignment that still fails to identify camouflaged objects.To address this issue, we propose the Depth Segment Anything Model (DepthSAM), a MDE-adapted method specifically designed to mitigate this misalignment.DepthSAM incorporates two core innovations: (1) a Sparse Mixture-of-Experts Adapter (SMEA) that enables MDE to learn semantic information unique to camouflaged scenes, and (2) a Geometric-Semantic Fusion Module (GSFM) that efficiently integrates geometric cues with high-level semantics. With these components, DepthSAM achieves both robust semantic understanding in camouflaged environments and accurate segmentation of camouflaged objects.Extensive experiments show that DepthSAM achieves new SOTA performance on three major benchmarks. For example, on COD10K, its S_ \alpha and F_ b ^ \omega metrics surpass the best competing methods by 3.0% and 4.3%, respectively.

Abstract:
While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions--a successful strategy in other vision domains--yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.

Abstract:
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00xacceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25xspeedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00xacceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.

Abstract:
We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions.

Abstract:
Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we validate is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.

Abstract:
Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) -- a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

Abstract:
Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on Mixture of Experts (MoE).Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles.

Abstract:
Advances in diffusion-based video generation models, while significantly improving human animation, poses threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention, and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the Lab and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

Abstract:
3D generation and reconstruction have become essential in many computer vision applications, where the reconstructed or generated 3D shapes need to appear realistic to human perception. However, traditional metrics like Chamfer Distance to compare two 3D shapes focus primarily on matching accuracy of the shape geometry and fail to capture perceptual fidelity in the shape. While frequency-based metrics attempt to analyze shape details in the spectral domain, they still do not fully encapsulate the complexity of human perception. To address this gap, we propose a human-aligned fidelity metric that leverages local shape connectivity through a local attention mechanism to capture rich, detailed shape information. We also introduce the two-branch Real Shape Fidelity (RSF) dataset, including a main subset and test-only subset. This dataset generates 3D mesh distortions using real-world reconstruction and generation methods and annotated by hundreds of human subjects. Our metric named Local-Connection-based Shape Evaluation (LoCaSE), utilizes a PointNet-based backbone combined with Low-Rank Adaptation (LoRA)-style pretraining and finetuning to reduce model bias, while maintaining translation, rotation, and scale invariance. Experiments demonstrate that our approach achieves superior alignment with human perception compared to previous metrics.

Abstract:
Accurate 3D tooth segmentation is fundamental for digital dentistry, orthodontic analysis, and clinical simulation. Intraoral scan (IOS) models often suffer from incomplete or unreliable texture information, making it difficult to delineate fine boundaries between teeth and gingiva, while 2D intraoral images provide rich semantic and chromatic information that can complement 3D geometry. Thus, we propose a novel Photo-guided 3D Model Tooth Segmentation framework, PMTSeg, that enhances 3D tooth segmentation by integrating texture cues from intraoral photos. Our framework introduces three key components: a Camera Alignment Module (CAM) for accurate image-model registration, a Feature Filtering Gate (FFG) for adaptive multi-view feature selection, and a Consistent Feature Learning (CFL) mechanism for learning texture-geometry correspondence. Our method supports arbitrary numbers and views of intraoral photos. Experiments show significant improvements in distinguishing adjacent teeth and tooth-gingiva boundaries, demonstrating that intraoral photographs serve as an efficient, semantically rich supplement to 3D scans for precise dental segmentation.

Abstract:
Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of \method, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

Abstract:
Large-scale pre-trained text-to-image (T2I) diffusion models, such as Stable Diffusion, can be finetuned for image super-resolution (SR) with highly realistic details. While impressive, pre-training such multi-modal models demands billions of high-quality text-image pairs and substantial computational resources, despite that SR is fundamentally an image-to-image (I2I) task. This raises a critical question: do we truly need multi-modal priors and billion-scale text-image data to solve a purely vision task? In this paper, we propose VOSR, a Vision-Only Super-Resolution framework that eliminates the need for textual priors and multi-modal pretraining. We identify two key limitations in previous image-based, uni-modal diffusion models: limited visual semantic guidance and unstable unconditional training. To this end, we leverage a pretrained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. To accelerate inference, we adopt a modified shortcut model for one-step SR with minimal quality degradation. VOSR is trained from scratch with significantly less data and a lower computational cost compared to T2I-based diffusion models. However, VOSR achieves comparable or even better performance than state-of-the-art T2I-tuned SR methods on both synthetic and real-world benchmarks, demonstrating its potential as a scalable and competitive alternative for generative SR. Codes and models will be made publicly available.

Abstract:
Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that MatMart achieves superior performance in material reconstruction compared to existing methods.

Abstract:
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advancedvideo generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restorehigh-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherenceunder extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this sample efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior rollout sample efficiency, achieving similar validation accuracy with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, sample-efficient, and effective RLVR for complex multimodal reasoning tasks.

Abstract:
We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images.Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view.To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions.We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.

Abstract:
The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods either assume known poses or rely on depth estimation models that are bounded by their initial observations, failing to generalize as the camera traverses previously unseen regions. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.

Abstract:
Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to the generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on ImageNet 256 and 384 generation resolution, respectively, leading to a reduction of up to 40.4% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 29.6% relative to baseline models.

Abstract:
The unrealistic reliance on abundant prior information in traditional transferable attacks has spurred the Practical No-box Scenario (PNS), where attackers can access only limited unlabeled images. However, existing methods rely on iterative optimization to produce adversarial examples with inherently limited inference speed and transferability. Conversely, faster generative attacks fundamentally conflict with the PNS due to their critical dependence on abundant prior information that is explicitly absent in this scenario. To bridge this gap, we propose Prior-free Generative Attack (PGA), the first generative attack tailored for the PNS. Specifically, we introduce the Curriculum-Guided Micro-Robust Optimization that progressively incorporates more challenging discriminative tasks to mitigate the degenerate solutions common in self-supervised learning with limited data, yielding robust and transferable surrogates for downstream attacks. Furthermore, the Region-Aware Consistent Perturbation Learning guides the generator to produce fine-grained and spatially coherent perturbations, mitigating the common pitfall of generative attacks falling into local optima under insufficient supervision. Extensive experiments demonstrate that our PGA achieves remarkable transferability across various settings with high inference speed. This work provides a more practical benchmark for future research on transferable attacks, revealing the great potential of generative attacks under the PNS.

Abstract:
Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into "sEMG language," integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.

Abstract:
3D visual grounding (VG) aims to localize target objects in 3D scenes based on free-form textual descriptions. Existing 3D VG models predominantly employ point-based backbones for point cloud feature extraction. Such methods require aggressive downsampling of the input point cloud, which sacrifices the fine-grained spatial details crucial for precise localization. This paper proposes PV-Ground, a novel 3D VG architecture based on effective text-guided point-voxel feature interaction. Our method leverages the complementary strengths of both voxels and keypoints: it employs a voxel-based feature extraction backbone to preserve high-resolution spatial details, while utilizing compact keypoints to aggregate these features for efficient, deep interaction with the textual query. Furthermore, we propose a text-guided keypoint sampling module to adaptively concentrate the keypoint distribution around the text-described object, enabling task-specific feature aggregation and significantly boosts model performance. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method. Our method achieves a performance improvement of 5.1% on the ScanRefer dataset and 5.6% on the ReferIt3D dataset, while also achieves over 4% improvement in the segmentation task. The code will be made publicly available.

Abstract:
Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces the retained cues to build cross scene semantic ties. Compared with baseline methods, the experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With a matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.

Abstract:
Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation.

Abstract:
This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images back-projected the graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.

Abstract:
Multimodal Deepfakes proliferating on social media threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. We present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 100k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities--image, audio, video, and audio-video talking head and supports a joint detection-localization-explanation protocol. For images, audio, and videos, we define a ternary task (real / partially manipulated / fully synthetic) with spatial or temporal localization masks for fine-grained reasoning. Talking heads are formulated as an audio-video fusion binary task targeting speaking digital humans and lip-synced avatar forgeries. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Code will be released.

Abstract:
Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3d bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.

Abstract:
The task of multi-view inpainting necessitates 3D consistency in the inpainted images. Most prior methods first employ single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which leads to undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, uses video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on the explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to utilize both the generative prior of a pretrained diffusion inpainting model and the reprojected cross-view appearance latents. We additionally propose a scalable data pipeline for stable training of LaRP. Extensive experiments demonstrate that LaRP outperforms prior methods in 3D consistency and novel view synthesis quality competitive with the state-of-the-art, while being 50x faster.

Abstract:
Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or using fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens summarizing video content across scales. We further introduce anchor-conditioned and segment-pooled contrastive losses-two complementary objectives that encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.

Abstract:
Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive and NAVSIM.

Abstract:
Cardiac magnetic resonance (CMR) is the clinical gold standard for assessing cardiovascular diseases, but its interpretation relies on expert experience and remains challenging, particularly for identifying rare diseases. Existing automated methods lack interpretable reasoning processes, limiting clinical adoption. Although vision-language models (VLMs) possess basic visual understanding and text generation capabilities, they still lack verifiable reasoning chains in medical diagnosis and underperform on minority classes in long-tail distributions. To address these challenges, we propose CMR-RD, to our knowledge the first VLM for interpretable diagnosis in CMR, capable of generating explicit diagnostic chains aligned with imaging evidence. We construct a CMR dataset that reflects real-world clinical distributions, comprising five disease categories (including two rare conditions) plus normal controls. Building on this, the general-purpose VLM is aligned to medical and CMR semantics using large-scale medical vision-text data, and cold-start training is used to enhance its understanding of medical concepts and basic reasoning. To enhance reasoning and performance on rare samples, we propose Group Phase Policy Optimization (GPPO), which combines online multi-stage reinforcement learning (RL) with adaptive sampling. GPPO enables the model to proactively explore rare and underperforming classes, thereby effectively mitigating long-tail bias. Experiments demonstrate that CMR-RD achieves highly competitive accuracy and reasoning-chain correctness compared with medical and general VLMs, shows stronger recognition of rare categories.

Abstract:
Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

Abstract:
We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing algorithm that estimates spectrum reception at the receiver. Experimental results demonstrate that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.

Abstract:
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.

Abstract:
Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

Abstract:
As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair \texttt retain base, with the Compromised model additionally trained on the \texttt sensitive set. We propose a novel evaluation protocol with structured holdout sets (\texttt holdout_identity , \texttt holdout_association ) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

Abstract:
Product poster generation presents unique challenges beyond general-purpose de-sign: it demands not only aesthetic composition and accurate text rendering, butalso strict preservation of the product subject and precise control over dense,multi-line text layouts. While general image editing models struggle with text lay-out control and subject consistency, existing specialized approaches--often builtupon inpainting frameworks--still suffer from unintended subject extension andinaccurate text synthesis. A common solution involves integrating auxiliary mod-ules such as ControlNet to condition on subject structure and text layout, but theseapproaches introduce significant architectural complexity and training overhead.In this work, we challenge the necessity of such complexity and demonstrate thatminimalist adaptation is sufficient. We introduce SimplePoster, a minimalist yetpowerful inpainting-based framework that enables faithful subject preservationand position-controllable text rendering--entirely without external controllers likeControlNet. SimplePoster rests on two key insights: (1) full-parameter fine-tuningalone effectively suppresses subject extension by aligning the model's internalrepresentations with domain-specific priors; and (2) a lightweight character-levelposition encoding strategy enables end-to-end, spatially grounded text generation.Experiments show that SimplePoster achieves near-perfect subject preservation(98.7% of cases with strict subject preservation), significantly outperforming boththe state-of-the-art editing model SeedEdit3.0 (55.2%) and the specialized ap-proach PosterMaker (85.3%). It further demonstrates superior text rendering ac-curacy, even in challenging scenarios with complex multi-line layouts. We believeSimplePoster establishes a simple yet strong baseline for product poster genera-tion. We question the necessity of such complexity and demonstrate that minimalist designs suffice. We propose SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and position-controllable text rendering without relying on external controllers like ControlNet. SimplePoster is based on two key insights: (1) full-parameter fine-tuning effectively suppresses subject extension; and (2) a training-free character-level position encoding strategy enables end-to-end, geometry-aware text generation. Remarkably, SimplePoster achieves a near-perfect subject preservation rate (98.7%), significantly outperforming SOTA models SeedEdit 3.0 (55.2%) and PosterMaker (85.3%). It also excels in text rendering accuracy. We believe SimplePoster establishes a simple yet strong baseline for product poster generation. Code, models and benchmark will be released upon acceptance.

Abstract:
Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons' spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves the state-of-the-art among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient.

Abstract:
Human perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a "sufficient-view" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce CVBench, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with verifiable single-view insufficiency, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open and closed-source MLLMs reveals a substantial performance gap, with the best models falling nearly 50 points behind human performance. We identify a systemic failure mechanism across all models: a dominant "Single-View Bias", whereby models ignore conflicting evidence and default to the most confident but incorrect single-view prediction. This demonstrates that current MLLMs lack the fundamental mechanisms for geometric grounding, identity persistence, and true spatio-temporal fusion. CVBench provides a rigorous diagnostic framework to catalyze the development of next-generation, cross-view-aware architectures.

Abstract:
Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

Abstract:
Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control.Yet their ability to generalize across new environments, tasks, and embodiments remains limited.We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs).However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world.Bridging this gap directly with large-scale robotic data is costly and difficult to scale.Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities.Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image.Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control.Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as \pi_0-FAST and \pi_ 0.5 , while it uses 20xfewer robot demonstrations.This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data.We make our model weights and 3D-annotated datasets publicly available.

Abstract:
Data matters. In computer vision, data (or pixels) are the primary source of information containing signals that span from low-level attributes to high-level concepts. At scale, the success of modern vision systems has been closely tied to how data is curated for semantic understanding (e.g., ImageNet). Recent trends in spatial intelligence and physical world understanding further highlight the importance of real-world signals that preserve spatial structure, beyond purely semantic signals. This motivates a shift toward curating data that better captures spatially grounded information across diverse environments. In this work, we demonstrate that training on 2B web-crawled images with a self-curation strategy on masked autoencoder (MAE) can learn strong representations for dense prediction tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. Pixio yields dense representations achieving promising results across a wide range of dense prediction tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), and visual segmentation (e.g., SAM). Our results suggest that data curation can significantly contribute to dense representation learning.

Abstract:
Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy-preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalizable ability in multiple domains. In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model's representations to collapse into a narrow low-dimensional subspace. We then propose Federated Feature Decoupling and Calibration (F^2DC), which liberates valuable class-relevant information by calibrating the domain-specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in F^2DC to determine the robustness of each feature unit, thereby separating the local features into domain-robust features and domain-related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain-related features by explicitly linking discriminative signals, capturing additional class-relevant clues that complement the domain-robust features. Finally, a domain-aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi-domain datasets demonstrate the effectiveness of the proposed F^2DC and the contributions of its two modules.

Abstract:
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities.

Abstract:
Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework to systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.

Abstract:
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.

Abstract:
Bayesian deep learning (BDL) integrates Bayesian inference with deep learning, improving predictive performance while enabling principled uncertainty quantification. However, existing BDLs often rely on non-informative random priors, limiting the benefits of Bayesian inference. In contrast, knowledge-augmented deep learning explicitly injects domain knowledge during training, yet lacks a probabilistic foundation. In this paper, we propose a knowledge-augmented BDL framework that integrates domain knowledge both as an informative prior and as an adaptive likelihood under a unified two-stage hybrid formulation. In the first stage, we learn a knowledge-informed prior p(\theta \mid \mathcal K ) by pre-training a model to satisfy domain-specific constraints. In the second stage, we perform Bayesian inference on task data with an adaptive knowledge likelihood p(\mathcal K \mid \theta, \mathcal D ), which dynamically enforces these constraints during optimization. This unified framework enables knowledge to guide both initialization and training, significantly improving prediction accuracy, robustness, adaptation and uncertainty estimation. Experiments on various computer vision tasks, including semi-synthetic and real-knowledge scenarios, demonstrate that our two-stage framework consistently outperforms state-of-the-art Bayesian and knowledge-augmented baselines.

Abstract:
Effective medical image analysis requires representations that capture both global anatomical structure and fine-grained tissue texture. Current self-supervised approaches exhibit limited capacity to address both requirements simultaneously. Invariance-based methods learn through augmentation consistency but face challenges in medical imaging where common augmentations may discard diagnostically relevant intensity patterns. Masked image modeling approaches employ high masking ratios to enforce holistic reasoning, yet inherently limit exposure to fine-grained texture. Recent work in general-domain vision demonstrates that generative and semantic objectives can mutually benefit each other, yet this paradigm remains unexplored for 3D medical imaging. We introduce Masked-Diffusion Autoencoders (MDAE), a self-supervised framework that imposes concurrent spatial masking and diffusion corruption, encouraging the model to learn complementary objectives: masked region reconstruction for structural coherence and visible region denoising for textural characteristics. This dual corruption enables the network to learn structure-texture representations within a unified time-conditioned objective. Evaluated on brain MRI across tumor classification, molecular marker detection, and dense segmentation benchmarks, MDAE consistently outperforms state-of-the-art baselines, with improvements most pronounced in cross-modal generalization tasks.

Abstract:
Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios. Our project will be available at https://shengpeng.wang/rauf.

Abstract:
Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping--where one image's head is seamlessly integrated onto another's body. Current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. These methods struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that our approach achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve both identity and expression fidelity across various head orientations and hairstyles. Notably, our method shows exceptional robustness in maintaining facial identity while drastic expression changes and faithfully preserving accessories while significant head pose variations.

Abstract:
Autonomous driving has made substantial progress recently, achieving reliable performance in most real-world environments. However, existing algorithms still depend heavily on high-definition maps, making them ineffective in mapless scenarios such as indoor parking lots. These limitations hinder seamless point-to-point navigation and restrict the broader deployment of the autonomous driving system.To address this challenge, we propose DriveVLN, a new task that extends Vision-and-Language Navigation (VLN) to autonomous driving. DriveVLN employs visual and linguistic priors to guide vehicles toward destinations based solely on concise natural-language descriptions, without access to predefined maps or routes. Unlike conventional VLN, which relies on detailed step-wise instructions in indoor environments, DriveVLN requires models to produce navigation information based on diverse visual cues and history, including signs, landmarks, and textual indicators.We further develop a CARLA-based simulation engine comprising over 200 realistic scenes reconstructed from real road scans, enabling large-scale training and closed-loop evaluation. A baseline model is established through supervised fine-tuning on real data, followed by reinforcement learning in simulation.Comprehensive experiments show that DriveVLN effectively bridges map-based and mapless driving, providing a new foundation for unified, language-driven autonomous navigation in complex real-world environments.

Abstract:
Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches.

Abstract:
High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) that refines only edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration) , ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without requiring any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing.

Abstract:
Deep learning models are well known to be susceptible to backdoor attacks, and text-to-image generation models are no exception. When a specific trigger is embedded in the input, a backdoored model can be manipulated to perform attacker-defined malicious behaviors, such as generating harmful or inappropriate images. Existing backdoor attacks on text-to-image generation models are largely limited to dirty-label attacks, where misaligned image-caption pairs are injected into the training data. While effective in controlled settings, such methods are often easily detectable, limiting their practicality in realistic applications. To address this limitation, we propose the first clean-label backdoor attack for text-to-image generative models, which preserves semantic consistency within poisoned image-caption pairs to evade detection. We design a dual-modality manipulation strategy that injects nearly imperceptible noise into images while embedding a composite semantic text trigger. The text trigger combines synonym substitution and syntactic restructuring, enabling stealthy yet effective backdoor implantation without compromising the visual-textual alignment. Experimental results demonstrate that our method achieves high attack success while effectively preserving model utility and evading mainstream defenses, including commercial content filters.

Abstract:
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.

Abstract:
Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can efficiently process video streams without compromising performance against existing state-of-the-art methods, with smaller DSFlash variants achieving 56 frames per second on a standard RTX 3090 GPU. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.

Abstract:
In everyday conversations, humans effortlessly recognize communication partners using visual cues such as gaze or head orientation. Replicating this social reasoning in computer vision is challenging, especially in dynamic, multi-person settings. We introduce Communication Context Identification (CCI) in egocentric vision: Given a first-person video sequence, determine which individuals are engaged in communication with the camera wearer. To support CCI, we collected a challenging large-scale dataset comprising 68.9 hours of egocentric video captured across diverse multi-person, multi-conversation scenarios. We propose CoCoNet, a temporal interaction model for CCI that tracks social dynamics via attention across individuals over long time scales. CoCoNet flexibly handles varying group sizes, maintains predictions through occlusions, and performs robustly even with limited temporal input. Leveraging long temporal contexts, it achieves 96% balanced accuracy on CCI. Performance varies with group size and spatial scene layout, highlighting the importance of dataset diversity. Our work advances vision-based conversational awareness, enabling applications in assistive hearing that use egocentric video to enhance individuals in the user's conversation group.

Abstract:
High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.

Abstract:
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.

Abstract:
Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.

Abstract:
The LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance for the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter to better classify the candidate proposals after multi-agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels by the easy-to-hard curriculum learning. Furthermore, we design a Cross-View Consensus Learning to use multi-agent cooperative view to guide detection in single-agent view. Experimental results on two public datasets V2V4Real and OPV2V show that our UMS method achieved significantly higher 3D detection performance than the state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised setting.

Abstract:
Cross-modal transfer methods have achieved significant progress in extending RGB-based foundation models to non-RGB modalities. However, existing transfer paradigms are primarily task-oriented, meaning that changing tasks requires re-training and re-storing, leading to substantial redundancy in data, computation and storage. To address this limitation, we propose an efficient cross-modal transfer paradigm that decouples the process into a one-time general modality knowledge transfer and a flexible task knowledge transfer. In Stage 1, we propose a Progressive Self-Supervised Tuning strategy that integrates modality-aware structural reconstruction with semantic discriminative learning, which enables task-agnostic modality knowledge learning using only unlabeled data through a one-time training process, resulting in reusable target-modality LoRAs. In Stage 2, we incorporate the modality LoRAs and further propose a Task-Prompted Mixture-of-Modality Experts module. This design enables lightweight task knowledge injection while effectively balancing task-specific, modality-general and modality-specific knowledge in multimodal fusion process for diverse downstream tasks. Extensive experiments across six cross-modal transfer scenarios, along with analyses of data, computation, and storage efficiency, demonstrate the superiority of our method.

Abstract:
We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.

Abstract:
The deployment of automatic radiology report generator (RRG) models in clinical workflows is being hampered by the lack of factual correctness in the produced reports. Existing methods to improve the report generators use alignment approaches that require pairs of ground truth preferred and dis-preferred responses. As these are not available at inference time in clinical workflows, new alignment methods are needed to improve report quality at inference time. In this paper, we present a new phrase-grounded automatic preference optimization (APO) alignment method which offers such improvement during inference without needing additional ground truth. Specifically, the method generates surrogate ground truth preference data for alignment automatically from the RRG model response itself though fact-checking and LLM-prompted correction. We also develop a novel APO loss function that combines preference response alignment loss with phrasal grounding loss paying attention to both the description of the finding and its image location. We show that this method of alignment, on the average, improves the report quality at inference time by 30-40% across various SOTA report generators as tested on multi-institutional chest X-ray datasets.

Abstract:
Multi-task learning (MTL) involves the simultaneous optimization of multiple task-specific losses, often leading to gradient conflicts and scale imbalances that result in negative transfer. While existing multi-task optimization methods attempt to mitigate these challenges, they either lack the stochasticity needed to escape poor local minima or fail to explicitly address conflicts at the gradient level. In this work, we propose TaskForce, a novel multi-task optimization framework incorporating cooperative multi-agent reinforcement learning (MARL), where agents learn to find an effective joint optimization strategy based on their respective task gradients and losses. To keep the optimization process compact yet informative, agents observe a summary of the training dynamics that consists of the gradient Gram matrix---capturing both gradient magnitudes and pairwise alignments---and task loss values. Each agent then predicts the balancing parameters that determine the weight of their contribution to the final gradient update. Crucially, we design a hybrid reward function that incorporates both gradient-based signals and loss improvement dynamics, enabling agents to effectively resolve gradient conflicts and avoid poor convergence by considering both direct gradient information and the resulting impact on loss reduction. TaskForce achieves consistent improvements over state-of-the-art MTL baselines on NYU-v2, Cityscapes, and QM9, demonstrating the promise of cooperative MARL in complex multi-task scenarios.

Abstract:
Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, geographic coordinates, etc.). Current benchmarks have limited coverage across modalities, leading to specialized models that perform well in their respective domains, but do not fully take advantage of other geo-spatial modalities. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one landmark level correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We show that current specialized and off-the-shelf foundation models cannot be trivially used to solve this variety of geo-spatial tasks, illustrating a gap where multimodal datasets lead to broader geo-spatial understanding. We employ a simple CLIP-inspired baseline that reflects versatility and broad generalization when trained with MMLandmarks.

Abstract:
Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released.

Abstract:
Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.

Abstract:
Multimodal Continual Learning (MMCL) aims to enable models to continuously accumulate knowledge across multiple tasks and modalities without forgetting prior information. MMCL presents more challenges than single-modal continual learning, as it requires effective cooperation and complementarity between modalities. Existing methods often treat modality alignment as a static process, assuming once alignment is established, it remains fixed. However, we argue that modality alignment is inherently dynamic, evolving with task learning and feature propagation across layers. To address this, we introduce Dynamic Alignment Graph Regularization (DAGR), a novel approach that explicitly models the evolving alignment across layers. By incorporating multi-level graph regularization, our method stabilizes the alignment process and mitigates catastrophic forgetting. Extensive experiments on benchmarks, such as MTIL, show that DAGR outperforms static alignment-based methods and other continual learning techniques, achieving superior stability.

Abstract:
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.

Abstract:
In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physical artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework that begins with a kinematic estimate and then refines it through a reinforcement learning (RL) policy trained to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that automatically identifies the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods.

Abstract:
Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Abstract:
Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and 360^\circ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-\phi, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.

Abstract:
Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.

Abstract:
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens, ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves 15.7x kv-cache compression ratio; compared to prior SOTA (LiveVLM), it delivers 1.2x lower peak memory and 2x faster TTFT. StreamingTOM achieves state-of-the-art accuracy among training-free methods with an average of 63.8% on offline benchmarks and 55.8% accuracy and 3.7 score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.

Abstract:
Distilling latent diffusion models (LDMs)/rectified flow models (RFMs) into ones that are fast to sample from conditions is attracting huge interest. However, the majority of existing methods either need significant training resources or lead to quality degradation, especially in text-image alignment. To address these challenges, we propose Local-to-global Consistency Distillation (LogCD) to accelerate LDMs/RFMs via two-stage distillation. LogCD first performs local consistency distillation and then executes global consistency distillation to ensure the consistency along inference path. Besides, Latent Learned Perceptual Image Patch Similarity model is exploited to enhance perceptual consistency. Notably, LogCD exhibits high flexibility, allowing a single unified model to operate with 2 to 4 sampling steps. The model's performance improves seamlessly as the number of steps increases within this range. With only 70 A100 GPU hours, LogCD accelerates SDXL to achieve a 33.5 CLIP score with just 3 sampling steps, surpassing state-of-the-art accelerated models using even more steps.

Abstract:
Vision-language models (VLMs) demonstrate powerful capabilities in multimodal tasks. However, the large number of visual tokens imposes a significant computational cost. In this paper, we propose QuietPrune, a QUery-guIded Early Token Pruning method to remove redundant visual tokens within VLMs, thereby enhancing computational efficiency. Unlike previous late pruning methods, we recognize that implementing early pruning within the vision transformer (ViT) can achieve benefits in both latency reduction and accuracy maintenance. To address the semantic loss problem in early pruning, we design a lightweight adapter by performing an inverse transformation of the projector in VLMs. The proposed adapter converts the contextual query into a visual domain [Q-CLS] (Query [CLS]) token, providing textual guidance for ViT pruning. During pruning, we further introduce a semi-structured pruning scheme based on visual-textual relevance. Specifically, we group spatially adjacent 2 x2 tokens to accommodate the visual token merging operation prevalent in mainstream VLMs. We use the mean attention scores between the [Q-CLS] token and the visual tokens as the relevance metric for each group, avoiding additional computation. Pruning is then applied at the group level based on the relevance score, preserving positional continuity. After pruning, we aggregate the redundant tokens into a single token to maintain context cues. Our method achieves up to 19.0% reduction in prefill latency while outperforming 4.2% in accuracy on the recent Qwen3-VL and InternVL3 series compared to existing late pruning methods.

Abstract:
Few-shot learning has attracted extensive attention, with metric-based approaches such as Prototypical Networks establishing strong baselines. These methods construct class prototypes from support samples and classify query samples via distance metrics, but their performance is highly sensitive to label noise. To tackle this challenge, we propose a novel Graph Attention Prototypical Network (GAPNet) for robust few-shot classification. GAPNet first extracts local and global features via a classic CNN backbone and a group attention broad learning module, respectively. To mitigate the impact of label noise, the intra-class and inter-class relationships between support and query samples are explicitly modeled via a pseudo-label guided graph constructor, and then processed by an edge-aware graph attention module to capture topological correlations. Furthermore, an adaptive noise-robust prototype generator is introduced to dynamically suppress the contributions of noisy samples, substantially improving the reliability of class prototypes. Extensive experiments demonstrate the effectiveness and robustness of GAPNet to label noise. Compared to state-of-the-art approaches, GAPNet improves accuracy in the 5-way 5-shot setting by 3% ~ 8% on three general image benchmarks and one fine-grained classification dataset.

Abstract:
With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PI-DEF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a physics-agnostic approach. We demonstrate how our method may be used to estimate other physics parameters of the black hole, such as its spin.

Abstract:
Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight, only 21K additional parameters, and plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2--4x compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20--35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at https://cvpr26-matched.github.io.

Abstract:
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.

Abstract:
The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects model unreliable behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck of model rectifying arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliabilities. Extensive experiments demonstrate the effectiveness of our method in correcting unreliabilities observed for neural Trojans, spurious correlations and feature leakage. Our method shows remarkable performance by achieving its editing objective with as few as a single cleansed sample, which makes it appealing for practice.

Abstract:
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison.

Abstract:
Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs.

Abstract:
Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.

Abstract:
Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation.However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss, inspired by relational knowledge distillation, that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior 3D AR models. Code is available via the project website at https://lost3d.github.io/

Abstract:
Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.

Abstract:
End-to-end differentiable learning has emerged as a prominent paradigm in autonomous driving (AD). A significant bottleneck in this approach is its substantial demand for high-quality labeled data, such as 3D bounding boxes and semantic segmentation, which are especially expensive to annotate manually. This challenge is exacerbated by the long tailed distribution in AD datasets, where a substantial portion of the collected data might be trivial (e.g. simply driving straight on a straight road) and only a minority of instances are critical to safety. In this paper, we propose ActiveAD, a planning-oriented active learning strategy designed to enhance sampling and labeling efficiency in end-to-end autonomous driving. ActiveAD progressively annotates parts of collected raw data based on our newly developed metrics. We design innovative diversity metrics to enhance initial sample selection, addressing the cold-start problem. Furthermore, we develop uncertainty metrics to select valuable samples for the ultimate purpose of route planning during subsequent batch selection. Empirical results demonstrate that our approach significantly surpasses traditional active learning methods. Remarkably, our method achieves comparable results to state-of-the-art end-to-end AD methods - by using only 30% data in both open-loop nuScenes and closed-loop CARLA evaluation.

Abstract:
Isotropic microscopy reconstruction remains challenging because the anisotropic point spread function in optical systems yields much poorer axial resolution and hampers accurate 3D analysis. Hardware strategies can approach isotropy, yet they are complex, costly, susceptible to sidelobes, and introduce phototoxicity. Deep learning approaches reduce acquisition burden, yet synthetic pipelines often rely on Gaussian blur mismatched to physical degradation, and many methods lack explicit volumetric geometry constraints as they process 2D slices independently, resulting in low-fidelity reconstructions. To address these challenges, we present MicroFM, which synthesizes realistic training data using physical PSFs matched to the target microscope. MicroFM also introduces a physics-guided flow-matching framework for isotropic microscopy reconstruction, guided by a continuous implicit geometry prior to achieve high-fidelity isotropic recovery. Across four fluorescence microscopy systems and datasets, MicroFM achieves state-of-the-art (SOTA) performance, producing sharper structures, more isotropic spectra, and substantial gains in both full-reference and no-reference metrics.

Abstract:
While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent large language models (LLMs), physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

Abstract:
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps usually require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52xand 3.2xspeedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.

Abstract:
Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 35% on in-distribution dataset and 36% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

Abstract:
Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids high computational cost of full cross-attention for multi-view feature interaction: (i) multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods.

Abstract:
Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.

Abstract:
Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.

Abstract:
Affordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt-based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large-scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross-modal alignment mechanism. Specifically, we propose Cross-Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconstruction and feature diversity, which together encourage structured and discriminative feature learning. Built upon the CMAT-pretrained backbone, we employ a lightweight affordance segmentor that injects text or visual prompts into the learned 3D space through an efficient cross-attention interface, enabling dense and prompt-aware affordance prediction while preserving the semantic organization established during pretraining. Extensive experiments demonstrate consistent improvements over previous state-of-the-art methods in both accuracy and efficiency.

Abstract:
Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3x faster inference with a 2x smaller model size compared to ESAM, enabling practical deployment on edge devices.

Abstract:
Camera-based Semantic Scene Completion (SSC) is able to comprehensively understand the entire scene, but it suffers from ambiguous predictions due to occlusions and incomplete information. Temporal SSC alleviates this issue, but existing models simply stack multi-frame temporal features, which can lead to inconsistencies between geometry and semantics over time. In this paper, we present ConSSC, a novel SSC method that learns Spatial-Temporal Consistency. It works by lifting historical frames into a 3D scene-level occupancy framework, aggregating 2D and 3D historical features from current voxels, and learning from 2D visibility and similarity cues in a temporal buffer. Specifically, our framework introduces two key components: the Hierarchical Voxel Refinement module, which extracts a coarse occupancy from depth and refines it through voxel-level representations, recovering missing information. The Temporal Semantic Aggregation module effectively integrates semantic features from different viewpoints and time points, enabling the reconstruction of occluded regions in the current frame using historical context, aggregating them into corresponding voxel features. Without additional sensors or data, ConSSC improves both geometric and semantic consistency. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets show that ConSSC outperforms state-of-the-art camera-based and temporal SSC baselines by a significant margin in terms of IoU and mIoU.

Abstract:
Test-time training (TTT) enhances the robustness of pretrained models to out-of-distribution (OOD) data through auxiliary self-supervised tasks, without requiring labeled samples. However, existing TTT methods predominantly rely on decoder-based auxiliary objectives, which suffer from inefficient adaptation and weak coupling with the primary task. To solve these limitations, we revisit the mechanism of test-time training by analyzing masking-based pretrained models to uncover the fundamental source of their OOD robustness. Our investigation reveals that their generalization capability stems from a latent feature-level structural invariance, the consistency of encoded representations under masked perturbations. Building on this insight, we introduce LoTT-PC, a lightweight LoRA-based framework that operationalizes this invariance-preserving principle for 3D point cloud classification. LoTT-PC consists of two main components: (1) low-rank modulation units for parameter-efficient adaptation, and (2) a permutation-invariant alignment mechanism that enforces representation consistency through masked feature alignment. Extensive experiments on multiple benchmarks demonstrate that this unified design enables pretrained point cloud models to self-tune rapidly and reliably across diverse OOD scenarios, outperforming state-of-the-art methods by an average of 2.7% in accuracy under various corruption types.

Abstract:
We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relationships. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.

Abstract:
Chain-of-Thought (CoT) has recently shown encouraging progress in the vision language model. However, the pure-vision CoT (i.e., chain-of-vision) has been underexplored in visual in-context learning. In this paper, we introduce Diffusion Guided Chain-of-Vision, which integrates an explicit chain-of-thought process into autoregressive vision models through vision prior from pre-trained diffusion models. Concretely, we find that pre-trained diffusion models induce a reliable probability flow in image space, where intermediate images sampled along this flow exhibit visual coherence and serve as task-free, chain-of-vision supervision for pure-vision autoregressive models. Extensive experiments on diverse vision tasks and multi-scale models validate the effectiveness of our proposed method for visual in-context learning. Code and dataset will be publicly available.

Abstract:
Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

Abstract:
With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as in healthcare or autonomous driving, the calibration of their uncertainty estimates has become paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has prioritized on improving their discriminative performance. Recent state-of-the-art methods advocate for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities. Our code is available at github.com/leofillioux/SoC.

Abstract:
We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.

Abstract:
Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero-shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task-specific discriminability. We introduce ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low-rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens' theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug-and-play module on top of various state-of-the-art methods, and across different prediction settings (zero-shot, few-shot and test-time adaptation), ORION improves the performance consistently and significantly.

Abstract:
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes as point clouds or images. We take this further with Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks and assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D_0, which requires no training, and the self-supervised Zoo3D_1, which refines 3D box prediction by training a class-agnostic detector on Zoo3D_0-generated pseudo labels. We further extend Zoo3D beyond point clouds to operate directly on posed or even unposed images. Across ScanNet200 and ARKitScenes, both Zoo3D_0 and Zoo3D_1 achieve state-of-the-art results in open-vocabulary 3D object detection, with zero-shot Zoo3D_0 outperforming all existing self-supervised methods.

Abstract:
Photographic style preferences are deeply personal, varying across individuals in color and tonal aesthetics. We introduce Personalized Photographic Style (PPS) learning, where the goal is to capture a user's implicit preferences from comparative judgments and apply them consistently across diverse images. To establish a foundation for this problem, we present three contributions. First, we introduce PPSD, a dataset containing pairwise preference judgments from 767 users, each providing an average of 70 comparisons. To capture diverse style signals, images are sourced from professional edits, device pipelines, and generative models. Second, we explore several baseline models demonstrating the feasibility of adapting style transfer and enhancement approaches for preference learning. Third, we develop a comparative evaluation framework suited to the implicit nature of personal preferences. We will make our dataset publicly available, and hope this work serves as a foundation for advancing research in personalized photographic style learning.

Abstract:
Mitigating noisy correspondence in cross-modal matching poses a serious challenge due to the problem of error accumulation. Existing methods primarily attribute this accumulation to errors caused by noisy sample pairs. However, a novel source of error from clean sample pairs (also termed anchor pairs) is discovered in this paper. Such error accumulation is considered to arise from modality-inconsistent correlations. To address this issue, a novel method termed Geometric-Semantic Learning (GSL) is proposed. Firstly, GSL leverages the Fourier transform to emphasize semantic representations and reduce cross-modal inconsistencies caused by perturbations in non-critical fine-grained features, thereby alleviating the error accumulation problem. After that, a Geometry-Aware Label Correction (GALC) method is introduced to re-estimate soft correspondence labels by leveraging angular consistency between noisy sample pairs and anchor pairs across different modalities. Finally, a semantically constrained triplet loss is employed to regulate sample distances using semantic information, enabling robust separation of clean and noisy pairs during the training process. Extensive experiments on three benchmark datasets demonstrate that GSL consistently outperforms existing methods in retrieval accuracy.

Abstract:
The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.

Abstract:
Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-p, the first flow-based generative model as GUI dexterous hand, featuring the following designs:(i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes;(ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories;(iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities.Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-p achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world.

Abstract:
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

Abstract:
We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation.To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision.The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction.Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.

Abstract:
Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience.To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience.Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.

Abstract:
Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.

Abstract:
Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a video generative model based storyboard framework that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable video model-driven visual storytelling.

Abstract:
Open-set fine-grained retrieval (OSFR) is a challenging task where models must generalize to unseen subcategories. Existing methods often fail this, as they embed category-specific semantics from closed-set training labels. Recently, diffusion transformers (DiT) have shown promise by encoding attribute-centric, generative curriculum knowledge that is agnostic to these labels. However, the vanilla DiT is not optimized for fine-grained visual discrepancies and its massive size makes deployment infeasible. To solve this, we propose DiT-Distill, a framework to first refine and then distill this knowledge. We introduce a conditional discrepancy refinement strategy to fine-tune the DiT, forcing it to focus on discrepancy-aware, attribute-centric details rather than holistic context. Subsequently, a generative curriculum distillation mechanism transfers this refined, hierarchical knowledge from multiple diffusion timesteps of the DiT into a lightweight backbone using a generative infusion module and a curriculum alignment loss. This process results in an efficient retrieval model that enables DiT-free inference. Extensive experiments show DiT-Distill achieves state-of-the-art performance on open-set fine-grained datasets.

Abstract:
Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code is available at https://sgvr.kaist.ac.kr/Visual-RRT.

Abstract:
GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.

Abstract:
This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model's performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach.

Abstract:
Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making.

Abstract:
As autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios, including rainy, snowy, foggy, dusty, high-glare, night, cloudy, and sunny conditions. Additionally, URScenes supports multi-task perception for 3D object detection, multi-object tracking, and 3D occupancy in unstructured road environments. URScenes also provides a unified annotation system and format conversion tools, enabling easy conversion to popular formats such as nuScenes, KITTI, and Waymo datasets. Finally, this study presents comparative experimental results to assess the performance of state-of-the-art algorithms on the URScenes dataset. The data and development toolkit are available at http://www.sav-lab.com.

Abstract:
Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.

Abstract:
While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs' visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently boosts the performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.

Abstract:
We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly (bypassing parametric models). DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space MPJPE error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space MPJPE error. Project page: yufu-wang.github.io/duomo/

Abstract:
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment.In this work, we propose TTRV to enhance vision-language understanding by adapting the model on-the-fly at inference time, without the need for any labeled data.Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times.Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution.Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to Intern-VL-8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, \method still yields non-trivial improvements of up to 5.5% in recognition tasks.

Abstract:
Multimodal sentiment analysis (MSA) seeks to infer human emotions by integrating heterogeneous signals from text, audio, and visual modalities.Although recent approaches attempt to leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities.In practice, the expressive power across modalities is inherently imbalanced: dominant modalities tend to overshadow non-verbal ones, which not only limits their contribution but also induces modality competition during training.This imbalance leads to degraded fusion performance and poor robustness under noisy or missing modalities.To address these challenges, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC first improves representational quality via modality semantic disentanglement (MSD) and cross-modal complementary enhancement (CCE), which strengthens weaker modalities using information from other modalities.To prevent dominant modalities from overwhelming others during joint optimization, EBMC introduces an Energy-guided Modality Coordination (EMC) mechanism that models modality contributions via energy potentials and achieves implicit gradient rebalancing through a differentiable equilibrium objective.Further, an Instance-aware Modality Trust Distillation (IMTD) module estimates sample-level modality reliability and adaptively modulates fusion weights, ensuring robustness against noise and modality incompleteness.Extensive experiments on multiple MSA benchmarks demonstrate that EBMC achieves state-of-the-art or competitive results.Moreover, EBMC maintains strong performance under missing-modality settings, highlighting its effectiveness and robustness.

Abstract:
The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instances cross attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

Abstract:
Post-training quantization (PTQ) serves as a vital technique for efficiently compressing large-scale models, with weight-compensation methods such as GPTQ (symmetric calibration) and GPTAQ (asymmetric calibration) showing remarkable success. However, directly applying these methods to Vision-Language Models (VLMs) exposes two notable shortcomings: 1) the standard rounding-to-nearest (RTN) method is suboptimal for the asymmetric objective, failing to account for residual-induced shifts in the optimal quantization target; and 2) all input channels are processed uniformly across modalities, overlooking the distinct information densities. In this paper, we introduce VLM-PTQ, an asymmetric post-training quantization framework for VLMs. First, we derive a closed-form correction term that shifts the quantization target, which explicitly accounts for the output residual and the corresponding inverse Hessian column, yielding a better local optimum than RTN. Second, we propose a modality-aware quantization that differentiates channel importance between vision and language tokens, allowing the quantizer to pre-compute better quantization parameters through a lightweight search. Our method extends weight-compensation methods with minimal overhead while achieving significant performance improvements in low-bit scenarios. Extensive experiments demonstrate that VLM-PTQ achieves competitive results compared to existing methods, effectively compressing models from 1B to 72B parameters on a single GPU.

Abstract:
Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.

Abstract:
Prompt-based continual learning methods effectively mitigate catastrophic forgetting. However, most existing methods assign a fixed set of prompts to each task, completely isolating knowledge across tasks and resulting in suboptimal parameter utilization. To address this, we consider the practical needs of continual learning and propose a prompt-sharing framework. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts to achieve dynamic decoupling and collaborative optimization of task-specific feature representations. Furthermore, we introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates, thereby mitigating inefficient parameter usage and knowledge forgetting. Extensive analysis and empirical results demonstrate that our approach consistently outperforms existing static allocation strategies in effectiveness and efficiency. Code and models will be released.

Abstract:
Given a collection of multi-view images, we perform consistent multi-view editing with a training-free framework using pre-trained 2D editing models and a generative multi-view model. While 2D editing models can independently edit each image in a set of multi-view images of a 3D scene, they do not maintain consistency across views. Existing approaches typically rely on explicit 3D representations to average out inconsistencies, but they suffer from lengthy optimization, instability under sparse-view settings, and blurry results. We address the problem from a different lens, using the 2D editing model to guide a multi-view generative model during diffusion sampling. This is achieved through our novel coupled diffusion sampling process. We concurrently sample two trajectories from both a multi-view image distribution and a 2D edited image distribution, and connect the samples with a coupling term. Effectively, the two models guide each other during sampling, and the resulting sample from the multi-view model remains consistent while satisfying the desired edit. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, and demonstrate its applicability across various model architectures. We further illustrate the effects of coupling on SoTA image and video generation models, highlighting the potential of our method beyond multi-view editing.

Abstract:
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes--without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.

Abstract:
In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.

Abstract:
Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, online SLAM is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. We leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model. Extensive experiments on multiple tasks show the superior performance of our proposed method compared to state-of-the-art approaches.

Abstract:
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting.However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild.To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame.The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly.Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune on feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.

Abstract:
Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce a critique-based human-AI (CHAI) oversight framework, where trained human experts provide correctional critiques to revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms even closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Overall, our results show that precise specification and human-AI oversight are key to achieving professional-level video understanding and generation.

Abstract:
The public accessibility of Large Vision-Language Models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification approaches often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which are easily recognized and removed by adversaries. We first expose this vulnerability through the Semantic Divergence Attack (SDA), which detects and filters fingerprint checks by measuring semantic divergence between a stolen model and a reference model, showing that existing fingerprints are not semantic-preserving, easy to detect and bypass, and lacking robustness. To address these weaknesses, we propose SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework requiring no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which distills text-generation watermark signals--originally designed for text ownership protection rather than model protection--into the visual modality, enabling semantically coherent yet fingerprinted responses. Robust-Fingerprint Optimization (RFO) further simulates worst-case representation perturbations, ensuring resilience to model modifications such as fine-tuning and quantization. Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that SIF achieves superior stealthiness and robustness, providing a practical solution for LVLM copyright protection.

Abstract:
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.

Abstract:
Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding. Code is in the supplementary material.

Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.

Abstract:
Developing generalizable deepfake detectors has become increasingly important with the rapid advancement of generative models. Adapting visual foundation models (VFMs), e.g., CLIP, through parameter-efficient finetuning (PEFT), with only a small subset of parameters updated, has been proven highly effective for generalizable detection. However, the success of "fewer-parameters" training raises an important question: although only a few parameters are tuned, have existing PEFT-based detectors truly exploited the most informative ones while eliminating redundant parameters for better generalization? In this work, we move beyond standard PEFT by proposing a joint optimization strategy that operates at both the layer and token levels. Since latent features across layers capture different semantic abstractions and tokens within the same layer convey varied forgery cues, we propose integrating both layer-level and token-level routing to maximize representational synergy. Specifically, at the layer level, we introduce "Early Layer Pruning", an adaptive truncation mechanism that enables the model to adaptively learn distinct forward depths for different types of instances. At the token level, "Token Selection" is guided by the Spearman rank loss to filter tokens irrelevant to forgery learning, enabling the model to focus on the most discriminative cues. Furthermore, a unified MoE architecture is applied that encourages diversity and thus reduces the potential model's overfitting to specific forgery types. Extensive benchmarking results demonstrate the effectiveness of our designs and show the superior performance of our method over existing state-of-the-arts.

Abstract:
Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding--aligning language with image regions--while their generative counterpart, linguistic-embedded layout-grounded generation(LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.

Abstract:
Modern scene reconstruction methods, such as 3D Gaussian Splatting, deliver photo-realistic novel view synthesis at real-time speeds, yet their adoption in interactive graphics applications has been limited. A major bottleneck is the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While previous research has attempted to impose semantic decomposition on these models, significant challenges remain regarding segmentation quality and consistency. To address this, we introduce Semantic Foam, extending the recently proposed Radiant Foam representations to semantic decomposition tasks.Our approach integrates the natural spatial volumetric decomposition of Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized at the cell level. This explicit structure enables direct spatial regularization, which prevents artifacts caused by occlusion or inconsistent supervision across views - common pitfalls for other point-based representations. Experimental results show that our method achieves superior object-level segmentation performance compared to state-of-the-art methods like Gaussian Grouping and SAGA.

Abstract:
Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions.Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret.To address this, we introduce Structured Image Representation, a method that leverages Scene Graphs as an intermediate representation for robot policy learning.Our approach first constructs a fully connected graph, using 2D or 3D image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a minimal, task-relevant sub-graph that is passed to the action generation model.This process makes our model intrinsically explainable.Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate.We also demonstrate that our graph-based representations are significantly more robust to distractor objects, showing almost no performance degradation, as opposed to image representations.Most importantly, we show that the learned sparse graphs are a powerful tool for introspection.By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases.

Abstract:
Prompt tuning of large-scale vision-language models such as CLIP enables efficienttask adaptation without updating model weights. However, it often leads to poorconfidence calibration and unreliable predictive uncertainty. We address thisproblem by proposing a calibration framework that enhances predictive reliabilitywhile preserving the geometry of the pretrained CLIP embedding space, which isrequired for robust generalization. Our approach extends the standard cross-entropyloss with two complementary regularizers: (1) a mean-variance margin penalty thatstabilizes inter-class logit margins by maximizing their average while minimizingdispersion, mitigating underconfidence and overconfidence spikes; and (2) a textmoment-matching loss that aligns the first and second moments of tuned textembeddings with their frozen CLIP counterparts, preserving semantic dispersioncrucial for generalization. Through extensive experiments across 7 prompt-tuningmethods and 11 diverse datasets, we demonstrate that our approach significantlyreduces the Expected Calibration Error (ECE) compared to competitive calibrationtechniques on both base and novel classes.

Abstract:
Recent advances in learning-based Light Field (LF) image denoising have achieved impressive results. However, these methods rely heavily on large-scale noisy-clean image pairs and often fail to generalize to unseen or complex noise.In this work, we observe that the inherent multi-view consistency of LF images makes it highly unlikely for noise to be coherent across views, offering a more reliable supervisory signal for self-supervised denoising.Building on this insight, we extend the blind-spot principle to the LF domain and propose a novel LF Blind-View denoising Network (LF-BVN). We first introduce a geometric invariance mask that leverages angular redundancy for efficient full-view supervision. To enforce cross-view photometric consistency, we further introduce latent representation volumes and enforce consistency between them.Additionally, we exploit focus stacks to extract latent depth cues from noisy observations, providing further guidance.Extensive experiments show that LF-BVN achieves competitive denoising performance while maintaining strong cross-view consistency without requiring clean data or external supervision.

Abstract:
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models.This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

Abstract:
Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It's also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset will be made publicly available.

Abstract:
Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality (e.g., CLIP, DINOv2, SigLIP2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning stitch positions, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch position or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch positions. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch positions, the stitched model can significantly surpass either constituent model with a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.

Abstract:
This paper presents a generalizable CoSOD framework via mixed content-style modulation, termed CoMCS, to enhance the robustness of the model to unseen domains. The CoMCS, consisting of a mixed content modulator (MCM), a mixed style modulator (MSM), and a collaborative semantic contrast module (SCM), effectively extracts scene structure priors as well as augments the source domain styles to bridge the domain gap between the source and the unseen domains. Specifically, the CoMCS first utilizes the CLIP model to extract conceptual knowledge associated with the semantic classes in the whole scene, resulting in multi-class semantic embeddings that are domain-invariant. Subsequently, the MCM models the semantic relationships between the prototypes of co-salient objects and the multi-class semantic embeddings through the cross-attention mechanism, effectively capturing domain-invariant scene structure priors that aid in reducing scene distribution shift in unseen domains. Meanwhile, to alleviate domain perturbations encountered during testing, the MSM addresses the uncertainty associated with domain shifts by synthesizing feature statistics, such as mean and standard deviation, during training to simulate new stylistic characteristics, thus achieving data augmentation within the source domain. Finally, to reduce the ambiguity of the co-salient object representations within test data from unseen domains, the SCM employs a uniform loss function to ensure that the learned prototypes are uniformly distributed within the hyperspherical space, further enhancing the domain generalization capabilities of the framework. Besides, to further verify the generalization ability of the CoMCS to unseen domains, we construct an unseen-domain benchmark dataset (UND) that selects a variety of image groups with unseen classes from CoCA, CoSOD3k, CoSal2015. Extensive evaluations on the four benchmark datasets demonstrate favorable performance of our CoMCS to a variety of state-of-the-art methods.

Abstract:
Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

Abstract:
Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.

Abstract:
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.

Abstract:
Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.

Abstract:
LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bezier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.

Abstract:
Despite recent progress, diffusion-based video frame interpolation methods still struggle with large, complex motions, resulting in discontinuous motions and inconsistent object appearances across frames. We observe that these limitations arise from both the current full-sequence interpolation strategy and the pixel reconstruction training objective. To solve these challenges, we propose ARVFI, a novel video diffusion-based interpolation method for large complex motion interpolation. Instead of generating all intermediate frames simultaneously, ARVFI interpolates in an autoregressive manner from two input frames to the middle ones. Thus, ARVFI interpolates a frame that is further away from the inputs based on all previous interpolation results, resulting in smoother motion transitions and better temporal consistency. Additionally, ARVFI further utilizes DINOv3 features as motion representations, which provide high-level semantics for accurate motion estimation, compared with a simple pixel-level loss. With all these designs, ARVFI generates the intermediate DINOv3 features first and then the frames with an effective conditional generation method for frames. Our ARVFI consistently outperforms existing methods with superior interpolation accuracy and visual quality.

Abstract:
Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

Abstract:
Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.

Abstract:
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and provides the first investigation of learning transferable knowledge for complex, long-horizon tasks directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamics-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire transferable manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN, demonstrating strong cross-domain generalization. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models open-sourced for further research.

Abstract:
We present AMB3R, a multi-view feed-forward model for metric-scale dense 3D reconstruction that addresses diverse 3D vision tasks. The key idea is to employ a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. We further introduce AMB3R-VO and AMB3R-SfM, two training-free, model-agnostic pipelines that extend multi-view networks to uncalibrated visual odometry (online) or structure from motion for any number of images without test-time optimization. Compared to prior 3D foundation models, AMB3R achieves state-of-the-art performance in camera pose, depth, metric-scale estimation, and 3D reconstruction, while AMB3R-VO and AMB3R-SfM even surpass optimization-based SLAM and SfM systems with dense reconstruction priors on common benchmarks.

Abstract:
Non-transferable learning (NTL) aims to enforce usage restrictions by limiting a model's generalization on target-domain data while maintaining its utility on the source domain. Current approaches face three major challenges: (1) low training efficiency due to retraining of the backbone network, (2) low inference efficiency, and (3) a rigid reliance on a shared, non-adaptive backbone network spanning both source and target domains. This shared setup, which aims to maximize source-domain performance and minimize target-domain performance, often introduces optimization conflicts due to overlapping class categories across source and target domains. In this paper, we propose a novel and efficient NTL approach using a dynamic Early-Exit Network, named ENL-DEE, which leverages Bayesian theory and dynamic neural networks to address these limitations. Our custom loss function guides source-domain data to exit at later stages of the network, maximizing model utility, while target-domain data exits earlier with non-semantic features, ensuring limited transferability. ENL-DEE offers three key advantages: (1) it enhances training efficiency by optimizing only the parameters of dynamic exit classifiers, bypassing the need to retrain the backbone; (2) it improves inference efficiency as data exits at various exit classifiers in the network; and (3) it resolves optimization conflicts by using distinct parameter sets for source and target domains, achieving higher performance on the source domain and lower performance on the target domain, thereby strengthening NTL. Extensive experiments across diverse datasets and model architectures validate the scalability, efficiency, and effectiveness of our approach.

Abstract:
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation, bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83 p.p. with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.

Abstract:
Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories.However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead.While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy.To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation.We first train a trajectory adapter on a multi-step video generator for precise trajectory control.Then, we distill the generator into a few-step version to accelerate video generation.Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.

Abstract:
Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.

Abstract:
Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into the four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.

Abstract:
Powerful 3D representations such as DUSt3R's invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content by also representing scene motion. However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimisation when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to set up DPMs for videos to optimise representational power, facilitate neural prediction, and enable reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data suffices to adapt it into an effective V-DPM predictor. This yields state-of-the-art 3D and 4D reconstruction in dynamic settings. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the 3D motion of every point in the scene.

Abstract:
Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives--points, lines, and shapes--whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder, where the encoder parses diagrams into structured primitives and their relations, and the decoder compiles them back into executable code that reproduces the original diagrams. Central to this architecture is SymHPR (Symbolic Hierarchical Process Reward Modeling), which applies hierarchical step-level rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since reinforcement learning collapses under vanilla GRPO, our proposed stabilized strategies balance exploration and exploitation. Built upon the symbolic learner, we design two pipelines: one replaces the MLLM's visual encoder with the symbolic encoder; the other feeds interpretable symbolic parses directly to the LLM to guide reasoning. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, and improving by +13% on the MathGlance perception benchmark, and by +3% on MathVerse and GeoQA reasoning benchmarks.

Abstract:
As super-resolution (SR) techniques advance, we observe a growing distrust of evaluation metrics in recent SR research. An inconsistency often emerges between certain evaluation criteria and human perceptual preference. Although current SR research employs varying metrics to evaluate SR performance, it remains underexplored how robust and reliable these metrics actually are. To bridge this gap, we conduct a comprehensive analysis of widely used image quality metrics, examining their consistency with human perception when evaluating state-of-the-art SR models. We show that some metrics exhibit only limited--or even negative--correlation with human preferences. We further identify several intrinsic challenges in SR evaluation that compromise the effectiveness of both full-reference (FR) and no-reference (NR) image quality assessment (IQA) frameworks. To address these issues, we propose a simple yet effective Relative Quality Index (RQI) framework, which assesses the relative quality discrepancy between image pairs. Our framework enables easy integration and notable improvements for existing IQA metrics in SR evaluation. Moreover, it can be utilized as a valuable training guide for SR models, enabling the generation of images with more realistic details while maintaining structural fidelity.

Abstract:
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Empirical results on six benchmarks show that ThinkDeeper outperforms SOTA baselines on DrivePilot, MoCAD, Talk2Car, and RefCOCO/+/g. Notably, it also exhibits strong robustness and efficiency in challenging scenarios (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

Abstract:
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscopes images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.

Abstract:
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2xand includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

Abstract:
In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise -- issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D detection (MPA3D) framework to effectively integrate the mapping priors with the distinct modalities of sensor data. Our extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, and proving the effectiveness of using scalable, reconstructed scene priors to enhance 3D detection.

Abstract:
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.

Abstract:
We present PaNDaS, a novel deep learning framework for Partial Non-Rigid Deformations and interpolations of Surfaces (PaNDaS). PaNDaS learns a per-face feature field on the source mesh and fuses it with a global encoding of the target. A deformation generator predicts a Jacobian field and recovers a smooth displacement, enabling precise regional control, pose mixing, and transferable local edits. Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Across various human body part datasets, PaNDaS achieves state-of-the-art interpolation accuracy and stronger locality than methods based on global shape codes or handles, while remaining robust to remeshing. We demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations.

Abstract:
Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.

Abstract:
Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations--pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop ( 1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.

Abstract:
Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling( S^3 ) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.

Abstract:
The recovery of training data from generative models ("model inversion") has been extensively studied for diffusion models in the data domain as a memorization/overfitting phenomenon. Latent diffusion models (LDMs), which operate on the latent codes from encoder/decoder pairs, have been robust to prior inversion methods. In this work we describe two key findings: (1) the diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric; (2) even within a single latent code, memorization contributions are unequal across representation dimensions. Our proposed method to ranks latent dimensions by their contribution to the decoder pullback metric, which in turn identifies dimensions that contribute to memorization. For score-based membership inference, a sub-task of model inversion, we find that removing less-memorizing dimensions improves performance on all tested methods and datasets, with average AUROC gains of 1-4% and substantial increases in TPR@1%FPR (1-32%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokemon, MS-COCO, and Flickr. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.

Abstract:
Event-based Action Recognition (EAR) provides a promising pathway for understanding dynamic behaviors under challenging conditions. Recent progress in vision-language models has introduced a cross-modal learning paradigm into EAR, enabling models to associate event streams with textual semantics for enhancing conceptual understanding. However, existing methods typically overlook the intrinsic polarity-driven motion cues that are fundamental to event data, leading to suboptimal spatiotemporal representations. To address this limitation, we propose a POlarity Knowledge Enhanced framework (POKER), which explicitly incorporates event polarity-aware motion knowledge across visual and textual modalities.POKER consists of two synergistic components: Polarity Motion Capturer (PMC) and Polarity Motion Reasoner (PMR). Specifically, PMC decouples positive and negative polarities to capture polarity-sensitive motion cues, while PMR semantically analyzes polarity-induced motion dynamics via large language models. Through the polarity alignment, POKER couples semantic reasoning with visual dynamics, achieving more discriminative representations. Extensive experiments on multiple benchmarks demonstrate that POKER enhances performance across diverse event representations.

Abstract:
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Our code and model weights will be made publicly available.

Abstract:
We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website: https://www.fujii.nuee.nagoya-u.ac.jp/Research/EventLF/.

Abstract:
The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically-sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks.

Abstract:
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities.To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start.Moreover, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks.Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

Abstract:
Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatiotemporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9xfaster inference, while retaining most of the gains.

Abstract:
Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.

Abstract:
Domain adaptive object detection (DAOD) aims to generalize an object detector trained on a source domain to a target domain, where the domain gap degrades the adaptability. Recently, large-scale vision foundation models (VFMs), pretrained on web-scale datasets, exhibit such powerful generalization capabilities that many approaches leverage them to bridge the domain gap. However, their generalized knowledge is not tailored to the specific domain, which makes it difficult to offer precise guidance in the target domain. In this paper, we propose an Expert-Teacher-Student collaborative learning (ETS) framework to synergize the generalized knowledge from VFMs with the domain-specific knowledge from the teacher model. Concretely, we first design an Expert-Teacher Collaborative Teaching (ETCT) module, which leverages the complementary knowledge of expert and teacher models to collaboratively generate high-quality pseudo labels for supervising student model learning. Second, we devise an Expert-Teacher Joint Consolidating (ETJC) module, which introduces class-wise prototype alignment among expert, teacher, and student models, to jointly consolidate generalized and domain-specific knowledge within the student model. ETS leverages VFMs as the expert model in a free lunch manner, thus avoiding significant additional training costs. Extensive experiments exhibit that our method outperforms the existing SOTA methods on three benchmarks.Our code is available in the supplementary materials.

Abstract:
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting.We tackle these issues by introducing:(a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception;(b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and(c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline.Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Abstract:
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.

Abstract:
We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models, that decouples controls and visual domain. The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability, without fitting to specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.

Abstract:
Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. And then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.

Abstract:
Pan-sharpening, a fundamental image preprocessing technique in remote sensing, aims to generate spatially and spectrally enriched multispectral imagery by integrating complementary information from texture-rich panchromatic (PAN) images and paired low-resolution multispectral (LRMS) counterparts. Although recent generative diffusion models have achieved impressive fusion quality, these performance gains often come with substantial computational costs, rendering them impractical for resource-constrained scenarios common in remote sensing applications. This work introduces a function-space diffusion model built upon a neural operator architecture that achieves compelling performance with promising efficiency. Specifically, our framework replaces the standard attention-based denoising backbone with a Galerkin-type neural operator, significantly reducing computational complexity while maintaining excellent representational capacity. Furthermore, by explicitly integrating pixel-wise spatial-spectral consistency residuals into each reverse diffusion step, our method establishes a fine-grained, closed-loop guidance mechanism that dynamically calibrates spatial details and spectral fidelity throughout the generation process. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach over state-of-the-art methods.

Abstract:
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient `specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable.In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks--classification, part segmentation, and few-shot scenarios--approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

Abstract:
Spike cameras are a novel class of neuromorphic vision sensors that capture scene dynamics with ultra-high temporal resolution via spike planes. While recent methods have addressed motion blur and noise in spike-based reconstruction, defocus blur caused by shallow depth of field or lens adjustment delays remains a critical yet underexplored issue in real-world applications such as autonomous driving. In this work, we present DeSpike, the first end-to-end defocus removal framework specifically designed for spike cameras. Our method begins by explicitly modeling the defocus formation process using a physics-inspired thin-lens approximation to simulate spike responses under optical blur. Guided by this formulation, DeSpike employs multi-temporal-scale integrate-and-fire (IF) neurons to compensate for FPN and extract defocus-aware features from spike streams. These features are then processed by a physics-informed deblurring module constructed from learnable discrete PSF priors. To address spatially variant blur, we introduce a Transformer-based fusion mechanism that adaptively weighs multi-scale deblurring results through attention across defocus levels. Finally, a coarse-to-fine iterative refinement stage combines spike features and PSF priors for progressive restoration. Extensive experiments on both synthetic and real-world defocused spike datasets demonstrate that our method achieves superior performance over state-of-the-art deblurring approaches in terms of structural fidelity, perceptual sharpness, and contrast, setting a new benchmark for defocus-aware spike-based image reconstruction.

Abstract:
Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.

Abstract:
3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.

Abstract:
Fashion design aims to express a designer's creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.

Abstract:
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.

Abstract:
Self-supervised pre-training has driven rapid progress in foundation models for language, 2D images, and video, yet remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised 3D vision model that learns geometrically grounded representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer, which infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are 3D-aware. To ensure convergence and scalability, we introduce a fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources without any supervision. Experiments show that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) on 3D downstream tasks, establishing E-RayZer as a promising paradigm for spatial visual pre-training.

Abstract:
Visual Autoregressive (VAR) models have demonstrated competitive performance with diffusion models in image generation by adopting a "next-scale" prediction paradigm that significantly reduces inference steps. However, VAR's progressive multi-scale generation leads to severe memory overhead due to KV cache accumulation across all scales, limiting practical deployment. Existing solutions either require training and deploying multiple specialized models or sacrifice generation quality.We observe a critical scale-depth asymmetric dependency in VAR: small scales (low-resolution tokens) are highly sensitive to network depth and require deep layers to capture global semantic information, while large scales (high-resolution tokens) exhibit remarkable robustness to depth reduction.Motivated by this insight, we propose VARiant, a unified supernet framework that enables dynamic depth adjustment within a single model through parameter sharing. Our approach employs an even-spacing layer selection strategy to construct quality-preserving subnetworks, and introduces a dynamic-ratio progressive training strategy that gradually transitions from joint optimization (full network to subnetwork ratio 2:8) to subnetwork optimization (ratio 10:0), effectively resolving the inherent optimization conflicts between full network and subnetworks in supernet training.Extensive experiments on ImageNet demonstrate that our method achieves Pareto-optimal trade-offs between generation quality and inference efficiency: by using full depth (30 layers) for the first 7 scales and a 16-layer subnetwork (50% depth) for subsequent scales, we obtain 50% cache reduction and 1.8x inference speedup with minimal quality loss (FID increases by only 0.3).Unlike approaches requiring deployment of multiple models, our single-model solution eliminates deployment complexity, supports zero-cost runtime depth switching, and seamlessly integrates into standard transformer inference frameworks, making it highly practical for resource-constrained scenarios.

Abstract:
Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guiding, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate UniMatch consistently outperforms competing methods in various challenging scenarios.

Abstract:
Incremental Object Detection (IOD) enables AI systems to continuously acquire new object classes while preserving knowledge of previously learned ones, an ability essential for deployment in dynamic, real-world environments. Existing IOD methods typically rely on knowledge distillation to mitigate catastrophic forgetting. However, the tight coupling between the student model's detection head and backbone causes distillation gradients to conflict with new-class supervision at the head, injecting head-specific bias into the backbone and ultimately weakening distillation effectiveness. To address this issue, we propose a decoupled training mechanism for the model's backbone and classification head. Specifically, we introduce the Future-aware decoupled Cross-head Distillation (FaCHD) method, which utilizes two frozen complementary teachers (historical and intermediate teachers) to decode the student's ROI features for cross-head distillation. This strategy implicitly alleviates prediction conflicts caused by detection-head bias and provides richer task-relevant guidance, thereby improving distillation efficiency. To further address the detection head bias and model recency problem, we propose a Prototype Semantic Drift Compensation module, which recalibrates multi-granularity prototypes of old classes, effectively correcting semantic drift and enhancing the stability of the detection head. Extensive experiments on two standard IOD benchmarks demonstrate the effectiveness and superiority of the proposed method.

Abstract:
Conventional monocular depth estimators suffer from visual ambiguities and nuisances. We demonstrate that language can improve the fidelity of estimates by providing additional information through text as a condition, thereby reducing the solution space for depth estimates. This conditional distribution is learned during text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that reflect textual descriptions, the model implicitly encodes object shapes and their spatial relationships, comprising the 3-dimensional (3D) scene structure. We investigate the benefits of integrating text descriptions into the training and inference of the diffusion-based monocular depth estimators (MDE). We experiment with three different diffusion-based MDEs, namely Marigold, Lotus, and E2E-FT, and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy of integrating text into MDEs improves the overall accuracy of depth estimates, especially in small areas. The effect is also targeted in that improvements are more pronounced in specific regions described in the text. Naturally, this lends to iterative refinement of the depth estimate by providing additional descriptions of the 3D scene. Separately, we find that incorporating text can accelerate convergence of training and the inference diffusion trajectory.

Abstract:
Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance.In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC's reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios.Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variant exhibit fundamental misalignments in estimating the importance of weights, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies the importance estimation of EWC. Specifically, reversing the logit values during the calculation of the FIM can effectively prevent both the gradient vanishing and the redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).

Abstract:
Bridging the modality gap between infrared and visible imagery is critical for cross-modal understanding and for enriching multimodal benchmarks. However, existing approaches remain confined to one-to-one mappings and are typically evaluated on unidirectional or closed-set scenarios. To address this challenge, we present SynthRGB-T, a unified framework for diverse and bidirectional image translation. Specifically, we formulate image translation as a vision-language guided denoising diffusion process, enabling flexible conditioning and open-world generalization. To enhance semantic alignment, a Visual Grounding Pipeline (VGP) is introduced to exploit the world knowledge of foundation models for fine-grained translation guidance. During the diffusion process, we propose to adopt a decoupling injection strategy to alleviate interference among multiple guidance. In addition, a Dual Conditional Cross-Attention (DCCA) module is designed to facilitate collaborative representation learning in latent space. SynthRGB-T is simple and versatile--capable of synthesizing diverse, high-fidelity data that substantially extends multimodal resources within the community. Comprehensive evaluations on multiple real-world benchmarks confirm that SynthRGB-T delivers superior performance and enhanced visual diversity over existing approaches.

Abstract:
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization--over lighting, materials, camera parameters, image quality, and sensor delays--with real-to-sim alignment of the dexterous hand and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.

Abstract:
Occluded visible-infrared person re-identification (Occluded VI-ReID) remains difficult due to modality heterogeneity and occlusions, both of which break structural consistency and weaken cross-modality feature alignment. Existing methods rely mainly on spatial-domain cues (such as local body parts and salient patches), but their discriminability degrades severely under varying imaging conditions or partial visibility. To address these issues, we introduce a spatial-frequency collaborative perspective that offers global perception and cross-location consistency. Specifically, we propose a Spatial-Frequency Collaborative Learning (SFCL) framework that uses frequency information to complement spatial representations. SFCL comprises a Cross-Modality Frequency Alignment Module (CFAM), a Spatial-Frequency Interaction Module (SFIM), and a Frequency-Aware Discriminative (FAD) loss. The CFAM models the spectral features of visible/infrared images in the frequency domain, establishing modality-consistent spectral priors. The SFIM injects these priors into spatial features, promoting dual-domain interaction and complementary representations of spatial and frequency semantics. In addition, the FAD loss jointly enforces cross-modality frequency alignment and semantic consistency, thus enhancing robustness and discriminability under occlusions. For real-occlusion evaluation, we construct two occluded datasets, Occ-SYSU-MM01 and Occ-RegDB, on which SFCL outperforms the state-of-the-art.

Abstract:
Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from grasp distribution. Our energy-based model captures the multi-modality nature of grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasp from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code of our paper will be made public when the paper is released to the public.

Abstract:
Recent advances in generative models have enabled the creation of high-fidelity human faces, yet constructing reliable virtual identities that preserve user privacy while supporting consistent and verifiable identity assignment remains challenging. In this paper, we propose a diffusion-based framework for generating traceable virtual identities that maintains stable identity semantics while preserving pose and expression. Our framework couples a virtual identity sampler that generates diverse yet consistent identity embeddings with a 3D geometric and expression conditioning module that preserves the pose and non-identity characteristics of the input face. In addition, we incorporate a lightweight latent watermarking mechanism that embeds an imperceptible identity signature during generation, enabling a user to verify ownership of the resulting virtual identity through a secure token without revealing their real facial appearance. Quantitative evaluations demonstrate that our method achieves high virtual identity consistency, strong pose and expression fidelity, and improved anonymity compared with prior works. These results validate the effectiveness of integrating virtual identity sampling, geometric conditioning, and latent watermarking into a single generative framework, and highlight the practical potential of our solution for constructing privacy-aware virtual identities.

Abstract:
The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal fine-tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5 times in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.

Abstract:
Ensemble clustering aims to derive a consensus partition from multiple base clustering results. Anchor-based methods construct compact similarity representations via anchors, substantially improving computational efficiency. However, when outliers contaminate the data, reconstructing the base clustering results often yields biased anchors. These biased anchors degrade the quality of the anchor similarity matrix and lead to a decline in clustering accuracy. To address this issue, we propose a novel method called large-scale robust enhanced ensemble clustering via outlier decoupling (RANGE). Specifically, RANGE first converts the base clustering results into an initial bipartite graph. To enhance the reliability of this bipartite graph, RANGE designs a high-order fuzzy enhancement strategy (HFES) specifically for initial bipartite graphs. Next, RANGE introduces an anchor matrix to improve the computational efficiency of the bipartite graph reconstruction process. To improve robustness, RANGE decomposes the anchor similarity matrix into a clean component and a residual outlier component in the anchor space. A global cross-correlation penalty is imposed to suppress leakage between the two components, while a row-wise l_ 2,1 -norm on the residual part confines contamination to a small number of residual anchor directions. Moreover, by applying outlier detectors to the decoupled outlier structure, RANGE can be extended to the outlier detection task. Consequently, RANGE forms a cross-task general framework. Extensive experiments indicate that RANGE delivers superior performance in both clustering validity and outlier detection.

Abstract:
Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.

Abstract:
Missing modalities in RGBT tracking often lead to incomplete and unstable multimodal feature representations that greatly degrade the performance. Existing methods typically attempt to recover missing modalities from available ones, but the quality of data generated in challenging scenarios might be unsatisfactory. In addition, current approaches exhibit limited flexibility in processing both missing and complete data. To overcome these limitations, we propose a Spatio-temporal Conditional Denoising Transformer (SCDT), which integrates the spatial cues and the temporal context to adaptively perform information reconstruction of missing modalities and feature enhancement of weak modalities in a unified framework, for robust modality-missing RGBT tracking. In particular, SCDT leverages the short-term temporal cues from recent historical frames to capture the fine-grained temporal correlations and the long-term temporal cues encoding modality evolution to capture the global context. By jointly exploiting long short-term temporal contexts as the conditions, SCDT progressively guides noisy features of available modalities to learn reliable and temporally consistent multimodal representations. Furthermore, SCDT introduces a noise-modulated adaptation mechanism that dynamically adjusts its behavior according to the modal availability, enabling a single framework to unify feature learning under both modality-missing and complete scenarios without changing the architecture or parameters. Extensive experiments on three public benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods.

Abstract:
Monocular depth estimation is essential for applications such as robotics. The complementary characteristics of event and image modalities have inspired fusion-based methods for robust depth estimation. However, existing methods rely on convolutional or attention-based architectures, which either have limited capacity for long-range modeling or incur high computational cost, making them less suitable for depth estimation over long sequences. Moreover, effective image-event fusion remains challenging, since most methods directly fuse features without addressing the domain gap and representational differences between raw events and images, resulting in semantic bias and degraded performance. In this work, we propose AIMDepth, an Asymmetric Image-Event Mamba framework for monocular depth estimation, built on state space models for linear complexity and accurate prediction. To alleviate input-domain misalignment, we introduce a Spectral Cross-modal Prior Guidance (SCPG) for bidirectional prior injection at the input level. To reduce the imbalance between sparse events and dense images, we design an asymmetric modal-aware Encoder (AME) with separate encoding paths and feature-level alignment. We further develop a Modality-interactive Local Refinement (ModiLocal) to enable hierarchical interaction and fine-grained alignment. Experiments on public datasets show that AIMDepth achieves state-of-the-art performance in complex environments.

Abstract:
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.

Abstract:
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process--early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.

Abstract:
Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or perform joint reasoning across crucial modalities such as video and audio. To address this, we introduce Hippocampal Sensing (HippoSense), a learning paradigm inspired by hippocampus that shifts the focus from temporal predictive sensing to cross-modal predictive sensing. The core objective of HippoSense is to train the model to anticipate current status's audio-caption summarizations from video and vice versa. We present HippoVLM, a video VLM that operationalizes this paradigm. Instead of passively ingesting all data, HippoVLM actively reasons about its information needs using Chain-of-Thought (CoT). Our training process is twofold: we first finetune HippoVLM with HippoSense, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. This approach proves highly effective: despite their significantly smaller size, our HippoVLM model achieve competitive performance to massive MLLMs like GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.

Abstract:
Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations.(1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup.Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization.

Abstract:
In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model's ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.

Abstract:
In recent years, contrastive multi-view clustering has achieved remarkable performance improvements. However, existing methods still face two key challenges: (1) reliance on a predefined number of clusters k, which is often unknown in real-world scenarios; and (2) contrastive learning might cause representation degeneration when thecollected multiple views inherently have inconsistent semantic information . To address these issues, we propose a novel framework--Reliable Clustering Number Estimation for Contrastive Multi-View Clustering (RCNMC). RCNMC consists of a Semantics-Aware Contrastive Learning module and a Reinforcement Learning-based Cluster Number Learning module. Specifically, the Semantics-Aware Contrastive Learning module first measures the discrepancy between pairwise representations and adaptively strengthens useful pairwise views while weakening unreliable ones, thereby alleviating representation degeneration. The Reinforcement Learning-based Cluster Number Learning module infers the optimal number of clusters in an unsupervised manner by using intra-cluster and inter-cluster distances as a reward-driven strategy. The two modules complement each other, making RCNMC more suitable for complex multi-view clustering tasks in real-world scenarios. Extensive experiments on multiple benchmark datasets demonstrate that RCNMC significantly outperforms existing state-of-the-art methods.

Abstract:
Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.

Abstract:
Skeleton generation is essential for animating 3D assets, but current deep learning methods remain limited: they cannot handle the growing structural complexity of modern models and offer minimal controllability, creating a major bottleneck for real-world animation workflows. To address this, we propose an animator-centric SG framework that achieves high-quality skeleton prediction on complex inputs while providing intuitive control handles. Our contributions are threefold. First, we curate a large-scale dataset of 82,633 rigged meshes with diverse and complicated structures. Second, we introduce a novel semantic-aware tokenization scheme for auto-regressive modeling. This scheme effectively complements purely geometric prior methods by subdividing bones into semantically meaningful groups, thereby enhancing robustness to structural complexity and enabling a key control mechanism. Third, we design a learnable density interval module that allows animators to exert soft, direct control over bone density. Extensive experiments demonstrate that our framework not only generates high-quality skeletons for challenging inputs but also successfully fulfills two critical requirements from professional animators. Our work paves the way for more flexible and efficient animation pipelines.

Abstract:
Recent advances in multimodal large language models (MLLMs) have shown impressive progress in integrating visual and linguistic understanding. However, most existing MLLMs inject visual information into the input space of large language models (LLMs), which substantially increases context length and computational overhead, while often disrupting pretrained linguistic priors by forcing the LLM to optimize on vision-dominant multimodal sequences. In this work, we propose a rotation-based vision injection paradigm that aligns visual information with the parameter space of LLMs. Visual semantics are encoded as rotation matrices and applied directly to the pretrained parameters. This parameter-space injection eliminates the need for long input sequences, thus avoiding the quadratic computational overhead inherent in input-space injection. Besides, it preserves the linguistic competence of the LLM by maintaining the intrinsic geometric structure of the pretrained parameters. Building upon this paradigm, we develop ROSE, a 7B MLLM that achieves fine-grained vision-language alignment with remarkable computational efficiency. Extensive experiments across 12 multimodal benchmarks show that ROSE delivers superior or competitive performance compared with leading models.At comparable accuracy, ROSE reduces FLOPs by 80.7% and inference latency by 56.4% relative to Qwen2.5-VL-7B, demonstrating the effectiveness and scalability. The code and model weights will be publicly released.

Abstract:
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets often focus on generation using a single or a few reference images, which prevents us from measuring progress in model performance or identifying weaknesses when following instructions with a larger number of references. In addition, their task definitions are still vague, limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the challenges inherent in combining heterogeneous references. To address this gap, we introduce MultiBanana, which is designed to assess the edge of model capabilities by widely covering problems specific to multi-reference settings: (1) varying the number of references (up to 8), (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their respective performances, typical failure modes, and areas for improvement.

Abstract:
The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability.In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In parallel, we build a scalable data creation pipeline with over 200K 3D reconstructive instruction-tuning question-answer pairs. To evaluate temporal reasoning, we further introduce the Vision-Spatial-Temporal Intelligence benchmark (VSTI-Bench), which contains over 138.6K question-answer pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments show that VLM-3R supports robust visual-spatial reasoning and improves the understanding of temporal 3D context changes, enabling monocular 3D spatial assistance and embodied reasoning.

Abstract:
We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40x training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of objects in different representations including meshes and Gaussian splats, as well as the application of our method in the downstream task of robot simulation.

Abstract:
Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce Symmetry Exploration, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, DreamSAC, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.

Abstract:
Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.

Abstract:
Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., "Move the circle in a spiral path"), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at https://majiaju.io/trajectory-self-consistency

Abstract:
Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. Hence, we propose Cross-Level Collaborative Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. And then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.

Abstract:
Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the p-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null-space variance is captured by p modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.

Abstract:
Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are inherently coupled with network-specific parameters, inevitably introducing architectural bias and compromising generalization; or (ii) DNN-free, which utilize heuristics that lack rigorous theoretical guarantees for stability and accuracy. Neither approach explicitly constrains distributional equivalence of the representative subsets, largely because continuous distribution matching is broadly considered inapplicable to discrete dataset sampling. Furthermore, prevalent distribution metrics (e.g., MSE, KL, MMD, and CE) are often incapable of accurately capturing higher-order moment differences. These deficiencies lead to suboptimal coreset performance, preventing the selected coreset from being truly equivalent to the original dataset.In this work, we propose FAST (Frequency-domain Aligned Sampling via Topology), the first DNN-free distribution-matching coreset selection framework that formulates coreset selection as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information (i.e., all moments and intrinsic correlations) in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high. This preserves global structures before refining local details, enabling accurate matching with few frequencies while preventing overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2xaverage speedup even on CPU with 1.7GB of memory, underscoring its high performance and energy efficiency.

Abstract:
Pose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation under sparse views, we develop a wavelet-based pose estimator that integrates low-frequency structural cues and high-frequency details to enhance both initialization and refinement accuracy. Finally, we introduce a wavelet difference-aware anomaly detector that computes frequency-domain anomaly scores, improving localization robustness against pose and geometric variations. By integrating these strategies, Wave-Pose3D achieves robust and accurate anomaly localization under sparse views. Extensive experiments validate that the proposed approach achieves state-of-the-art performance under 10% and 20% sparse-view configurations.

Abstract:
OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions, including interleaved texts, images, and videos. To fully parameterize vector animations for flexible motion and visual content control, we seek help from the Lottie representation, which encodes both shapes and animated behaviors in a single JSON file. Building upon a pretrained vision-language model (VLM), OmniLottie produces vivid, semantically aligned vector animations that adhere closely to multi-modal conditions. To avoid the complexity and irregularity of raw JSON structures, we introduce a dedicated Lottie tokenizer that transforms Lottie files into structured sequences of function calls representing shapes, animation commands, and their parameters. This design enables the model to directly learn the underlying shape and animation priors from data, substantially improving generation stability and controllability. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Leveraging the well-designed tokenizer and our newly established dataset, OmniLottie demonstrates strong multi-modal conditional generation capabilities using a simple next-token prediction objective. For qualitative results, please refer to the generated animations rendered through standard Lottie players on the supplementary website.

Abstract:
Understanding how transcriptomic programs shape tissue morphology remains a central challenge in computational pathology. Gene-to-WSI tile synthesis offers a principled generative framework to translate molecular profiles into histological images. However, most existing methods compress RNA-Seq into a single global embedding injected once at initialization, an oversimplified design that weakens transcriptomic signals and induces spurious, non-biological associations. We present GeneVAR, an Autoregressive Gene-to-WSI model that reformulates synthesis as an iterative, coarse-to-fine generative process. At its core is a novel Causal MeanFlow module, which leverages counterfactual interventions to suppress non-biological variations (e.g., staining or contrast artifacts) and enforce an artifact-invariant generation process. Concurrently, it reinforces transcriptome-informed guidance across multiple stages, preserving biological fidelity throughout the generative trajectory. Combined with a b-VAE for compact gene embeddings and a multi-scale vector quantizer for discrete morphology representation, GeneVAR generates H&E-stained WSI tiles that are both visually realistic and transcriptomically faithful. Extensive experiments across five TCGA cancer benchmarks demonstrate consistent state-of-the-art performance, surpassing prior methods in both generative fidelity and downstream classification accuracy.

Abstract:
Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: a spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and a prediction-level distillation that enhances spatial localization by learning the teacher's capability of accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed, with EATrack-DeiT improving average success rate by 1.2% over the previous SOTA while running at 241.9 FPS on GPU.

Abstract:
Modern video-language models are typically trained on videos downsampled to low frames-per-second (FPS), and the most commonly used evaluation benchmarks are designed for low-FPS input as well. To address this shortcoming, we present FPS-Bench, a large video question-answering benchmark designed to evaluate VLMs' capabilities to understand video at high-frame rates. We introduce a new metric, the minimum frames-per-second (minFPS), which measures the minimum frame-rate required to solve a given question. While existing benchmarks require <1 minFPS, we rigorously curate more than 1000 questions from a diverse source of videos and manually verify minFPS for each example, leading to a benchmark that requires watching videos at on average 7 FPS to solve. Our evaluation of several state-of-the-art VLMs shows that they are severely lacking, achieving QA accuracy of 30% in the FPS-Bench multiple-choice task, while humans achieve 72% accuracy.

Abstract:
Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, the visible-region point cloud are easy to obtain--from active sensors such as LiDAR or from feed-forward predictors like VGGT--offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with a point cloud priors tailored input formulation. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors.In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input.Experiments on both single-object and multi-object scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

Abstract:
3D visual grounding task aims to accurately identify and grounding target objects in 3D space based on natural language descriptions,where the effective exploitation of relative relations between the target and anchor is crucial.However, in existing methods, relative relations are often tightly entangled with entity semantics. This tight coupling encourages models to rely on semantic shortcuts from entity names, making it difficult to maintain good generalization under multi-view and complex multi-object scenarios.To address this, we propose an object-relation decoupling framework that treats target-anchor relations as first-class geometric and semantic primitives and models them explicitly.First, we construct a scene-level relative geometric representation that encodes the direction and distance between the target and anchor, and introduce a scene-level hyper-object token as a unified prior for scale and viewpoint.Second, we develop a predicate-decoupled cross-modal alignment strategy that preserves only predicates carrying spatial relational semantics while masking out all other tokens, thereby suppressing semantic leakage from entity names.Finally, we design an anchor-guided regression module that predicts auxiliary anchors and samples their features to guide the model in learning entity semantics from text, explicitly injecting target-anchor priors and effectively resolving ambiguities in complex multi-object scenes.Extensive experiments on multiple 3D visual grounding benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches and exhibits strong robustness and generalization under challenging multi-view and relation-intensive settings.

Abstract:
Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead.

Abstract:
Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance.To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time.Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.

Abstract:
Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication-heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA's design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.

Abstract:
Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflective capabilities, we propose a rollout-filter-label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.

Abstract:
Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HierAmp to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HierAmp consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.

Abstract:
Continual learning for video--language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.

Abstract:
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.

Abstract:
The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations.To address these limitations, we propose \mname , a novel feature-level alignment regularization method. \mname explicitly supervises intermediate point cloud representations to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, \mname achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation.Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of \mname .

Abstract:
Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.

Abstract:
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-\boldsymbol \pi , a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-\pi formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-\pi introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling.The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency. VA-\pi enables rapid adaptation of existing AR generators, without neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744).

Abstract:
Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves resilience to viewpoint shifts and downstream learning performance.

Abstract:
Advances in generative AI (GenAI) have increasingly complicated the identification of synthetic images, prompting the proposal of numerous zero-/few-shot detection methods to counter unknown GenAI better. However, we observe that existing detectors often misclassify synthetic images with physical transformations (e.g., print+scan) as real. The essence of this observation lies in: should images remapped from the physical world to digital space still be categorized as "Synthetic"? Furthermore, the definition of what constitutes real and synthetic images urgently needs to be clarified. We first boldly propose that the authenticity of an image depends on whether it originates from the physical world, i.e., it is necessary to verify the original correlation between the digital image and the physical world. To this end, we first analyze the physical-to-digital mapping process: illumination signals are captured by camera sensors as RAW data, which is then converted into RGB data via camera internal parameters. This process embodies unique physical cues inherent to real scenes. Based on this, we propose a novel forensic feature termed alignment trace, which is constructed by modeling a shared RAW-RGB feature space. This trace captures the inherent parameter correlations of real images in the physical-to-digital conversion process, thereby indirectly verifying the physical origin of the image. Experiments demonstrate that our method achieves state-of-the-art zero-shot detection using only real RAW-RGB data pairs. When additional prior knowledge is provided, the method can be easily fine-tuned to achieve better cross-domain detection performance. We hope this work provides a new baseline for zero-shot synthetic detection and, more significantly, inspires the forensics community to explore the essential distinctions between real and synthetic images.

Abstract:
Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as condition to preserve the makeup style of reference image in the generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of reference image are injected to the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (i.e., eyes, mouth, etc) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject identity of source image and makeup of reference image to the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.

Abstract:
Lookup-free quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization (\Lambda_ 24 -SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.

Abstract:
Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process.This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C^2FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C^2FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.

Abstract:
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, \pi^3 and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than 3xwhile maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, \pi^3, and MapAnything, while substantially improving scalability to large image collections.

Abstract:
Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet -- (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability.(2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks.To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model will be released publicly.

Abstract:
Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the `one-to-many mapping' strategy in rectified flow that allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design `adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce `cyclic rectified flow' to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.

Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in Chain-of-Thought (CoT) reasoning. However, existing LVLM reasoning paradigms only begin reasoning after the entire video becomes available, introducing unnecessary latency and diminishing attention to early visual cues in dynamic scenes. Inspired by the human ability to think while watching, we introduce a streaming reasoning paradigm for LVLMs, where reasoning unfolds sequentially with incoming frames and deepens after the full video is observed. We instantiate this paradigm through Think-as-You-See (TaYS), a unified framework that enables LVLMs to reason while watching by integrating streaming CoT generation, stream-constrained training, and stream-parallel inference. Specifically, TaYS employs temporally aligned streaming reasoning units with precise CoT supervision, enforces ordered reasoning via streaming attention masks and positional encodings, and utilizes a parallel KV caches mechanism that decouples input encoding from reasoning generation, ensuring alignment and true concurrency. We evaluate TaYS on the Qwen2.5-VL model family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experimental results show that TaYS achieves superior reasoning performance compared with batch-mode CoT, while reducing pre-reasoning latency to under one second and overall answer delay by more than 50%. These findings demonstrate the effectiveness of the streaming paradigm in enabling real-time, human-like reasoning for LVLMs.

Abstract:
Appropriate parameter initialization is crucial for reducing the training cost of deep neural networks. Graph HyperNetworks (GHN) have emerged as a promising approach for initializing diverse architectures, with recent methods such as Task-Aware Learngene (TAL) further attempting to leverage pre-trained model knowledge via soft label supervision. However, such indirect supervision fails to fully exploit the rich information encoded in pre-trained weights. We propose Parameter InheriTance HyperNetwork (PITH), which introduces a novel parameter projection mechanism to directly inherit parameters from pre-trained models for initializing target networks of varying configurations. Our method enables initialized networks to directly achieve competitive performance on downstream tasks without any further training, which we term zero-shot initialization. Extensive experiments demonstrate the superiority of PITH: ViT-Base initialized by PITH achieves 53.35% zero-shot accuracy on ImageNet-1K, surpassing the previous state-of-the-art by 6.54%, with consistent improvements across multiple downstream tasks.

Abstract:
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Abstract:
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.

Abstract:
This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.

Abstract:
The emergence of large generative models has substantially advanced learning-based scene recovery in the synthetic domain. However, these models generalize poorly to real scenarios stemming from the significant distribution gap, alongside poor adaptation to complex and unforeseen degradations. Consequently, it is imperative to develop a real scene adaptation strategy that yields faithful restorations with reliable generalizability. To this end, we propose Bilevel Prompt LoRA, a novel learning paradigm designed to effectively adapt pre-trained generative models for real scene recovery. First, we introduce a self-supervised distribution-fidelity learning scheme to calibrate the autoencoding pathway under task-irrelevant real distributions to improve texture fidelity. Subsequently, a bilevel joint modeling via hyperparameter optimization is further established, empowering robust synthetic-to-real adaptation for both seen and unseen scenes by exploiting the complementary advantages between LoRA and Prompts to foster mutual promotion. Extensive evaluations on diverse real adverse scenarios demonstrate our superiority, with comprehensive algorithm analyses proving our effectiveness.

Abstract:
Storytelling in real-world videos often unfolds through multiple shots--discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, achieving global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling shot-by-shot autoregressive synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling. See our Project Page for more details.

Abstract:
Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.

Abstract:
Recent advances in image editing allow impressive manipulation of objects, existing methods still struggle to handle spatial movement in complex scenes, such as objects span different depth layers or are partially occluded. Most image editing methods focus solely on prior information from 2D datasets, emphasizing planar features while lacking support for spatial structures. Even approaches that incorporate explicit positional information fail to capture true 3D spatial relationships, thus limiting accurate object movement in complex scenes. In this paper, we present SpatialDiff, a method that effectively captures 3D spatial structures, enabling precise and consistent object movements in complex scenes. Our core innovations are twofold: (1) Implicit 3D Spatial Modeling, which introduces 3D prior knowledge and enables the model to internally build a comprehensive understanding of the three-dimensional spatial structure; and (2) Global Spatial Supervision, which constrains the latent spatial features to enable the model to perceive changes in object spatial positions caused by editing operations. Experimental results demonstrate that our method significantly improves the accuracy and fidelity of spatial movement in complex scenes.

Abstract:
Based on an RGB-D image pair, semantic scene completion (SSC) provides a description for 3D scene understanding by predicting 3D semantic occupancy map. Recent methods extract RGB-D multi-modal features and fuse them in spatial domain, which disregards the misalignment caused by the imperfect raw multi-modal data and the multi-modal feature learning. Moreover, the operations of extracting high-level features they utilized tend to introduce feature smoothing and detail loss, exacerbating the above misalignment. To tackle these problems, this paper introduces MFDNet, a lightweight semantic scene completion network based on a multi-modal frequency decomposition strategy. By integrating frequency processing with limited layers of convolution and downsampling, MFDNet achieves a balance between modalities alignment and detail retainment. The network is equipped with Multi-modal Adaptive Frequency Fusion (MAFF) and Frequency Detail Compensation (FDC). MAFF models the intra-modal multi-bands dependencies and inter-modal relationships from a global perspective, enabling modality-specific calibration while facilitating the aligned fusion of multi-modal features. FDC excavates the high-frequency cues in shallow features to compensate for the missing local details of the fused feature and achieve fine-grained alignment for completion. MAFF and FDC formulate a global-to-local alignment and completion paradigm for multi-modal SSC. Extensive experiments demonstrate that MFDNet reduces parameters by 54.4% while achieving state-of-the-art performance on the NYUv2 and NYUCAD datasets.

Abstract:
Nighttime photography is susceptible to flare caused by strong light sources, which degrades visual quality and disrupts structural information required by downstream vision tasks. Existing nighttime flare removal methods generally lack semantic priors for flare-occluded regions and thus tend to introduce artifacts and lose details under severe degradation. To address this problem, we propose a language-guided one-step diffusion framework that explicitly aligns flare-occluded regions with the underlying scene content at the semantic level. Specifically, we develop the first flare-specific vision-language model, Flare-VLM, which extracts fine-grained textual descriptions to guide one-step diffusion for high-quality restoration of severely damaged areas. Then, we propose semantics-aware distribution distillation to constrain the noise distribution with high-level semantics, suppressing redundant perturbations on clean backgrounds and improving the stability of distillation. In addition, we design an instruction-driven data synthesis pipeline to generate geometrically and semantically aligned nighttime flare samples, narrowing the gap to real degradations. Experimental results demonstrate that the proposed method achieves better restoration and enhances the performance of downstream vision tasks.

Abstract:
Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.

Abstract:
As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.

Abstract:
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results.

Abstract:
We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow based text-to-image image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

Abstract:
Exemplar-Free Class Incremental Learning (EFCIL) aims to enable models to learn new classes sequentially without retaining samples from previous tasks. While recent approaches leverage pre-trained models with parameter-efficient tuning to mitigate forgetting, they often overlook a crucial cause of forgetting: the collapse of the class-discriminative structure. This structure comprises two interdependent components: intra-class structure, which characterizes the shape of individual classes, and inter-class structure, which characterizes the global geometric relationships among class prototypes. We reveal that catastrophic forgetting stems from the simultaneous deterioration of both intra-class and inter-class structures. To address this, we propose a unified framework that preserves the class-discriminative structure. It preserves the intra-class structure by reshaping class means and covariances to preserve each class's shape during migration, and maintains inter-class structure by stabilizing angular relationships between samples and old prototypes. Extensive experiments demonstrate that our framework outperforms existing leading methods on multiple EFCIL benchmarks, validating that preserving the class-discriminative structure is crucial for mitigating catastrophic forgetting.

Abstract:
Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling.To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure.It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity.Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization.Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

Abstract:
We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship--including relative position, orientation, and mutual gaze--that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms are designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.

Abstract:
Videos are increasingly used as inputs to machine learning systems, where repeated decoding and processing across diverse downstream tasks dominate computational cost. However, existing video pipelines remain inefficient. Traditional codecs such as H.264 and H.265 are optimized for human perception and require full pixel decoding for every query, compressed-domain methods are tied to specific codec structures with limited flexibility, and machine-oriented video coding approaches often rely on task-specific encoders and separate representations without supporting human visualization. We propose Neural Video Pipeline (NVP), a framework that leverages implicit neural representations to directly extract task-specific features from intermediate layers, eliminating pixel reconstruction overhead. NVP introduces lightweight micro adapters that map these features into the representation space of downstream models, bypassing both decoding and early-stage feature extraction. Across four representative tasks--image classification, object detection, action recognition, and segmentation--NVP reduces latency by up to 89.5% and inference FLOPs by up to 29.9%, while supporting multiple tasks using a single unified representation.

Abstract:
Cognitive science suggests that spatial ability develops progressively--from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic--negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

Abstract:
Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for scalable and accurate spatial transcriptomics prediction.

Abstract:
Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing can decompose the task into simpler steps but suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we discovered that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

Abstract:
Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV) --a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues--explicit depth-based projection and implicit learned priors--yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

Abstract:
Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift.Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339xfewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.

Abstract:
Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency-domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

Abstract:
Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency. To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion-appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low variance regime at low cost, suppressing artifacts and stabilizing the trajectory. Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.

Abstract:
Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage.Most existing diffusion codecs employ UNet architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8xspatial downscaling), resulting in excessive computation.In contrast, conventional VAE-based codecs work in much deeper latent domains (16x-64xdownscaled), motivating a key question:Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality?To address this, we introduce DiT-IC--an Aligned Diffusion Transformer for Image Compression--which replaces the UNet with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32xdownscaled resolution.DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms:(1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction;(2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and(3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference.With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.

Abstract:
We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200x lower memory footprint.

Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks. Code and project page will be released.

Abstract:
Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data.This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry.By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.

Abstract:
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments.

Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning.

Abstract:
Post-training quantization (PTQ) enables rapid deployment of deep pretrained models. In the low-bit regime, recent PTQ methods for vision models adopt asymmetric quantization (AsymQ), introducing zero-point offsets to mitigate quantization errors. However, these offsets impose substantial hardware overhead and fail to fully capture the non-symmetric structure of pretrained weight distributions, leaving many quantization levels unused.In this paper, we reveal a hidden symmetry in the pretrained weights: after removing a few sparse outliers, the distribution becomes nearly symmetric.Accordingly, we propose Dense and Additive Sparse Quantization (DASQ), which decomposes the weights into dense and sparse matrices.The dense component captures the symmetric structure around zero, while the sparse component models the removed outliers, and both can be processed in parallel and can be implemented with efficient zero-point-free computation.Experiments on image classification, object detection, and instance segmentation show that DASQ surpasses state-of-the-art PTQ methods with lower BOPs. On an FPGA, DASQ also demonstrates higher accuracy and lower power consumption than AsymQ at comparable throughput.

Abstract:
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, outperforming the strongest baseline by a relative 62% in average ADD(-S) score, including methods that utilize extensive dataset-specific training.

Abstract:
The recent success of artificial intelligence motivates many non-professional users to train their own models. Those users often resort to cloud training services, seeking to obtain a sufficiently accurate model at a modest cost, for which properly setting up the learning rate and batch size is crucial. While various Hyper-parameter Optimization (HPO) methods have been proposed in that regard, they largely act based on heavy-weight validation signals, being inefficient in the overall cost. We find that the model training process can be viewed as a two-dimensional voting process---with gradients for different iterations and from different samples; moreover, to attain cost-efficient training is to ensure that the gradient redundancy is within a proper range which is similar across diverse models. Based on that insight, we further introduce GR-Gauge, a general method that gauges the gradient redundancy to instruct HPO decisions like configuration searching and trial termination. Extensive experiments demonstrate that GR-Gauge can help attain near-optimal accuracy in much less time than existing methods.

Abstract:
Open-vocabulary learning on CLIP provides remarkable generalization on diverse concepts, however, falters under the realistic streaming open-world evaluations for Stability against distractor classes and Extensibility to novel classes. Current fine-tuning methods often fail these tests since they are mainly designed for closed-set conditions, leading to the performance gaps while the target vocabulary progressively scales. We formalize a "vocabulary scaling law" showing that these openness measures can be lower-bounded by performance on the full class-name universe, implying that robust fine-tuning should: (i) account for the entire vocabulary, (ii) tune class-name embeddings rather than context, and (iii) enforce orthogonality between prompt embeddings including training and open-set class names. Guided by our analysis, we propose Submodular-Vocabulary Fine-tuning (SVFT), a bi-level optimization framework that approximates the intractable objective of tuning all class name embedding by greedily selecting a small, informative subset of class names via constrained submodular maximization, thus, allows the employment of efficient greedy algorithm for the near-optimal class-name subset selection to fine-tune CLIP instead of using all open classes. Across extensive experiments, SVFT consistently improves both stability and extensibility, advancing the openness and practical robustness of CLIP-based vision-language models.

Abstract:
Style generation has made significant progress through diffusion models. Recent efforts have explored reinforcement learning with human-preference reward models to enhance diffusion models for general downstream applications. However, we identify a critical limitation: existing human-preference reward models struggle to effectively perceive image style, resulting in suboptimal performance after reinforcement fine-tuning. To address this, we first introduce a large-scale style reward modeling dataset comprising 400K paired samples spanning 1,000 diverse style categories, augmented with textual instructions and style reward annotations. We then propose StyleDoctor, a novel style perception reward model capable of jointly evaluating style consistency between paired images and style-text alignment. StyleDoctor outperforms existing style perception models in both style retrieval and generation tasks. Extensive quantitative and qualitative experiments demonstrate the superiority of StyleDoctor over competing approaches, showcasing its efficiency and versatility in style-conditioned generation. Dataset and code are available at StyleDoctor.

Abstract:
Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets.Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a Text-Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation.

Abstract:
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization.To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.

Abstract:
We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth infers a layer index for each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

Abstract:
With the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constraints through Mutual Geometric Consistency Prior to identify minimal highly reliable unchanged seeds. From these seeds, Stability-Guided Controlled Attention modules progressively propagate stability from stable regions to neighboring uncertain points, enabling unchanged regions to grow layer-by-layer from interior cores toward boundaries. This coarse-to-fine growing process naturally forms coherent regions, avoiding isolated noise while achieving compact, well-defined boundaries through progressive expansion. Extensive experiments on the synthetic dataset Urb3DCD and the real-world dataset HKCD demonstrate that SRGCD achieves state-of-the-art performance, significantly improving interior completeness and boundary compactness over existing methods.

Abstract:
Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cast aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss (\mathcal L _ MQA ) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.

Abstract:
Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.

Abstract:
Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores.To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.

Abstract:
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically restricted to the head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.

Abstract:
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language.Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance.To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories.With pretraining on LVISCap and LV-VISCap, CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT.

Abstract:
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of previous work, we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256/256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 - 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.

Abstract:
Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves state-of-the art retrieval accuracy and notable computational efficiency compared to previous works.

Abstract:
Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.

Abstract:
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatial-temporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video instance segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

Abstract:
This paper explores parallel thinking for Multi-modal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) through multiple diverse reasoning paths. We guide the model to list multiple visual key points and develop an independent reasoning path for each. Therefore, we term this method PointThinker, which is characterized by starting each thinking path with a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It uses a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking, some points are helpful while others are invalid, yet popular methods assign them the same rewards. Therefore, we propose allocating differentiated rewards to different points within the same chain-of-thought. This is implemented via a self-verification mechanism called Group Points Policy Optimization (GPPO), which combines rollout-level and point-level validation for reward assignment. On challenging benchmarks such as HallusionBench, PointThinker achieves 58.7% accuracy, improving reasoning quality and answer accuracy. Experimental results demonstrate that parallel thinking with point improves performance, and GPPO further contributes non-trivial gains.

Abstract:
Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies.

Abstract:
Time-intensive performance evaluations significantly impede progress in Neural Architecture Search (NAS). To address this, neural predictors leverage surrogate models trained on proxy datasets, allowing for direct performance predictions for new architectures.However, these predictors often exhibit poor generalization due to their limited ability to capture intricate relationships among various architectures. In this paper, we propose HyperNAS, a novel neural predictor paradigm for enhancing architecture representation learning. HyperNAS consists of two primary components: a global encoding scheme and a shared hypernetwork. The global encoding scheme is devised to capture the comprehensive macro-structure information, while the shared hypernetwork serves as an auxiliary task to enhance the investigation of inter-architecture patterns. To ensure training stability, we further develop a dynamic adaptive multi-task loss to facilitate personalized exploration on the Pareto front. Extensive experiments across five representative search spaces, including ViTs, demonstrate the advantages of HyperNAS, particularly in few-shot scenarios. For instance, HyperNAS strikes new state-of-the-art results, with 97.60% top-1 accuracy on CIFAR-10 and 82.4% top-1 accuracy on ImageNet, using at least 5.0xfewer samples.

Abstract:
The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.

Abstract:
We propose a federated framework for training diffusion models on decentralized and private datasets. The method learns a shared generative model alongside personalized client models, allowing clients to benefit from cross-client structure while ensuring that the shared model cannot reproduce any client's data on its own. We provide formal differential privacy guarantees for each client and establish utility bounds for conditional generation under a Gaussian mixture model, showing that collaboration improves sample quality relative to private non-collaborative training. Experiments on CIFAR-10, Colorized MNIST, and CelebA support these results: the method generates high-fidelity samples, improves performance on minority and underrepresented classes, and maintains strong protection against membership inference, memorization, and reconstruction attacks.

Abstract:
Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.

Abstract:
Monocular 3D Object Detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely-annotated setting is common in real-world scenarios where annotating every object is impractical.To address this, we propose a novel framework for sparsely-annotated monocular 3D object detection with two key modules.First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. PBF maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates.Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision.Extensive results demonstrate the effectiveness of the proposed method. The source code will be publicly available.

Abstract:
Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by ~ 33% over baseline video generators, and by up to ~ 8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.

Abstract:
Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark release publicly at http://www.lidarhumanmotion.net/mtt/

Abstract:
Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines.

Abstract:
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Abstract:
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLM, which led to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods for instance. These results demonstrate that presenting VLMs with 3D abstractions substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Abstract:
Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve model's counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves SOTA results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments reveal that on out-of-domain benchmarks, reasoning in the language space offers limited performance gains, suggesting the need for more robust cross-domain reasoning mechanisms.

Abstract:
Colored glass is widely used in everyday settings, yet its reflective and absorptive properties often introduce ghost shadows and color bias in captured images. However, existing methods typically neglect the absorption issue, making it difficult to address color bias caused by colored glass. To address this, we are the first to apply polarization imaging theory to model the light transmission process within glass. Specifically, we propose a novel imaging model, the Polarization State Tracing Model (PSTM), which traces polarized light along multiple propagation paths and accounts for wavelength-selective absorption, enabling joint reflection removal and color-consistent reconstruction. Guided by PSTM, we design a Channel Ring Attention (CRA) mechanism to efficiently capture inter-angle polarization dependencies and enhance feature interaction across polarization channels, ensuring physically consistent recovery. Besides, the recovered polarization information can be directly applied to advanced downstream tasks, such as Shape-from-Polarization (SfP). We construct a real-world dataset, GlassPol, containing a wide range of glass materials, enabling testing under diverse optical conditions. Extensive experiments show that our method outperforms existing state-of-the-art methods, achieving up to a 3dB improvement in PSNR, establishing a new benchmark for polarized reflection removal.

Abstract:
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general robotic manipulation. However, existing approaches typically omit intermediate reasoning steps and directly regress actions, limiting reasoning interpretability and performance in long-horizon or compositional tasks. Although recent studies introduce Chain-of-Thought (CoT) reasoning into VLA models, their effectiveness remains suboptimal due to two key issues: (1) generating a full reasoning trajectory at every timestep introduces substantial redundancy, thereby hinders real-time deployment and (2) reasoning is performed independently, neglecting temporal consistency, which leads to planning conflicts.We propose TRM-VLA, a temporal-aware reasoning and memorization framework that integrates explicit temporal modeling into the VLA reasoning process. TRM-VLA consists of two core components: (1) Keyframe-Triggered Reasoning (KTR), which identifies task progress and performs hierarchical CoT reasoning only at key decision points to reduce redundant inference; and (2) Granularity-adaptable Context Memory (GCM), which dynamically stores and retrieves historical reasoning trajectories to maintain inter-frame coherence and global context. Built upon a dual-system architecture--combining a multimodal foundation model for slow reasoning (System 2) with a diffusion-based policy for fast execution (System 1)--TRM-VLA learns to plan and act efficiently in a unified manner. Extensive experiments on LIBERO-90, SIMPLER, and four real-world robotic tasks demonstrate that TRM-VLA achieves state-of-the-art performance while improving reasoning efficiency.

Abstract:
The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) Instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than VFMs' visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities in a single question, making it difficult to determine whether errors arise from the lack of all required abilities or just one key ability. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs), foundational skills such as localization, depth estimation, and spatial understanding, which collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

Abstract:
Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.

Abstract:
Zero-shot image denoising has gained prominence in recent years, as it inherently relies on the intrinsic priors of images rather than learning from external data. Nevertheless, most existing methods either fail to fully exploit global priors, or do not properly preserve the fine-grained details governed by local priors. In this work, we propose a novel framework of pseudo sample generation for zero-shot denoising guided by local and global image priors. Specifically, we propose a well-crafted down-sampler based on gradient merging and grouping within a small window to generate down-sampled samples by exploiting spatial locality. Meanwhile, a global random sampler conditioned on a Gaussian distribution is devised to incorporate the nonlocal self-similarity of natural images. These two samplers build a new paradigm of pseudo sample generation powered by both local and global priors, which is termed as Zero-Shot Hybrid Prior-guided Denoising (ZS-HPD). Considering that noise is more likely to affect high-frequency details, we also present a simple yet effective loss that works in the Fourier domain and applies discriminative weights to distinct spectral bands. Numerous experiments on benchmark datasets have demonstrated the superiority of our ZS-HPD over existing advanced methods. The code of our ZS-HPD will be available at here.

Abstract:
Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening.However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker's motion to produce the listener.This design is not end-to-end, thereby hindering the real-time applicability.To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio.Our method introduces a novel two-stage training paradigm.Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues.Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.

Abstract:
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.

Abstract:
Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We found that naive latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching from noise-timestep discrepancies. Then, based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), to mitigate those artifacts while achieving spatial acceleration of DiTs by our mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0xspeedup on FLUX-1.dev and 3.0xon Stable Diffusion 3 with negligible quality degradation. Furthermore, our RALU is complementarily applicable to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9xspeedup.

Abstract:
Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes--a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E^2OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E^2OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E^2OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications. The code is available at github.com/chenchenzong/E2OAL.

Abstract:
Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling a task-level expert specialization.

Abstract:
Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Grobner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Grobner-basis and resultant-based solvers.

Abstract:
Video diffusion models have achieved remarkable generative performance, but their substantial computational and memory costs pose significant challenges for deployment, especially on consumer GPUs. As recent advances in attention optimization mitigate previous computational bottlenecks, linear layers now dominate both computational cost and inference memory. In this work, we focus on quantizing both weights and activations to 4 bits to accelerate these layers. Previous methods, such as SVDQuant, overlook the highly dynamic nature of activations across denoising timesteps, where outlier channels and magnitudes vary dramatically. However, video data inherently exhibits strong activation similarity among neighboring tokens in space and time, which we term spatiotemporal activation similarity, analogous to how video codecs exploit intra- and inter-frame redundancy. Leveraging this property, we introduce DeltaQuant, which partitions activations into local 3D spatiotemporal cubes and uses each cube's mean token as a \coretoken, quantizing only the small differences (delta tokens) to 4 bits while keeping core tokens in FP8. This decomposition substantially reduces quantization error with minimal overhead.For weight quantization, DeltaQuant incorporates SVDQuant's low-rank decomposition to further reduce quantization error.We also implement an efficient kernel that translates DeltaQuant's computational benefits into real-world speedups.Extensive experiments on Wan 2.2 I2V, Wan 2.2 T2V, and LTX-Video T2V demonstrate that DeltaQuant maintains high generation fidelity.On Wan 2.2, it compresses model size by 2.9x and reduces memory footprint by 2.3x. DeltaQuant is compatible with efficient attention mechanisms and few-step distillation. When integrated with these techniques, it achieves an additional 3.0x acceleration, for a total 111.8x end-to-end speedup. Code and models will be released upon publication.

Abstract:
Diffusion models (DMs) have recently achieved remarkable success across diverse modalities, including high-fidelity image and video synthesis.However, their inherent step sequential denoising process introduces substantial cumulative latency, which significantly degrades user experience. While existing multi-GPU parallelization motheds can alleviate latency, they often incur prohibitive GPU-GPU communication overhead, offsetting much of the performance gain.We present Otil (Only Transmit Informative Latents), a communication-efficient parallel framework for accelerating diffusion inference.% Otil can minimizes redundant data exchange across GPUs while preserving generation quality.Our key insight is that latent activations change only marginally between consecutive denoising process. Leveraging this property, Otil identifies and synchronizes only the most informative latent sub-blocks and introduces a dynamic polling mechanism that periodically revisits all spatial regions, ensuring complete coverage without unnecessary communication. The framework is fully plug-and-play and remains compatible with fast sample and architectural acceleration algorithms, without requiring any retraining or architectural modification.Otil reduces GPU-GPU communication up to 87.5% compared with SOTA parallelism methods, achieving 1.8xspeedup on two GPUs with Stable Diffusion v1.5 and 2.6xon four GPUs with Stable Diffusion XL. When combined with few-step samplers (30 steps) and LoRA models, the acceleration further increases to 2.46x-2.84xon 2 GPUs. These demonstrate the strong potential of Otil for scalable and efficient multi-GPU diffusion inference while preserving generation fidelity.

Abstract:
Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.

Abstract:
Dark optical flow estimation (DOFE) faces critical challenges: discriminative models are less robust to noise and struggle with weakened motion patterns, while diffusion models suffer from discontinuous flow fields and low efficiency. Flow matching (FM), though efficient, remains underexplored for conditional generation in DOFE. In this paper, we propose FlowFM, the first flow matching model tailored to DOFE tasks. Instead of conventional vector field regression, FlowFM proposes estimating the global transformation path constrained by the ground truth optical flow. It generates noisy flow by mixing Gaussian noise with ground truth, then performs a one-step denoising process conditioned on the initial flow field, cost volume, and contextual features for optimal accuracy and efficiency. FlowFM incorporates an implicit Fourier denoising decoder (IFDD) for reliable motion understanding. By leveraging the Fourier transform, IFDD uses amplitude to characterize motion intensity and phase to encode target spatial relationships within flow fields, then directly enhances amplitude to restore dark-caused motion information loss. Experiments show that FlowFM significantly outperforms state-of-the-art methods on the FCDN and VBOF benchmarks, setting a new performance record for DOFE.

Abstract:
Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.

Abstract:
We introduce Neural Riemannian Motion Fields (\name), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, \name remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.

Abstract:
Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.

Abstract:
Music style transfer (MST) aims to reinterpret existing musical pieces in new stylistic forms while maintaining their melodic coherence. Conventional approaches conditioned on text or audio overlook the profoundly multimodal character of musical style. Visual ambience -- reflected in color, lighting, and composition -- encodes affective attributes that parallel timbre, rhythm, and harmony, which, however, remain underexplored in MST context. We introduce a flow-based, inversion-free framework for multimodal music style transfer that unifies textual and visual guidance. Our approach tackles two challenges: (1) capturing cross-modal semantics beyond language through a dual-encoder fusion module that merges CLIP- and ViT-derived embeddings, and (2) preserving melodic identity using a differentiable normalized chroma constraint that regulates pitch-class consistency along the generative flow. We reorganize and extend the MeLBench and MusicCaps collections into a genre-structured multimodal dataset to support style-aware analysis. Quantitative and perceptual evaluations demonstrate that our approach achieves superior control, structural fidelity, and cross-modal expressiveness, underscoring the role of visual perception in music generation. Our demo is available online.

Abstract:
Unsupervised domain adaptation (UDA) for pose estimation promises transfer from synthetic to real domains but often suffers instability under domain shift. Prior work attributes this deterioration to gradient interference between source supervision and target consistency. This conflict is distinct in pose estimation, where sparse and heterogeneous supervision signals cause gradients to be highly sensitive to small localization errors and lead to unstable updates. To address these challenges, we propose HamiPose, a Hamiltonian optimization framework that transports decoupled and confidence-calibrated gradients within a unified geometry to mitigate instability. HamiPose first refines gradient interaction through keypointwise geometry decomposition, orthogonally projecting target gradients to preserve nonconflicting component. Channelwise gated alignment then calibrates the parallel component with confidence and alignment, producing decoupled, confidence-calibrated gradients. These gradients are advanced by a Hamiltonian optimizer with a symplectic integrator, providing controlled momentum that stabilizes updates. Extensive experiments demonstrate that HamiPose achieves state-of-the-art performance in UDA pose estimation while maintains strong performance under domain generalization settings.

Abstract:
Causal graphs play a crucial role in AI research as they reveal the data generation processes underlying real-world machine learning and computer vision tasks. Recent studies have leveraged causal graphs to develop more robust and interpretable models. However, limited or biased data often lead to inaccurate causal graph estimation, reducing a model's transferability to unseen domains. To address this challenge, we propose a novel framework that performs Bayesian inference over causal graphs to capture potential underlying causal relations and identify invariant causal features for DG prediction. The key advantage of our framework lies in its ability to quantify causal graph uncertainty in the context of prediction tasks and incorporate it into the prediction process. Our proposed uncertainty provides valuable insights into (i) the reliability of our method on specific datasets, (ii) the alignment between learned causal graphs and unseen test domains, and (iii) the confidence of our predictions. In particular, we go beyond merely quantifying uncertainty and leverage it as weighting factors in a weighted Bayesian inference scheme. Empirical results on multiple benchmark distribution-shift datasets show that our algorithm, Causal Graph Uncertainty-guided Bayesian Inference (CGU-Bayes), outperforms existing DG methods on challenging datasets and achieves state-of-the-art performance overall.

Abstract:
Although 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, as the camera and objects move during the exposure time, these input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and new viewpoint synthesis. The existing deblurring 3D Gaussian models still cannot handle motion blur issues in real dynamic scenes. To address these challenges, we propose MSCD-GS--a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, due to the distinct motion characteristics of static and dynamic Gaussians, we perform separate motion models for dynamic scene reconstruction. To predict Gaussian changes over exposure time, we designed motion-aware networks for static and dynamic Gaussians, thereby synthesizing virtual blurred images. Finally, we utilize the results from the deblurring network and the synthesized images to supervise 4D reconstruction collaboratively. Extensive experiments demonstrate that MSCD-GS can effectively reconstruct high-quality dynamic scenes from blurred image inputs, with performance surpassing existing methods.

Abstract:
Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms that capture the atomic joint movements and Action Motifs that are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.

Abstract:
The YOLO (You Only Look Once) series has been a cornerstone in real-time object detection, renowned for its efficient convolutional design and rapid inference. However, its reliance on convolutional operations inherently limits its ability to capture long-range dependencies and rich contextual information, leading to suboptimal performance in complex scenes. Recently, SSM (State Space Models) have emerged as an efficient alternative to attention mechanisms, offering global representation with linear time complexity. In this paper, we propose AKCMamba-YOLO, a novel object detector that incorporates SSM into the YOLO architecture. We introduce 3CAKCMamba and 4CAKCMamba modules to a novel object detection framework, enabling enhanced channel interaction and cross-layer semantic fusion. This design improves multi-scale feature modeling while maintaining computational efficiency. To support safety-critical applications, we provide railway pedestrian Detection datasets with 2,975 annotated images under complex scenarios. Experiments on COCO2017, power tower foreign object detection datasets, and our custom dataset show that AKCMamba-YOLO achieves superior accuracy and speed compared to state-of-the-art baselines, making it well-suited for real-time and resource-constrained environments.

Abstract:
Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most models treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision-language and segmentation architecture that formulates document grounding as pixel-level segmentation. It produces fine-grained evidence masks refined by a bleed-suppression loss to prevent spillover. M3Grounder autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. Also, M3Grounder grounds evidence hierarchically across phrase, line, and block levels using an enclosure loss that enforces spatial containment. We release GroundingDocQA dataset (200K documents, 2M multi-span and multi-granular QA pairs with pixel-level grounding masks), built through a data engine that handles complex layouts, curved-text, and graphics-rich documents. We also release GroundingDocQA-Bench, a diverse and challenging human-verified benchmark. M3Grounder sets a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained and contextually grounded evidence.

Abstract:
Co-segmentation aims to identify and segment common objects across a set of point clouds or images. Existing methods focus on single-modal co-segmentation. However, the limited semantics of a single modality restrict the discovery of common objects, leading to costly and labor-intensive segmentation masks. In contrast, cross-modal co-segmentation leverages both modalities, offering two key advantages: (i) additional semantic cues compensate for the absence of segmentation masks; and (ii) complementary modalities provide richer common semantics beyond the limitations of single-modality approaches. Motivated by these challenges, we introduce a novel task: unsupervised point cloud-image cross-modal co-segmentation. We tackle this problem using a coarse-to-fine approach. First, the 3D and 2D branches extract coarse common semantics from each modality, respectively. Then, a cross-modal common semantic graph purifies these features into fine-grained common semantics. Finally, 3D and 2D common semantic features are mutually enhanced, without requiring geometric alignment. Experiments on two standard point cloud benchmarks and two corresponding image co-segmentation datasets demonstrate our superior performance compared to existing unsupervised state-of-the-art methods.

Abstract:
Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image.In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds representing a 10xto 35xspeedup over state-of-the-art methods while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

Abstract:
Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose StaR-KVQA, a framework that equips IK-KVQA with dual-path structured reasoning traces--symbolic relation paths over text and vision together with path-grounded natural-language explanations--to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. With a single open-source MLLM, StaR-KVQA constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline.

Abstract:
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

Abstract:
Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to (4x), achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.

Abstract:
Facial Action Unit (AU) detection suffers from limited annotated data, severe class imbalance, label noise, and confounding biases, which often lead to overfitting and degraded performance. We propose an Uncertainty-Driven Causal Transformer (UDCT) framework for robust AU detection by jointly modeling uncertainty and causal intervention. Specifically, we parameterize Transformer attention weights as Gaussian distributions to capture robust AU dependencies while explicitly modeling uncertainty in attention. We further introduce an uncertainty-aware loss reweighting strategy to alleviate the effects of class imbalance and label noise during training. In addition, we incorporate a causal intervention module to suppress confounder-dependent AU associations and encourage the model to focus on more stable and less biased AU relationships. Experiments on BP4D and DISFA demonstrate that UDCT achieves competitive performance with stronger robustness under noisy, imbalanced, and distribution-shifted settings.

Abstract:
Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method ("energy harmonization") that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.

Abstract:
Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL) requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN and Transformer-based approaches, struggle to jointly capture fine-grained detail and long-range structure at scale. State-space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.

Abstract:
Text-to-image (T2I) diffusion models currently lack an efficient mechanism for early quality assessment, forcing costly random trial-and-error in scenarios requiring multiple generations (e.g., iterating on prompts, agent-based image generation, flow-grpo). To address this, we first reveal a strong correlation between the attention distribution in the early diffusion process and the final image quality. Building upon this insight, we introduce Diffusion Probe, a pioneering framework that leverages the model's internal cross-attention maps as a predictive signal. We propose a lightweight predictor, trained to establish a direct mapping from statistical properties of these nascent cross-attention distributions--extracted from the initial denoising steps--to the final image's comprehensive quality. This allows our probe to accurately forecast various aspects of image quality, regardless of the specific ground-truth quality metric, long before full synthesis is complete. We empirically validate the reliability and generalizability of Diffusion Probe through its consistently strong predictive accuracy across a wide spectrum of conditions. On diverse T2I models (e.g., SDXL, FLUX, Qwen-Image), throughout broad early-denoising windows, across various resolutions, and with different quality metrics, it achieves high correlation (PCC > 0.7) and classification performance (AUC-ROC > 0.9). This intrinsic reliability is further demonstrated in practice by successfully optimizing T2I workflows that benefit from early, quality-guided decisions, such as Prompt Optimization, Seed Selection, and Accelerated RL Training. In these applications, the probe's early signal enables more targeted sampling strategies, preempting costly computations on low-potential paths. This yields a dual benefit: a significant reduction in computational overhead and a simultaneous improvement in final outcome quality, establishing Diffusion Probe as a model-agnostic and broadly applicable tool poised to revolutionize T2I efficiency.

Abstract:
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360deg. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360deg panoramic image. To study visual search in visually-crowded real-world scenarios, we build H Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only 30% success in object and path search. We then use post-training techniques to enhance the open-source models, increasing its success rate by over threefold for both object search (14.83% - 47.38%) and path search (6.44% - 24.94%) on the smallest 3B model. Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.

Abstract:
Reconstructing visual semantics from Electroencephalography (EEG) signals enables a deeper understanding of human visual cognition and supports next-generation brain-computer interface (BCI) applications.Despite notable advances in recent years, most existing EEG encoders still struggle to capture the frequency-specific neural dynamics that reflect perceptual and cognitive rhythms. Moreover, the cross-modal alignment between EEG and visual content remains insufficiently tackled, leading to limited semantic consistency and visual fidelity. To address these issues, we propose D^2-FOSA, a unified dual-diffusion guided framework with frequency-oriented semantic alignment, which strengthens the frequency-aware EEG representation for more semantically aligned image reconstruction.Specifically, we design a Frequency-Spatio-Temporal Dynamics Encoder (FSTDE) based on the Frequency-Oriented Mamba (FOMamba) to explicitly model oscillatory patterns and long-range dependencies in EEG signals. The extracted features are then pulled into the CLIP-aligned visual semantic space via contrastive learning.Meanwhile, a Dual Diffusion Latent Generator (DDLG) with bidirectional EEG-image conditioning is designed to enforce cross-modal alignment and promote cycle-consistent generation.Extensive experiments on four challenging datasets demonstrate that our proposed D^2-FOSA significantly outperforms existing methods in both retrieval and reconstruction tasks. Particularly, our D^2-FOSA surpasses the contemporary MB2C method by over 17 FID in the reconstruction task on THINGS-EEG, indicating a substantial improvement in perceptual fidelity. The source code is in the supplementary material.

Abstract:
Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a set of behavioral, observational, and causal mediation analyses. To this end, we design a specialized tool, CountScope, for the mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. We further reveal that models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and strongly influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.

Abstract:
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.

Abstract:
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a single decoder-only transformer architecture. OneCAT uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution image inputs and outputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) design trained with a unified autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer to achieve multi-scale visual autoregressive mechanism within the Large Language Model (LLM) with proposed scale-aware adapter (SAA) that drastically reduces decoding latency compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as an elegant foundation for unified multimodal intelligence. As a result, OneCAT outperforms existing unified models across benchmarks for multimodal understanding, generation, and editing.

Abstract:
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors.This work addresses these limitations in two complementary ways. First, we release WildRoad, a gloabal off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.

Abstract:
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

Abstract:
The primary axes of interest in low-light image enhancement (LLIE) are color constancy--ensuring consistent outputs across inputs of the same scene under varying illumination and noise--and generalization across diverse datasets. Existing methods, whether supervised, unsupervised, or zero-shot, rely on auxiliary loss functions and empirically selected hyperparameters, which yield strong results on the datasets used for evaluation but often exhibit limited generalization. To overcome these constraints, we propose MR. Illuminate (pronounced "Mister Illuminate"), the first deep learning-based solution for LLIE that requires no optimization and no degradation assumption. "MR." emphasizes our Modulate-Refine design: global illuminance and color are modulated via Adaptive Instance Normalization (AdaIN), while local structure and color are refined through self-attention features within a pre-trained diffusion model, taking a unique approach from prior methods. Extensive quantitative evaluations show that our approach surpasses SOTA methods on standard LLIE benchmarks, while qualitative results demonstrate improved color fidelity. Moreover, without any modification to our framework, our method achieves competitive results on the auto white balance (AWB) task, underscoring its strong generalization capability.

Abstract:
Narrative image generation aims to create images featuring multiple distinct characters while capturing their interrelationships, posing significant challenges for current text-to-image diffusion models. As a result, general personalized methods often suffer from poor semantic alignment, identity blending, and aesthetic implausibility.These issues are inadequately captured by existing evaluation metrics such as CLIP, ArcFace, and conventional reward models, which fundamentally fail to align with human perceptual preferences. To align with human preferences, we first construct a fine-grained human preference dataset, NI-RLHF, by collecting both detailed human critiques and preference judgments across three core dimensions: prompt following, identity consistency, and visual quality.This comprehensive dataset facilitates the training of NIReward, a critique-based reward model capable of generating interpretable image evaluations.Building upon the interpretable reward signal from NIReward, we propose Adaptive Dominance-based Preference Optimization (ADPO) to balance learning across diverse preference dimensions while dynamically adapting to reward margins.Experimental results indicate that NIReward significantly outperforms existing evaluation models and reward models, and ADPO yields a significant improvement across the three key preference dimensions. By introducing NIReward and ADPO, our work paves the way for generating narrative images aligned with actual human preferences.

Abstract:
Existing surrounding-view 3D object detectors initialize high-confidence queries using current 2D information, while leveraging historical 3D features as priors. However, such heavy reliance on 2D cues introduces spatio-temporal inconsistencies between 2D and 3D representations. Specifically, 2D cues lack sufficient spatial information, limiting 3D localization capability. Moreover, insufficient temporal interaction often leads to object omission under occlusion. To address these challenges, we propose STUR3D, a unified framework establishing spatio-temporal alignment between 2D and 3D perception. First, we project temporal 3D features to the 2D image plane, empowering the 2D detector to distill representations essential for 3D localization, harmonizing cross-dimensional information. Second, we inject temporal cues into 2D detection, fostering spatio-temporal reasoning, and ensuring robust 3D detection under dynamic scenes and occlusion. Additionally, we embed depth-aware geometric cues into features for 2D-to-3D lifting, mitigating inherent ambiguities. Extensive nuScenes experiments validate STUR3D, achieving SOTA on the test set with 57.9% mAP and 64.6% NDS.

Abstract:
Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.

Abstract:
We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space.Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians.To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement.Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.

Abstract:
World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.

Abstract:
Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.

Abstract:
Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness.

Abstract:
Recent advancements in the text-rendering capabilities of image generation models have made the end-to-end creation of graphic design content, such as posters, increasingly feasible. However, existing reward models fall short of accurately assessing design quality, as they primarily focus on global image aesthetics while overlooking the critical dimensions of typography and layout. Furthermore, the scarcity of domain-specific preference data remains a significant bottleneck, limiting the further development of graphic design evaluation and generation. To bridge this gap, we design an automated pipeline to construct a high-quality dataset of 70k poster preferences by leveraging the consensus of multiple Multi-modal Large Language Models (MLLMs) to simulate human-like judgment. Based on this dataset, we propose PosterReward, a reward model specifically designed for high-precision poster assessment through a cascaded, multi-stage training strategy. We also provide multiple variants of the model to cater to different application scenarios. Finally, we introduce PosterRewardBench and PosterBench to evaluate the performance of existing reward models in poster assessment and the generation capabilities of current text-to-image models in poster creation, respectively.

Abstract:
Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFirll, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as "frills'". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFirll will advance the T2M field by addressing longstanding data quality and benchmarking challenges. The project page is https://research.beingbeyond.com/opent2m

Abstract:
Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation.It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation.In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner.We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents.To bridge this gap, we introduce timesteps sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training.We present GenMask, a DiT trains to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need of feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks and ablations quantify the contribution of each component.

Abstract:
Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information, however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.

Abstract:
Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query to normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve in closed form. In particular, we do not use the optimized features themselves--the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as anomaly measure.

Abstract:
Humans develop visual intelligence through perceiving and interacting with their environment--a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how can artificial systems learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.

Abstract:
Federated Learning (FL) is a decentralized framework that not only enables collaborative training with different clients but also ensures their local data privacy. However, when deletion requests arise under privacy regulations, efficiently removing specific client data contributions from target clients can be challenging. Existing unlearning methods face significant limitations under Non-IID (Non-Independent and Identically Distributed) data distributions when attempting to unlearn specific target clients in FL. Models in sharp optimization regions can suffer catastrophic knowledge loss from minor parameter changes, exacerbating this forgetting due to conflicting parameter updates across clients caused by Non-IID data distributions in FL. Empirically, we observe that conflicting updates under Non-IID settings generate misaligned task vectors that fail to isolate target knowledge. Therefore, we exploit the loss landscape geometry in unlearning specific target clients. We demonstrate that migrating models to flat regions can enhance unlearning robustness in Non-IID FL. Correspondingly, we introduce GDFA, a framework that initially transitions the global model to a flat loss domain. Subsequently, relevant clients generate unlearning task vectors, which GDFA filters to retain only directionally consistent components. This process isolates shared knowledge attributes before precise removal through reverse vector aggregation, maximizing knowledge retention. Extensive experiments demonstrate that GDFA outperforms state-of-the-art methods in unlearning efficacy and efficiency across diverse datasets and architectures, with minimal accuracy loss on retained tasks.

Abstract:
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

Abstract:
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distilled a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.

Abstract:
Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX.

Abstract:
Multimodal Large Language Models (MLLMs) remain far from human-level performance in multi-view spatial reasoning, where models must establish object correspondences across view and infer coherent scene semantics. We analyze this limitation through the Transformation-Driven Visual Reasoning (TVR) task and find that Supervised Fine-Tuning (SFT) fails to capture cross-view consistency, whereas reinforcement learning (RL) fails to reliably identify key referential objects. To bridge this gap, we introduce multi-View Spatial TrAnsformation Reasoning (STAR-R1), a two-stage framework that combines process-supervised SFT with a referential-aware RL paradigm. STAR-R1 first learns structured spatial reasoning trajectories from high-quality CoTs and then uses fine-grained rewards on referential selection and answer correctness to encourage effective exploration and robust scene interpretation. Despite using only a small amount of high-quality training data, STAR-R1 surpasses state-of-the-art models with far more training data on the multi-view spatial understanding benchmarks TVR, MMSI-Bench, MindCube-Bench, and SPAR-Bench. Our study reveals the overlooked potential of RL in multi-view spatial understanding and points a way toward potentially achieving more human-like spatial reasoning in MLLMs.

Abstract:
Despite remarkable progress in text-to-image diffusion models, controlling the semantic and spatial relationships between interacting instances remains a fundamental challenge. Current methods that inject spatial constraints often fail to model the intrinsic functional dependencies between entities, leading to implausible interactions. In this paper, we introduce Semantic Derivative Flow (SDF), a novel graph-guided framework that structures the diffusion process within a directed acyclic interaction graph. Our core innovation is a theoretically-motivated derivative attention mechanism, which explicitly enforces the semantic representation of a predicate to be derived from its subject, and the object from the predicate, formalizing a differentiable semantic graph. This principled approach compels the generative process to adhere to the logical chain of interaction. We further integrate a global context node and a real-time regional refinement module to ground the graph in the visual domain holistically. Extensive experiments demonstrate that our model, an instantiation of SDF, establishes a new state-of-the-art in fidelity and controllability on the HICODet benchmark. We complement our empirical results with a theoretical analysis, framing our method as structured message passing on interaction graphs, which provides a rigorous justification for its efficacy and generalization benefits.

Abstract:
Federated Learning (FL) enables collaborative model training while preserving privacy, but faces challenges with client data heterogeneity and domain shifts during deployment. Although Personalized Federated Learning (PFL) mitigates heterogeneity, it typically requires labelled data from target clients, which is an impractical assumption. Test-Time Adaptation (TTA) offers label-free adaptation, yet its direct use in a continual federated setting risks destabilizing the global model and causing catastrophic forgetting. To address this, we consider the Federated Continual Test-Time Adaptation (FedCTTA) setting, where unlabeled clients arrive sequentially, requiring online adaptation and continuous global model updates. We propose BPFedCTTA, a framework that employs Bayesian Prior-guided Adaptation (BPA) for stable local adaptation via Maximum a Posteriori estimation, and Uncertainty-Gated Single-client Aggregation (UGSA) to selectively integrate updates based on client uncertainty. This approach balances adaptation with knowledge retention, thereby mitigating forgetting. Extensive experiments on cross-domain classification and segmentation show BPFedCTTA outperforms existing FL, PFL, and TTA methods in sequential adaptation and global model improvement. The source code will be made public upon acceptance.

Abstract:
In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift.

Abstract:
Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then render differentiable per-pixel uncertainty map via efficient, single-pass rasterization. This map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.

Abstract:
Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing VideoLLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process--lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.

Abstract:
Achieving dexterous grasping remains a key challenge in robotics. Recent generative approaches enable diverse grasps through large-scale data-driven training, yet they often neglect geometric priors of objects, which leads to low data efficiency and poor physical plausibility. We propose GeoDexGrasp, a geometry-aware generation framework for dexterous grasping built upon object-centric geometric representations. We introduce a SIM(3)-equivariant network equipped with a self-supervised disentanglement strategy to extract interpretable and transferable geometric features, including shape, size, pose, and interaction direction. The overall generation process is then decomposed into two stages: first, root rotation generation conditioned on pose and interaction direction; second, hand grasp generation guided by shape and size. By leveraging geometric representations, GeoDexGrasp achieves SOTA physical plausibility (reducing 40% penetration depth) across five datasets, and exhibits improved data efficiency. Additionally, GeoDexGrasp is also lightweight (using less than 20% of the parameters of the previous SOTA method) and attains a comparable grasp success rate.

Abstract:
Condensing the large-scale, high-resolution ImageNet-1K dataset remains a challenge for dataset distillation (DD). Existing methods typically match batch normalization (BN) statistics, i.e., mean and variance, between real and synthetic datasets. Although effective with soft labels, their performance degrades substantially under hard labels. In this paper, we theoretically identify that BN matching mainly aligns the scales of real and synthetic gradients but overlooks their directions. However, experimental evidence demonstrates that gradient direction, rather than scale, is pivotal to model training, clarifying the limitations of prior methods. Building on this insight, we introduce Orthogonal Gradient Matching (OGM), which explicitly aligns the intrinsic direction of gradients, i.e., singular vectors. Specifically, OGM first orthogonalizes real and synthetic gradients by setting all singular values to one, eliminating their scales, and then minimizes the distance between these orthogonal gradients so that their singular vectors coincide. To further reduce computation, OGM employs a least-squares loss whose gradients can be obtained in the forward pass, avoiding back-propagation. Extensive experiments on ImageNet-1K validate the effectiveness of OGM. With only ten images per class (IPC = 10), OGM achieves 47.0% accuracy with soft labels and 16.7% with hard labels, outperforming training-based DD methods and RDED.

Abstract:
Due to the prohibitive cost of data annotation and the inability to obtain sufficient sample data for all defect categories, municipal sewer pipe defect detection poses significant generalization challenges for traditional models. Multi-Label Zero-Shot Learning (ML-ZSL) offers a viable solution to address this challenge. However, existing methods struggle to establish robust and fine-grained visual-semantic alignment between the complex visual environment inside the pipes and the often sparse semantic descriptions, leading to a critical issue: Alignment Ambiguity. To mitigate this, we propose a novel Steering-Fusion-Refining Network (SFR-Net) that follows a three-stage paradigm to progressively dissolve this ambiguity. This is achieved as the Representation Steering (RS) module first integrates a parameter-efficient feature steering mechanism to continuously adapt the representation to the pipe scene; the Multi-Granularity Evidence Fusion (MEF) module subsequently aggregates unambiguous multi-granularity visual evidence through decoupled global and local paths; and the Generalized Relational Score Refining (GR) module ultimately learns and transfers relational logic from seen defects to gain a universal score correction ability, directly refining preliminary prediction scores and significantly boosting the model's zero-shot generalization and prediction consistency. Extensive experiments on the public Sewer-ML dataset and our private WZ-Pipe dataset demonstrate that the proposed SFR-Net achieves state-of-the-art (SOTA) performance in multi-label zero-shot learning task.

Abstract:
Vision-Language Models (VLMs) are increasingly adopted for medical applications, but their clinical utility is limited by a core weakness in quantitative reasoning. This limitation affects tasks ranging from regression of lesion sizes to prediction of bounding-box coordinates and stems from the discrete tokenization schemes underlying Large Language Models (LLMs). To address this, we propose Rotary Number Encoding and Decoding (RNED), a principled method for embedding continuous numerical values directly in the representation space of a VLM. Analogous to rotary position encoding, RNED represents a scalar by applying a number-specific rotation matrix to a dedicated numeric token embedding. This norm-preserving transformation maintains ordinal structure over a wide numerical range and integrates seamlessly with pretrained model weights. For decoding, we introduce a robust score-matching-based scheme to recover continuous values from hidden states in the presence of stochastic noise. We evaluate RNED on two quantitative tasks: radiological measurement estimation and medical visual grounding. On both internal and public benchmarks, RNED consistently outperforms existing VLM baselines. Together, these results show that RNED offers a robust, generalizable solution for numerical reasoning in medical VLMs, enabling models that are both quantitatively reliable and clinically applicable.

Abstract:
3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging . We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness.

Abstract:
Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers.We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching.Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.

Abstract:
Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance the robustness of diffusion planners through reward-oriented optimization in a generation-evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, hindering the exploitation efficiency of informative rewards during fine-tuning. To resolve this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves 10 times faster rollout compared to native nuPlan. Extensive experiments shows that PlannerRFT yields state-of-the-art performance with distinct behaviors emerging during the learning process.

Abstract:
Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view bare-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

Abstract:
Text-conditioned human motion in-betweening leverages keyframes for spatio-temporal control, with text providing high-level semantic guidance for the transitions. However, existing methods are unable to establish a coherent alignment between textual semantics and the spatio-temporal constraints provided by keyframes, often resulting in insufficiently constrained motions with unintended behavior. Moreover, they struggle with precise spatial control, often generating motions that deviate from keyframe constraints. To address these issues, we propose a multi-level diffusion framework that integrates textual semantics with implicit cues from keyframe sequences to modulate global motion dynamics, while leveraging individual keyframes to guide local transitions around them. During inference, to ensure strict keyframe adherence, we propose a novel trajectory refinement strategy that adjusts the root positions of the generated motion, followed by diffusion imputation to refine the poses of the generated keyframes. Additionally, our framework enables semantics-preserving motion editing, allowing for plausible modifications while retaining the original motion semantics. Extensive experiments demonstrate that our method generates high-quality motions that strictly satisfy keyframe constraints while achieving precise semantic alignment.

Abstract:
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present Helios: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. Helios achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings Helios achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy Helios as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The Helios platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. Helios source code, training data, and pre-trained weights are available at REDACTED.

Abstract:
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth--perception, understanding, and interaction--and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

Abstract:
Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset are available at https://mias.group/CarlaOcc.

Abstract:
Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into \SEG tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 \mathcal J &\mathcal F in ReVOS and +10.0 \mathcal J &\mathcal F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 \mathcal R ).

Abstract:
Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task.We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space.Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations.By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility--three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.

Abstract:
The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

Abstract:
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision - expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with 800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.

Abstract:
Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability.To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM).Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence.We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM.Qualitative results further highlight its strong open-vocabulary and long-tail generalization.

Abstract:
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.

Abstract:
Digital watermarking is a cornerstone for copyright protection. With the rapid advancement of generative models like diffusion models, in-generation and training-free watermarking techniques have garnered more attention for their endogeneity and convenience. These methods typically embed a watermark into the initial noise, where watermark extraction relies on Denoising Diffusion Implicit Models (DDIM) inversion. However, the computationally intensive extraction process severely hinders their path toward practical deployment. To overcome this critical bottleneck, we propose GROW, a novel training-free paradigm that reframes watermarking from a one-shot "embedding" to a progressive "growth". By progressively guiding using frequency-domain gradients, GROW naturally weaves the watermark into the image, which enables inversion-free extraction. Comprehensive experiments on multiple datasets show that GROW not only achieves superior robustness and imperceptibility but also offers a detection speed nearly 100x faster than inversion-based techniques. The code will be made publicly available.

Abstract:
Robotic manipulation from visual observations remains challenging due to the lack of 3D consistent representations that can generalize across diverse viewpoints and sensor configurations. Existing approaches often rely on masked autoencoders or neural scene representations, which fail to capture cross view correspondences. Crucially, while multi-view diffusion models have recently shown tremendous success in 3D aware generative synthesis, their powerful representations offer a promising direction for achieving viewpoint robust visuomotor control. In this paper, we introduce DiffuView, a novel framework that learns unified 3D aware representations through multi-view diffusion pretraining and deploys them for imitation learning. Specifically, DiffuView models the conditional generation of target views given source observations within a diffusion framework, enabling the network to implicitly recover scene geometry and enforce view consistency. The pretrained diffusion network is then utilized as a powerful visual backbone for an action policy, allowing robust control under varying viewpoints and visual conditions. We evaluate DiffuView on two challenging benchmarks, MetaWorld and Libero. Extensive experiments in both simulation and realworld scenarios demonstrate that DiffuView achieves superior generalization, improving success rates under viewpoint shifts by nearly 20% compared with existing methods.

Abstract:
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.

Abstract:
We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristic, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.

Abstract:
Model Merging (MM) has emerged as a scalable paradigm for multi-task learning (MTL), enabling multiple task-specific models to be integrated without revisiting the original training data. Despite recent progress, the reliability of MM under test-time distribution shift remains insufficiently understood. Most existing MM methods typically assume that test data are clean and distributionally aligned with both the training and auxiliary sources. However, this assumption rarely holds in practice, often resulting in biased predictions with degraded generalization. To address this issue, we present BD-Merging, a bias-aware unsupervised model merging framework that explicitly models uncertainty to achieve adaptive reliability under distribution shift. First, BD-Merging introduces a joint evidential head that learns uncertainty over a unified label space, capturing cross-task semantic dependencies in MM. Second, building upon this evidential foundation, we propose an Adjacency Discrepancy Score (ADS) that quantifies evidential alignment among neighboring samples. Third, guided by ADS, a discrepancy-aware contrastive learning mechanism refines the merged representation by aligning consistent samples and separating conflicting ones. Combined with general unsupervised learning, this process trains a debiased router that adaptively allocates task-specific or layer-specific weights on a per-sample basis, effectively mitigating the adverse effects of distribution shift. Extensive experiments across diverse tasks demonstrate that BD-Merging achieves superior effectiveness and robustness compared to state-of-the-art MM baselines.

Abstract:
Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generate the unseen visual contents. The principle behind SeeU is a new 2D\to4D\to2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D\to4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D\tocontinuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D\to2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generation, demonstrating strong potentials in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing. All data and code will be public at \href https://yuyuanspace.com/SeeU/ here .

Abstract:
Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as sparse combinations of concept embeddings. However, by ignoring the hierarchical structure of semantic concepts, these methods may produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding & Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and performs hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we introduce a geometric construction for the corresponding hierarchy of embeddings. Under the assumption that the true concepts form a rooted path in the hierarchy, we derive sufficient conditions for their recovery in the embedding space. We further show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas standard sparse coding fails. Experiments on real-world datasets show that HCEP improves concept precision and recall compared to existing methods while maintaining competitive classification accuracy. Moreover, when the number of samples available for concept estimation and classifier training is limited, HCEP achieves superior classification accuracy and concept recovery. Our results demonstrate that incorporating hierarchical structure into sparse concept recovery leads to more faithful and interpretable image classification models.

Abstract:
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints (LAOF), a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10%. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1% of action labels.

Abstract:
The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.

Abstract:
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

Abstract:
Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.

Abstract:
The deployment of autonomous agents in Graphical User Interface (GUI) environments confronts significant challenges, notably error accumulation in long-horizon tasks and the severe consequences of irreversible operations. While critic models that provide real-time action assessment offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for computer use.To bridge these gaps,we introduce OS-Oracle that makes three core contributions:(1) a scalable data pipeline for synthesizing cross-platform GUI critic data;(2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms.Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves impressive performance,and further reduces error rates, which improves the capability of GUI agents in dynamic environments. All codes, data and checkpoints will be made public.

Abstract:
Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) The External Prior Arbitration (EPA) module, which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI) module, which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR) module, which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.

Abstract:
Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.

Abstract:
Recent feed-forward reconstruction models like VGGT and \pi^3 achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system byaligning predictions across consecutive temporal windows. We observe that simple similarity transformation (Sim(3)) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps.Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos.Our code will be released publicly.

Abstract:
Transformers rely on explicit positional encoding to model structure in data. WhileRotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations such as fine-grained spatial relationmodeling, color cues, and object counting. This paper identifies key limitationsof standard multi-dimensional RoPE--rigid frequency allocation, axis-wise independence, and uniform head treatment--in capturing the complex structural biasesrequired for fine-grained image generation. We propose HARoPE, a head-wiseadaptive extension that inserts a learnable linear transformation parameterized viasingular value decomposition (SVD) before the rotary mapping. This lightweightmodification enables dynamic frequency reallocation, semantic alignment of rotaryplanes, and head-specific positional receptive fields while rigorously preservingRoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPEconsistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principledand adaptable solution for enhancing positional awareness in transformer-basedimage generative models.

Abstract:
We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in humam-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents physical waypoints as discrete tokens, seamlessly integrated into the LLM's space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitates modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, our AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.

Abstract:
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence.

Abstract:
We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge--transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone areas. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.

Abstract:
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce Infinity-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two anchor tokens, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish Infinity-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that Infinity-RoPE consistently surpasses previous autoregressive models in overall VBench scores while requiring only a fraction of their KV cache budget.

Abstract:
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.

Abstract:
Robust 3D environmental perception is critical for applications such as autonomous driving and robot navigation. However, optical sensors such as cameras and LiDAR often fail under adverse conditions, including smoke, fog, and non-ideal lighting. Although specialized radar systems can operate in these environments, their reliance on bespoke hardware and licensed spectrum limits scalability and cost-effectiveness. This paper introduces Rascene, an integrated sensing and communication (ISAC) framework that leverages ubiquitous mmWave OFDM communication signals for 3D scene imaging. To overcome the sparse and multipath-ambiguous nature of individual radio frames, Rascene performs multi-frame, spatially adaptive fusion with confidence-weighted forward projection, enabling the recovery of geometric consensus across arbitrary poses. Experimental results demonstrate that our method reconstructs 3D scenes with high precision, offering a new pathway toward low-cost, scalable, and robust 3D perception.

Abstract:
Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.

Abstract:
Omnidirectional image super-resolution (ODISR) remains challenging due to extreme magnification factors (e.g., 8x, 16x) and projection-specific distortions, which degrade edge integrity and limit model performance. This paper proposes an edge-focused framework combined with spherical geometric augmentation to address these issues. Our approach includes an Edge Focused Block (EFB) that integrates spatial-channel attention via Edge Enhanced and Refined Blocks, strengthening edge feature capture and optimization. We also design an Edge-Aware Multi-Scale (EAM) pipeline, leveraging shallow convolutions for initial feature extraction, local modules for deep mining, and a Global Integration Block for multi-scale aggregation, ensuring coherent edge reconstruction in distorted regions. To mitigate data scarcity, we introduce a rotation-translation augmentation strategy based on spherical projections, expanding datasets while preserving scene continuity. Extensive experiments show our method outperforms state-of-the-art approaches on public datasets.

Abstract:
Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, enabling more versatile 3D perception.

Abstract:
Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

Abstract:
Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step.However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Abstract:
Unsupervised Anomaly Detection (UAD) is crucial for industrial quality control. Many existing embedding-based methods adopt a single-prototype assumption and learn, for example, a compact hypersphere to enclose all normal features. However, this strategy breaks down under intra-class variance caused by changes in illumination, pose, or texture. To cover all diverse normal samples, a single prototype may induce an overly loose decision boundary, making subtle anomalies hard to detect. To overcome this limitation, we propose PGBL (Prototype-Guided Boundary Learning), a framework that synergizes structured representation learning with targeted anomaly synthesis. First, the Multi-Prototype Compact Constraint (MPCC) module models the normal feature distribution as a mixture of multiple semantic prototypes, enabling tighter local representations for each normal sub-pattern. Second, instead of blind anomaly synthesis, the Boundary-Aware Anomaly Synthesis (BAAS) module generates pseudo-anomalies at the topological boundaries between MPCC clusters. Finally, a Discriminative Boundary Refiner (DBR) learns to shape the final decision surface by separating normal clusters from the synthesized anomalies. Extensive experiments on MVTec-AD, VisA and Real-IAD show that PGBL consistently outperforms prior methods in both anomaly detection and localization.

Abstract:
Structure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes.Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress, or pinpoint where improvements are most needed.To address this gap, we introduce a new benchmark for evaluating camera pose estimation.Our key insight is to leverage online panoramic 360deg as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery.The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects.By tracking camera motion across full 360deg videos, we crop and reproject selected portions to generate perspective-view clips that serve as our benchmark---ORBIT---a diverse collection of 100 video clips.Experiments show that COLMAP and other state-of-the-art SfM methods struggle to accurately estimate camera positions on our benchmark, indicating that it remains a challenging and open problem space for future research.As a result, ORBIT provides a valuable testbed where researchers can meaningfully compete and measure progress on truly challenging, real-world SfM problems.

Abstract:
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by 3.3x, eg., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.

Abstract:
This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability but still remain concerns because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple grains, while generating reasoning that closely aligns with humans. Pretrained model and data will be openly available.

Abstract:
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining.

Abstract:
We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.

Abstract:
Floorplan localization aims to estimate the camera pose of a query image with respect to a 2D floorplan, providing a lightweight and long-term stable alternative to localization based on 3D maps or large image databases for indoor robotics and AR. Recent methods frame the problem as ray-based matching, representing the image as a set of rays annotated with depth or semantic labels and aligning them with the floorplan. However, they still face challenges in addressing the complexity of indoor environments, which can be decomposed into environmental, geometric, and semantic ambiguities.To address these ambiguities, we propose a floorplan-aware probabilistic fusion framework that models both depth and semantic information within a unified architecture. Our framework also combines a distribution-based ray confidence estimator, which down-weights uncertain geometric hypotheses, with a probabilistic semantic matching scheme based on Jensen-Shannon divergence (JSD), which preserves and leverages informative semantic ambiguity instead of collapsing it into hard labels. Experiments on challenging benchmarks demonstrate that our approach significantly outperforms prior methods in both robustness and accuracy.

Abstract:
Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearables sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.

Abstract:
Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one task and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA seamlessly integrates into state-of-the-art baselines during training, adding no extra computational complexity at runtime. Extensive experiments on the nuScenes and TOD3Cap datasets show that MTA significantly outperforms state-of-the-art baselines in both tasks, achieving a 10.7% improvement in challenging rare perception scenarios and a 9.2% improvement in captioning. These results underscore the effectiveness of unified alignment in reconciling BEV-based perception and captioning.

Abstract:
Point cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling of global structures. Extensive experiments on multiple datasets demonstrate that our approach consistently outperforms state-of-the-art methods.

Abstract:
Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

Abstract:
Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from the multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot well utilize 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method enables a better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.

Abstract:
Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly vital for complex task automation. However, their capacity for self-driven decision-making introduces significant, yet underexplored, security risks, among which backdoor attacks pose a particularly stealthy and high-impact threat. Prior work has shown GUI agents vulnerable to such attacks, but existing methods rely on static trigger-action mappings that execute fixed, context-agnostic behaviors, making them highly detectable. To address this limitation, we introduce AdapAction, a novel backdoor attack that subverts the agent's decision-making by embedding an adaptive, context-aware policy. Unlike traditional approaches, AdapAction enables the agent to autonomously select environmentally coherent malicious actions based on the current GUI state and user instruction, thereby evading detection while preserving functional utility. Extensive experiments on the Android-In-The-Zoo (AitZ) and AndroidControl benchmarks show that AdapAction achieves up to 100% Attack Success Rate (ASR) while preserving benign task utility. More critically, AdapAction consistently evades a multi-principle-based LLM defense evaluating instruction alignment, visual coherence, and safety, whereas traditional fixed-action attacks are easily detected. This resilience stems from AdapAction's contextually grounded malicious actions, which are semantically and visually indistinguishable from legitimate operations. As a result, AdapAction exhibits exceptional stealth and poses a significantly greater real-world threat to LLM-powered GUI agents.

Abstract:
Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention sampler that extract richer context from non-tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning.

Abstract:
Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, which are unable to generalize to complex real-world settings. We introduce Dynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, Dynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, Dynamics achieves a segmentation IoU of 0.30, a 7ximprovement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Further, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

Abstract:
Cross-domain few-shot image interpretation (CD-FSII) has been significantly advanced by fine-tuning pre-trained visual feature models using limited labeled samples in target domains. However, profound cross-domain distribution discrepancies, along with inherent conflicts between extensive object visual appearance variations and limited annotations, trap those existing pure visual feature representations into some non-transferable short-cut patterns, thus degrading their cross-domain generalization capacity. To mitigate this problem, we present a simple yet effective cross-modal visual feature enhancement framework which primarily contributes in the following three aspects. 1) We make the first attempt to introduce linguistic descriptions of image attributes to regulate the pre-trained visual feature model for specific target image adaptation. Specifically, image-level attributes (e.g., object appearance in individual images) and domain-level attributes (e.g., overall style and background characteristics of the dataset) are extracted using a pre-trained image captioning model and a large language model (LLM), respectively, to construct comprehensive linguistic characterizations. 2) A lightweight residual cross-attention scheme is developed to seamlessly embed linguistic descriptions of image attributes into visual feature representations, thereby compensating for the limitations of purely visual cues in capturing cross-domain transferable high-level semantic characteristics. 3) The proposed framework is task-agnostic and can be seamlessly integrated with off-the-shelf pre-trained visual feature models. It demonstrates superior generalization performance compared to several state-of-the-art methods across multiple CD-FSII benchmarks, including image classification, semantic segmentation, and object detection. We will release all code and data to facilitate further research.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

Abstract:
3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. Thus we propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H^2MM): We propose part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design the geodesic diffusion process that preserves the hierarchical and semantic structure of H^ 2 MM, and progressively generates semantics from conditions and generates object under their joint guidance. We use an adaptive tree-structured neural network to loosen the constraint of jointly generating nodes and edges in previous hyperbolic diffusion. (3) Hyperbolic Diffusion Model Solver: We leverage higher-order Riemannian gradient on hyperbolic manifolds for designing a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. Extensive experiments demonstrate that our method achieves superior quality and efficiency.

Abstract:
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released.

Abstract:
Reducing the token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common approach is to build high-compression image tokenizers that store more information by allocating more channels per token. However, when trained solely with reconstruction objectives, high-dimensional latent spaces often fail to maintain meaningful structure, which complicates diffusion training. Existing methods introduce additional objectives, such as semantic alignment or selective dropout, to enforce structure, but typically require costly retraining of the diffusion model. In contrast, pretrained diffusion models already exhibit a structured, lower-dimensional latent space, suggesting a simpler strategy: expand the latent dimensionality while preserving this structure. To this end, we propose Detail-Aligned VAE (DA-VAE), which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the diffusion backbone. DA-VAE imposes an explicit latent layout: the first C channels are inherited from the pretrained VAE at a base resolution, while additional D channels encode high-resolution details. We introduce a simple yet effective detail-alignment mechanism that explicitly regularizes the expanded latent space to preserve the structural properties of the original space defined by the first C channels. Finally, we present a warm-start fine-tuning strategy that enables 1024 x1024 image generation with Stable Diffusion 3.5 using only 32 x32 tokens, 4xfewer than the original model, within a compute budget of 5 H100-days. It further enables 2048 x2048 generation with SD3.5, achieving a 6xspeedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.

Abstract:
Transforming sparse, partial pixel sketches from diverse media into complete, editable vector drawings is essential yet underexplored in digital creation. Prior methods either generate from scratch or inpaint local gaps without predicting global structure, leading to coarse contours and limited detail. To address this, we introduce SketchRevive, a two-stage framework for fine-grained pixel-to-vector sketch completion that couples diffusion-based pixel completion with MLLM-driven refinement and vectorization to produce coherent, detail-faithful SVG results. Specifically, we first construct a practical benchmark by augmenting stroke-annotated sketches from paper and whiteboards. Stage I trains a diffusion model with a line-distribution head to predict per-pixel stroke presence, producing structurally and visually consistent completions. Stage II fine-tunes an MLLM for structure-aware SVG vectorization with iterative refinement, optimized by instance-level stroke attribute similarities. To align key clues, e.g., spatial structure, appearance details across both stages, we introduce a diffusion-prior aggregated encoding module by injecting multi-scale UNet features from Stage I into the MLLM's visual embeddings and using line prediction logits for token compression to prioritize informative tokens. Experiments indicate that SketchRevive completes topology-coherent vector outputs with high fidelity and recognizability while preserving user intent, suitable for interactive creation and artistic design.

Abstract:
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention that allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, preventing representation collapse while anchoring predictions to observable states. Predicted foresights are modulated with observations to condition a diffusion policy for action generation. On LIBERO-LONG benchmark, CLaD achieves 94.7% success rate, competitive with large VLAs with significantly fewer parameters.

Abstract:
Fine-tuning vision-language models such as CLIP has become the mainstream paradigm for multi-label image recognition, and prompt tuning is widely adopted due to its lightweight parameter cost and strong transferability. However, we find that when these methods use Binary Cross-entropy as the supervision loss, the model's confidence structure becomes systematically distorted, leading to pronounced miscalibration. Existing calibration techniques, such as temperature scaling or regularization-based methods, largely fail in multi-label settings because they cannot capture inherent semantic dependencies between classes, nor can they correct the global structural shifts introduced during fine-tuning. To address this issue, we propose Class-wise Covariance Regularization, which aligns the predicted covariance structure of class confidences with the semantic correlations encoded in pretrained text embeddings. This alignment preserves the geometric consistency of the class space throughout fine-tuning, resulting in more stable and interpretable confidence distributions across categories. Experiments on multi-label benchmarks show that CCR significantly reduces calibration errors while maintaining or even improving recognition performance.

Abstract:
Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge could degrade the model's core generalization and creates a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.

Abstract:
Test-time "thinking" (i.e., generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision-language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating 10 variants from the InternVL3.5 and Qwen3-VL families on MMMUval under generous token budgets and multi-pass decoding. We show that more thinking is not always better: long chains often yield long-wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short "lookback" phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.

Abstract:
Offline Multi-Agent Reinforcement Learning (MARL) is critical for coordinating multiple agents in costly and unsafe environments, yet existing methods struggle with high sensitivity to reward functions and weak generalization to new goals, limiting its practical impact. Inspired by single-agent Offline Goal-Conditioned RL (OGCRL), we propose the first goal-conditioned offline MARL framework, extending OGCRL to multi-agent settings under both fully decentralized and centralized training with decentralized execution (CTDE) paradigms. To systematically evaluate this setting, we introduce MangoBench, the first fully cooperative multi-goal benchmark for MARL, covering 3 environments, 4 agent types, and 47 tasks, designed to assess joint-control locomotion, synchronous and asynchronous bimanual manipulation, and robustness to high-dimensional inputs. Extensive experiments demonstrate that our baselines achieve strong multi-goal generalization under sparse rewards, yet no method dominates all tasks, revealing both the intrinsic complexity and the unexplored potential of goal-conditioned offline MARL.

Abstract:
Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models.We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs.Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.

Abstract:
Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal a pronounced decline in performance as task difficulty increases, especially in the rule conflict resolution task. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.

Abstract:
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.

Abstract:
Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

Abstract:
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging--especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.

Abstract:
Egocentric walking tour videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives mitigates their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.

Abstract:
No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks.

Abstract:
We present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multi-view images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative superquadrics under view-consistent supervision. Extensive experiments demonstrate that DualPrim outperforms state-of-the-art methods in reconstruction accuracy while producing more geometrically concise, interpretable, and high-fidelity 3D meshes.

Abstract:
Diffusion models (DMs) can generate anatomically realistic medical images, offering a compelling route to improving generalization through synthetic augmentation. Yet high visual realism does not necessarily translate into improved downstream utility. This work addresses two key questions in diffusion-driven augmentation. First, what should be synthesized? We show that synthetic adversariality, namely the expected empirical loss induced by synthetic samples, is a key driver of generalization. More importantly, only native adversariality, arising from hard examples supported by the diffusion model distribution, yields consistent gains, whereas artificial adversariality induced by attack-based perturbations is detrimental. Second, how should such samples be synthesized? We propose the Adversariality Miner, a lightweight module that optimizes the initial noise to mine natively adversarial samples without modifying or retraining the diffusion model. Extensive experiments across diverse diffusion backbones and medical benchmarks confirm the effectiveness of our approach, establishing a principled path toward diffusion-driven generalization.

Abstract:
Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.

Abstract:
Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task--co-speech gesture or text-to-motion that maps a fixed utterance to motion clips--without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction.

Abstract:
Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we construct a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

Abstract:
In this paper, we propose BuildingGPT, a novel auto-regressive model for building wireframe reconstruction from point clouds with reinforcement learning.Unlike prior works based on detection or diffusion models, BuildingGPT reformulates the building wireframe reconstruction task into a sequence prediction problem.Based on the hierarchical building wireframe tokenization, the wireframe sequences are organized in a structurally- and semantically-aware order for the next-token prediction.The point cloud encoder first transforms the input point cloud into a fixed-length latent code that serves as the starting of the sequence.Then, BuildingGPT auto-regressively predicts tokens conditioned on the latent code and previously generated tokens.With token sequence predicted, the building wireframe is obtained through detokenization.To enhance the model performance, we adopt a two-stage training paradigm including the pre-training and post-training.After the auto-regressive pre-training, Direct Preference Optimization (DPO) is employed as a post-training strategy to align reconstruction results with human preferences.Extensive experiments on the large-scale MunichWF dataset show that BuildingGPT outperforms existing state-of-the-art methods.We commit to release the code and dataset.

Abstract:
CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

Abstract:
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given a image sequence, we leverage off-the-shelf visual-inertial SLAM,3D detection algorithms and VLMs to extract for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in the wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to SoTA.

Abstract:
Text-to-Image (T2I) diffusion models power modern creative tools, but their open-ended generative nature raises safety, ethical, and copyright concerns. Retraining or fine-tuning to remove every unsafe or copyrighted concept is impractical, motivating training-free interventions that suppress specific semantics while preserving general visual quality. Existing guard-railing methods face a core trade-off: they are either rigid, failing to generalize to paraphrased or context-shifted prompts, or coarse, distorting unrelated content and fidelity. We present GenErase (GENeralizable ERAsure with SEmantic Awareness), a training-free, geometry-grounded framework for robust concept removal in diffusion models. GenErase enforces semantic orthogonality in the cross-attention value space via an explicit erase-and-replace operation, guided by a per-token preserve projector and a hard geometric gate. This design enables precise erasure, explicit protection of critical semantics, and stability across layers, paraphrases, and multi-concept cases. Extensive experiments on identity, object, and style erasure, together with a new GenBench-40 benchmark, show that GenErase achieves state-of-the-art erasure fidelity and superior paraphrase-level generalization, establishing it as a practical and principled guard-rail for safe, real-time diffusion deployment.

Abstract:
The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (respectively, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples: elastic3d.github.io

Abstract:
Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.

Abstract:
CoT has significantly enhanced the reasoning ability of LLMs while it faces challenges when extended to multimodal domains, particularly in mathematical tasks.Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning.In contrast, humans repeatedly examine visual image and employ step-by-step reasoning to prove intermediate propositions.This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science.Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns.CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning.To this end, we present CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem.Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model.The resulting ViRC-7B model achieves a 18.8% average improvement over baselines across multiple mathematical benchmarks.The codes will be made publicly available.

Abstract:
Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.

Abstract:
Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640x480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the \pi_0 baseline (46.5%) and a +30.3% absolute improvement over \pi_0 augmented with textual subtask guidance (57.1%).

Abstract:
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data--typically reserved for self-supervised learning--substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our SigLino-MoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT-LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and distilled checkpoints.

Abstract:
Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC.

Abstract:
Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world systems often query against galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any-Scenario ReID (AS-ReID): given a query from any (modality, viewpoint) scenario, a single model retrieves the same identity from a heterogeneous gallery spanning all scenarios. Progress toward AS-ReID is limited by two factors: (i) the lack of a real-world-aligned benchmark with broad scenario coverage, and (ii) the challenge of learning representations with strong intra-identity cohesion and inter-identity discrimination under diverse scenarios. To this end, we construct WHU-MARS, a multispectral aerial-ground benchmark with 2,337 identities and 434,620 images captured by RGB, near-infrared, and thermal infrared cameras on both ground and UAV platforms. WHU-MARS spans day-night, multiple seasons, and diverse weather, enabling AS-ReID as well as conventional ReID tasks. We further propose the Unified Alignment and Discrimination (UAD) framework. Progressive Center Alignment (ProCA) aggregates multi-view features into modality centers and then aligns them toward identity centers to reduce scenario bias. Global Prototype Discrimination (GPD) contrasts samples against global identity prototypes to enforce large-margin discrimination. Extensive experiments highlight the challenges of WHU-MARS and demonstrate the effectiveness of UAD on AS-ReID.

Abstract:
Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page: https://research.nvidia.com/labs/dir/sage/.

Abstract:
We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously.Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains.To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on facial texture benchmarks. Codes and the pretrained model weights will be publicly released.

Abstract:
Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics,and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target localization and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner.

Abstract:
Large multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted to leverage previously observed samples for online prediction refinement. However, they store limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a multimodal Bayesian distribution learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions. Textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence. This yields a unified prediction that adapts to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4 percent average improvement.

Abstract:
Jointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias --- uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel method that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view) over the view-coherent manifold. Building on the MaVOS, we further devise a reconstruction pipeline that incorporates the per-view optimizability as a state control signal to guide the joint optimization process through three key components: dynamic view scheduling, gated positional encoding, and anti-score loss weighting. Experimental results on the benchmark dataset demonstrate that ManifoldNeuS outperforms existing methods in terms of accurate pose estimation and high-quality reconstruction, achieving robust joint optimization without known camera poses.

Abstract:
Recent advances in chart recognition have been driven by supervised fine-tuning (SFT) of vision-language models (VLMs), which unify multiple related tasks, and by diversifying training corpora. In parallel, research on leveraging large language models (LLMs) for object detection has shown that jointly training phrase grounding alongside SFT enhances a model's generative capabilities. Inspired by this, we hypothesize that chart recognition can also benefit from phrase grounding, which aligns textual phrases with chart regions. However, this setting remains underexplored due to the lack of corresponding datasets. In this work, we introduce phrase-grounding-aware SFT via a Side-Masked Attention Module (SMAM) inserted into each transformer layer of the LLM. SMAM performs masked attention within the annotated region aligned with each phrase to produce an additional logit. This logit is supervised and serves as a reference signal to guide the LLM's output prediction during fine-tuning, alongside the standard SFT objective. To enable this approach, we also develop an automated pipeline for generating phrase-to-region alignments, which augments existing datasets. Experiments show that our method effectively incorporates phrase grounding into VLM fine-tuning for chart recognition.

Abstract:
The depth precision of an indirect time-of-flight (I-ToF) camera is highly dependent on its coding scheme. However, identifying the optimal coding scheme is challenging due to the infinitely many possible combinations of modulation and demodulation functions. Although previous works have derived depth-precision metrics to guide coding-scheme design, they either do not satisfy the constraints of real-world I-ToF devices or rely heavily on large-scale deep-learning optimization. In this work, we first analyze the error-propagation process in I-ToF depth sensing and derive a new metric for guiding the design and search of coding schemes. Then we incorporate practical hardware constraints of I-ToF sensors directly into the coding-scheme design, which greatly reduces the space of feasible modulation and demodulation functions and makes metric-based search feasible. The coding schemes obtained by our search method outperform previous schemes in both simulations and real-world experiments.

Abstract:
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts. The source code, models, and dataset will be publicly available.

Abstract:
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

Abstract:
Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling a more geometry-aware renderable 3D Gaussians that is independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians via merely a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate the SOTA performance, outperforming previous methods with more view-consistent and substantially fewer Gaussian primitives.

Abstract:
It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network's decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron's well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.

Abstract:
3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based method is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT-family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering.

Abstract:
Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.

Abstract:
Personalized image generation based on user behavior reflects individual preferences with minimal user intervention. However, existing studies often rely on inaccurate profiling, high resource costs, and model-specific designs, which jointly restrict creativity, diversity, and generality. To address these limitations, we propose FAN, a novel approach that enables preference-aware personalization using only foundation encoders, without additional structures. FAN performs tailored profiling to capture user preferences and reconstructs transformer-based encoders to integrate them while preserving target fidelity. Experiments show that FAN achieves robust personalization across various foundation text-to-image models. It also extends to applications such as CLIP retrieval, unCLIP, and vision-language models, seamlessly integrating into diverse encoders without fine-tuning.

Abstract:
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \tracker, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adherent to the easy-to-hard learning principle, our contextual association mechanism operates based on two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb feature, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.

Abstract:
Post-training quantization (PTQ) with computational equivalence for Large Language Models (LLMs) have demonstrated remarkable advances, however, their application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms.

Abstract:
Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often require additional training. We address these limitations through CRAFT-LoRA, with complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.

Abstract:
Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

Abstract:
Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability:(1) Finetuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify.(2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination.To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design.MergeVLA introduces sparsely activated LoRA adapters via task masks to retain consistent parameters and reduce irreconcilable conflicts in the VLM.Its action expert replaces self-attention with cross-attention-only blocks to keep specialization localized and composable.When the task is unknown, it uses a test-time task router to adaptively select the appropriate task mask and expert head from the initial observation, enabling unsupervised task inference.Across LIBERO, LIBERO-Plus, RoboTwin, and multi-task experiments on the real SO101 robotic arm, MergeVLA achieves performance comparable to or even exceeding individually finetuned experts, demonstrating robust generalization across tasks, embodiments, and environments. Project page: https://mergevla.github.io/

Abstract:
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because prompt adaptation does not scale along with the increasing population of generated samples. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time.

Abstract:
Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.

Abstract:
Video diffusion models achieve impressive visual fidelity but remain computationally prohibitive for real-time or interactive generation due to their sequential denoising process. Recent caching methods accelerate inference by reusing outputs across timesteps, typically estimating each new output from the first-order residual, which is the difference between adjacent model predictions.To mitigate the accumulated error in caching methods, we propose D2Cache, a training-free method that leverages the smoothness of second-order residual delta, which is temporal differences between consecutive first-order residuals, to predict future timesteps more accurately. We theoretically show that this second-order correction improves prediction accuracy and effectively suppresses cumulative errors. Moreover, D2Cache adaptively scales second-order deltas using error estimates derived from timestep embeddings, maintaining accuracy across varying cache intervals.Empirically, D2Cache outperforms the state-of-the-art TeaCache across four video diffusion models (Latte, Open-Sora, LTX-video, and Wan2.1) at comparable acceleration rates, showing even larger gains under higher acceleration settings.

Abstract:
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, a general pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.

Abstract:
In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark WIPER-Bench show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three adversarial roles--Selector, Predictor, and Judger to autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE's exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.

Abstract:
In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect -- larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.

Abstract:
Federated Multi-Label Learning is a distributed paradigm where multiple clients possess heterogeneous multi-label data and perform collaborative learning under privacy constraints without sharing raw data. However, modeling label correlations under heterogeneous distributions remains challenging. Due to client-specific label spaces and varying co-occurrence patterns, correlations learned by individual clients inevitably deviate from the global structure, a phenomenon we term label correlation drift. To address this, we propose FedHarmony, a framework that harmonizes heterogeneous label correlations across clients. It introduces consensus correlation, capturing agreement among clients and serving as a global teacher to correct biased local estimates. During aggregation, FedHarmony evaluates each client by both data size and correlation quality, assigning weights accordingly. Moreover, we develop an accelerated optimization algorithm and theoretically establish faster convergence without sacrificing accuracy. Experiments on real-world federated multi-label datasets show that FedHarmony consistently outperforms state-of-the-art methods.

Abstract:
Vertical federated learning (VFL) trains models by splitting computation across clients and a server that only exchange intermediate embeddings. Recent work shows that a server even if honest-but-curious can steal a client's bottom model by querying the system and regressing on the returned embeddings, and in response, defenses perturb or decouple the embedding channel. We show these defenses remain vulnerable. We propose VENOM, a geometry-aware stealing attack. VENOM first learns a contrastive space over server-observed embeddings, then builds a neighborhood graph and trains a surrogate bottom model to match targets and respect local geometry via a neighbor-matching loss alongside pointwise and feature-shape alignment. This strategy preserves the relational structure that defenses fail to erase, effectively recoupling the embeddings produced by multi-branch and noise-based defenses. Across six datasets, VENOM consistently outperforms standard stealing methods under no defense and multiple defenses, and remains effective with out-of-distribution (OOD) auxiliary data.

Abstract:
High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. Images serve as the most fundamental modality for visual expression, and content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, facilitating high-resolution content creation.

Abstract:
Recent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions via end-to-end architectures, yet this approach entangles visual understanding with task-specific actions. This leads to an exhaustive collection of full operational sequences and parameter redundancy across tasks, while generic third-person camera setups require fine-tuning for different hardware due to implicit hand-eye assumptions. We argue that decoupling how robots see from how robots act is a missing primitive in VLA systems. We present EgoRoC, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand-eye module corrects the action in the end-effector frame for hardware-agnostic deployment. Trained once from static wrist-target image pairs with relative poses, rather than full manipulation trajectories, EgoRoC leaves downstream VLAs unchanged. By turning egocentric alignment into a reusable capability, EgoRoC reduces training redundancy, strengthens zero-shot cross-scene transfer, and scales across VLA backbones without manual calibration. Across simulation and real settings, attaching EgoRoC consistently boosts success rates, especially on long-horizon and out-of-distribution tasks, and improves data efficiency during fine-tuning.

Abstract:
Direct Time-of-Flight (dToF) sensors provide highly accurate metric depth and are more robust than indirect ToF systems in challenging real-world conditions. However, their high manufacturing cost and limited photodiode array size produce depth maps that are extremely sparse, low-resolution, and noisy, making them unsuitable for VR/XR, robotics, and 3D perception tasks that require dense metric depth. Existing monocular and depth completion methods struggle to handle the unique sampling patterns and hardware artifacts of dToF devices, and their performance often deteriorates significantly under severe sparsity or noise. We present a generalizable framework for dense metric depth completion from sparse dToF measurements, capable of operating across diverse sensor types, sparsity levels, and noise conditions. Our model employs a depth-guided dual-branch Vision Transformer encoder that processes RGB images and sparse dToF measurements separately, while a masked joint attention module allows depth tokens to reliably guide image features without being overwritten by them. A lightweight decoder reconstructs dense metric depth efficiently, without diffusion-based or refinement-heavy post-processing. To address the scarcity of paired training data, we introduce a comprehensive dToF simulation pipeline that reproduces the characteristics of flash, sub-VGA flash, and rotating sensors, including hardware-induced degradation, irregular sparsity, and realistic noise distributions. Trained entirely on synthetic data, our model achieves strong zero-shot generalization across 6 datasets and 3 real dToF devices, outperforming state-of-the-art approaches in both accuracy and computational efficiency. This establishes a robust and practical solution for dense metric depth completion from sparse direct ToF sensors.

Abstract:
Learned image compression (LIC) methods surpass traditional algorithms in rate-distortion (RD) performance, but still struggle to optimally balance effectiveness and efficiency. Moreover, although recent studies have demonstrated the effectiveness of utilizing frequency-domain information, they typically rely on fixed frequency transforms and still lack content-adaptive capabilities. Therefore, we propose an advanced spatial-frequency dual-path LIC method with enhanced adaptivity and efficiency. Specifically, for the spatial path, we introduce Cross-Sparse Window Attention, leveraging sparse, window-conditioned global tokens to efficiently model long-range dependencies. It achieves lower computational cost and superior effectiveness than standard Window-based Multi-head Self-attention. For the frequency path, we design a content-adaptive frequency transform, employing a decomposition weight generator and learnable global weights to adaptively process multi-scale frequency components. Furthermore, we propose Denoising-as-Regularizer, a training-only module that structures and smooths the latent representation via a denoising task, enhancing reconstruction quality at zero inference cost. Experiments on the Kodak, CLIC, and Tecnick datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods in both RD performance and latency.

Abstract:
Vision-language models (VLMs) with dynamic resolution vision encoders achieve strong performance, but face significant efficiency challenges due to long input sequences. A common approach is to assess the importance of tokens and prune those that are less informative. Recent methods utilizing a small VLM to provide the importance map of visual tokens have outperformed existing rule-based and similarity-driven pruning approaches, particularly under high pruning ratios. However, directly using the small VLM remains unreliable, as it utilizes the aggregated visual attention weights as importance score, which can lead to noisy guidance if the generated tokens are incorrect.To address this, we invert the approach by having it detect non-informative visual tokens according to the user's input query. By adding a variational information bottleneck in the small VLM, we can approximate the entropy of each visual token as pruning guidance. Such a posteriori-guided pruning method allows the large VLM to retain its reasoning capacity with improved efficiency. Extensive experiments on eight benchmarks demonstrate the effectiveness of our approach. With only 5% of visual tokens retained, the large VLM preserves 95% of its original performance, outperforming the state of the art by 8%.

Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in instruction following and 2D visual understanding. However, state-of-the-art VLMs, including GPT-5, still struggle with 3D perception, particularly in tasks such as monocular 3D visual grounding. While specialized vision-only models excel in this domain, they often lack the rich semantic understanding inherent to VLMs. To bridge this gap, we propose \texttt MonoVLM , a novel triple-stage training framework that enables VLMs to perform accurate monocular 3D grounding. The core of our method is a progressive training process that uses Group Relative Policy Optimization (GRPO) to gradually teach the model to first localize the described object, then understand its 3D structure, and finally estimate the full 3D bounding box accurately. Comprehensive experiments show that \texttt MonoVLM models significantly outperform existing VLMs and even surpass the performance of specialized vision-only models. We validate our design via extensive comparisons and ablation studies.

Abstract:
This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark to perform meta-evaluations of text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, i.e., whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%--96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of the human assessment, which reveals weaknesses in text-to-long video evaluation.

Abstract:
Traditional image compression prioritizes pixel fidelity but often preserves details irrelevant to downstream vision tasks. Compressing task-specific representations instead better aligns with task semantics, yet redundant information persists across correlated tasks. Existing multi-task compression methods typically rely on static dependency structures, leading to redundant bit allocation across correlated tasks and suboptimal rate-distortion performance. We present Adaptive Task Dependency Compression (ATDC), a framework that models per-image task relationships and encodes representations following an adaptive directed acyclic graph (DAG). ATDC infers pairwise task predictability via a learned correlation matrix, constructs a dynamic DAG to determine the optimal compression order, and encodes each task conditionally on its predecessors, achieving predictive redundancy removal and asymmetric information sharing across tasks. Experiments on the Taskonomy dataset demonstrate consistent gains in rate-distortion efficiency and task accuracy over both human-oriented codecs and state-of-the-art multi-task compression methods.The learned DAGs reveal interpretable, content-dependent task hierarchies, establishing adaptive dependency modeling as a principled paradigm for multi-task representation compression.

Abstract:
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

Abstract:
In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and can't be solved simply by bruteforce-scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite, and another processing all modalities together.We propose a restricted-attention mechanism, termed as `Masked-Satellite-Attention' which prevents ground/aerial images from interacting with satellite images. Further, our SkyNet is optimized with strategies inspired from curriculum-learning: sampling cameras which are far-away from each other during training. Extensive experiments on our Sky2Earth dataset reveal that SkyNet outperforms existing methods by 23% in terms of absolute performance. Our dataset, and code shall be made publicly available on huggingface.

Abstract:
The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.

Abstract:
Current semantic segmentation models are very data-hungry and require massive costly pixel-wise human annotations. Generative data augmentation, which scales the train set using generative models, provides a potential remedy. In this paper, we propose MatchMask, a novel mask-centric generative data augmentation approach tailored for label-scarce semantic segmentation. By leveraging a limited set of labeled semantic masks, MatchMask generates diverse, realistic, and well-aligned image-mask pairs, thereby enhancing the performance of semantic segmentation models. Specifically, to adapt existing text-to-image models for semantic image synthesis in the few-shot setting, we first propose a Gradient Probe Method to investigate the role of each layer in the diffusion model. On this basis, a lightweight LoRA-style adapter is designed for critical layers to enable efficient adaptation, coupled with a Layer-adaptive Cross-attention Fusion mechanism. Meanwhile, we present a robust relative filtering principle to suppress incorrectly synthesized regions. Moreover, the proposed approach is extended to MatchMask++ in the semi-supervised setting to take advantage of additional unlabeled data. Experimental results on PASCAL VOC, COCO and ADE20K demonstrate that MatchMask remarkably enhances the performance of segmentation models, surpassing prior data augmentation techniques in various benchmarks, e.g., 67.5%->74.3% mIoU on PASCAL VOC.

Abstract:
With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up.In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models.At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency.Training from scratch with the less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervisions and scalable FFRMs.

Abstract:
While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplies groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.

Abstract:
Recent work explores how real and synthetic data contribute to VLA model generalization. While the \pi-series model has shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale.This paper provides the first evidence that synthetic data alone can match the performance of the strongest \pi-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation.The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks.Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning.Using the same architecture as \pi_0, we pre-train a model entirely on InternData-A1 and find that it matches the official \pi_0 across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks.We will open-source both the dataset and the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.

Abstract:
Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code are available at github.com/insait-institute/FireScope.

Abstract:
3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow , achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by about 30% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions. Code is available in the supplementary.

Abstract:
We propose IrisFP, a novel adversarial-example-based model fingerprinting framework that enhances both uniqueness and robustness by leveraging multi-boundary characteristics, multi-sample behaviors, and fingerprint discriminative power assessment to generate composite-sample fingerprints. Three key innovations make IrisFP outstanding: 1) It positions fingerprints near the intersection of all decision boundaries--unlike prior methods that target a single boundary--thus increasing the prediction margin without placing fingerprints deep inside target class regions, enhancing both robustness and uniqueness; 2) It constructs composite-sample fingerprints, each comprising multiple samples close to the multi-boundary intersection, to exploit collective behavior patterns and further boost uniqueness; and 3) It assesses the discriminative power of generated fingerprints using statistical separability metrics developed based on two reference model sets respectively for pirated and independently-trained models, and assigns fingerprint-specific thresholds to retained fingerprints. Extensive experiments show that IrisFP consistently outperforms state-of-the-art methods, achieving reliable ownership verification by enhancing both robustness and uniqueness.

Abstract:
We propose Fresco, a unified optimization pipeline designed to mitigate early over-sharpening, and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields consistent, high-fidelity results across views. Experiments on the NeRSemble dataset validate the effectiveness of our design, outperforming previous state-of-the-art methods.

Abstract:
Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data.Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions.In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions.Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships.Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

Abstract:
Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated "anomaly onset" frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses. First, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model's next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals. Second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP-DATA and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.

Abstract:
Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from-scratch learning setting, Echo achieves a 1.3x-1.7x speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.

Abstract:
Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting overconfidence issue. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents.

Abstract:
Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

Abstract:
Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-k high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.

Abstract:
Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., HEVC, FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view 4D dataset to date, featuring more than 50 synchronized 360 cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

Abstract:
Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online (e.g., making a purchase related to that object). To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well- constructed, high-quality video-task pairs across diverse task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web benchmark show that their performance is still weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in Ego2Web and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.

Abstract:
Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.

Abstract:
Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in the generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.

Abstract:
Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives--such as elastic collisions and falling dominos--teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines.

Abstract:
Text-to-image models are increasingly utilized in design workflows, but articulating nuanced design intentions solely through text remains a challenge. This work proposes a method that extracts visual attributes from a reference image and injects them directly into the generation pipeline. Specifically, the method optimizes a text token to selectively represent the target attribute using a custom training prompt and two novel embeddings, termed distilled and residual. A wide range of attributes can be extracted through this approach, including shape, material, pose, and camera shot and angle. The effectiveness of the method is validated on various target attributes and text prompts drawn from a newly constructed dataset. The method outperforms existing approaches in selectively extracting and applying target attributes across diverse contexts. Ultimately, the proposed method enables intuitive and controllable text-to-image generation, streamlining the design process.

Abstract:
Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To tackle the intrinsic challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network FSENet, a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues; We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. At last, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.

Abstract:
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

Abstract:
Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce LLM Traversal based on Minimum Spanning Tree that selectively applies LLM reasoning for complex semantic judgments. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. Our framework consistently outperforms specialized methods across diverse clustering tasks.

Abstract:
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos.

Abstract:
Lifelong Person Re-Identification aims to adapt to new domains while preserving old knowledge. Existing methods, whether distillation-based or rehearsal-based, attempt to consolidate diverse knowledge within a fixed model architecture. However, the limited adaptability of such architectures often leads to catastrophic forgetting by overwriting previously acquired knowledge. To overcome this limitation, we propose Versatile Incremental Adaptation (VIA), a novel dynamical expansion framework for LReID, which unleashes restricted knowledge during continuous learning by large pre-trained models. Specifically, Unseen-domain person Adapter (UnA) is embedded in VIA, which employs incremental modular learning to capture the domain's specific knowledge, thereby reducing cross-domain interference and releasing task-specific capacity that is previously limited by static parameter sharing. Meanwhile, considering the substantial amount of sharing knowledge across domains in LReID, we design the Domain-aware Dispatch module to enable inter-domain collaboration and knowledge reuse through adaptive cooperation among multiple shared adapters. Besides, a Holistic Domain Controller is designed to dynamically regulate the learning capacity for new domains based on knowledge similarity, thereby effectively unleashing the generalization potential of pre-trained models. Furthermore, a lightweight Similarity-Guided Auto-Selector is proposed to assign inputs to relevant adapters during inference automatically. Extensive experiments validate VIA's effectiveness, surpassing state-of-the-art methods in both seen and unseen domains.

Abstract:
Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that enables models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features, and aligns the two through visual-physical alignment pretraining. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.

Abstract:
Diffusion models have found extensive use in solving inverse problems, by sampling from an approximate posterior distribution of data given the measurements. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few neural function evaluations (NFEs). CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, limiting their applicability to large-scale problems and making them difficult to extend to nonlinear settings. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. Specifically, we propose PnP-CM, an ADMM-based PnP solver that provides a unified framework for solving a wide range of inverse problems, and incorporates noise perturbations and momentum-based updates to improve performance in the low-NFE regime. We evaluate our approach on a diverse set of linear and nonlinear inverse problems. We also train and apply CMs to MRI data for the first time. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and produces meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming existing CM-based approaches.

Abstract:
The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.

Abstract:
Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches often predict transformation matrices for the misaligned tooth point cloud via point-to-point geometric constraints to achieve tooth alignment, ignoring the clinically specific distributions exhibited in these matrices. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is assisted by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. We validate our method on a challenge dataset from clinical practice and an extra orthodontic dataset. Its efficacy was confirmed through effective ablation studies and comparative analyses, highlighting its potential for application in orthodontic treatment.

Abstract:
3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement, an Adaptive Kalman Filter refines each primitive's position by recursively fusing noisy gradient observations with spatial priors, dynamically adjusting its covariance according to local uncertainty.This hierarchical Bayesian formulation effectively bridges probabilistic distribution modeling and uncertainty-aware optimization, resulting in improved reconstruction quality under sparse-view conditions. Experiments across multiple benchmark datasets including Tanks and Temples, MVimgNet, and LLFF demonstrate that our method consistently outperforms existing approaches.

Abstract:
We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion.To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation.Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes.We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds.Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.

Abstract:
Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution (\geq4096x4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

Abstract:
Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution.We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation.MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations.Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution.Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability.

Abstract:
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.

Abstract:
Unsupervised non-rigid point cloud correspondence aims to predict point-to-point correspondences without annotations. Existing methods leverage the spatial-relation-based feature propagation strategy that includes non-physical connections, which are sensitive to non-rigid deformation. To address this issue, we advocate to learn shape topology robust to non-rigid deformation, and propose the topology-aware feature propagation module integrated into a coarse-to-fine propagation and optimization pipeline. To extract point features robust to non-rigid deformation, we estimate keypoints as superpoints and encode superpoint features with topology weights, which learns reasonable topologies under non-rigid deformation. The vector quantization codebook is leveraged to enhance the original superpoint features with stored representative features across the dataset, improving feature robustness against shape variance. Robust point-wise correspondence is yielded after coarse-to-fine feature fusion and efficient test-time optimization. Extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of our method.

Abstract:
4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

Abstract:
Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve this, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.

Abstract:
In this paper, we introduce Geometric Algebra-Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric algebra-based attention mechanism to explicitly model ray-object interactions in complex propagation environments. GAI-GS encodes joint spatial-electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design grounds wireless ray propagation in electromagnetic principles, allowing token interactions to model key effects such as multipath, attenuation, and reflection/diffraction. Through extensive evaluations on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.

Abstract:
While multimodal large language models (MLLMs) have achieved remarkable success in recent advancements, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SAFEMLLM by adopting an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SAFEMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SAFEMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SAFEMLLM across multiple MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SAFEMLLM effectively defends against diverse attacks, maintaining robust performance and utilities.

Abstract:
Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.

Abstract:
Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods while maintaining efficient one-step inference.

Abstract:
Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains.

Abstract:
Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.

Abstract:
The appeal of RGB-T tracking lies in its resilience when RGB fails under night scenes, glare, fog, and partial occlusion. Despite notable accuracy gains, recent architectures emphasize deep fusion and large parameter counts, driving up FLOPs and bandwidth. This computational burden constrains real-time performance and limits scalability beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping redundant computation. For cross-modal interaction, we design a Holistic-Token-Guided Interaction (HTGI) module, where each modality is compressed into a compact set of holistic state tokens and injected into the other modality's modeling stream without layer-wise alignment, enabling targeted information exchange at extremely low cost. On RGB-T benchmarks, the lightweight tracker substantially reduces latency while maintaining competitive accuracy; on LasHeR, it achieves 70.2% precision and 56.3% success, running at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on an edge device.

Abstract:
Unsupervised domain adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. When source data cannot be accessed, source-free domain adaptation (SFDA) becomes a practical alternative. However, existing SFDA methods mainly rely on pseudo-label based self-training, which often accumulates noise and bias under large domain gaps. We propose VSFOT, a framework that leverages a pretrained Vision-Language Model (VLM) to guide optimal transport (OT) alignment between target features and source prototypes. Instead of relying on unreliable pseudo-labels, VSFOT employs VLM-derived semantic priors and an OT-based matching strategy to achieve stable and reliable adaptation. To further enhance domain alignment, VSFOT incorporates a bidirectional distillation mechanism in which the model learns semantic consistency from the VLM, while the VLM is refined using task-specific cues from the model. These two stages alternate during training. By combining the generalization ability of the VLM with the discriminative power of the task model, VSFOT achieves robust, source-free adaptation and consistently outperforms existing SFDA methods on four benchmark datasets. The code is available.

Abstract:
Multi-object tracking (MOT) based on unmanned aerial vehicle (UAV) aims to identify and continuously track the positions of multiple ground targets during UAV flight. Current mainstream methods utilize appearance matching and motion matching to match targets in consecutive frames. However, these methods often fail in the following scenarios: First, scenarios with multi-scale targets, where small targets have weak appearance features and small bounding boxes; second, scenarios with complex backgrounds or occlusions, where the background or occlusions interfere with the appearance features and change the bounding box size of targets; third, scenarios where the UAV lens shakes, rotates, or zooms, leading to misalignment between consecutive frames; and fourth, scenarios with high targets similarity, where the appearance features between targets are difficult to distinguish, such as vehicles on a road. To address these issues, we propose a multi-object tracking algorithm, ProgTrack, based on a multi-stage progressive matching mechanism. This algorithm simulates human eye tracking strategies, employing a progressive process of "first matching easily matched large targets, then matching difficult-to-match small targets, and finally matching the remaining mixed-scale targets." Similarly, ProgTrack employs three strategies for target matching at different scales and appearances: a simple Local Motion Information (LMI) matching strategy for large targets, a complex Context Enhancement Feature (CE-Feature) matching strategy for small targets, and a Global Motion Information (GMI) matching strategy for multi-scale targets matching, thereby achieving target matching. On the VisDrone2019 UAV tracking dataset, ProgTrack achieves MOTA, MOTP, and IDF1 scores of 40.2, 77.5, and 52.8, respectively, demonstrating state-of-the-art performance among ten methods.

Abstract:
Video crowd counting aims to predict the people count in each frame of a video. It requires effectively leveraging spatio-temporal (ST) information in videos while satisfying real-time constraints. However, most existing methods use ST information from neighboring frames through auxiliary extraction and fusion modules---resulting in large computational cost and the need to buffer multiple frames during inference. Such designs limit their practicality in real-world applications with limited computational resources or stringent real-time requirements. To address these issues, we revisit video crowd counting from the perspective of lightweight image-based counting models that enable real-time deployment under limited resources. We analytically define ST information in a model-independent and statistically interpretable manner, and incorporate it into training via a statistical regularizer that effectively enhances model performance without adding modules or inference overhead. Most framework hyperparameters are further formulated as statistical inference problems, allowing automatic estimation from data and consequently efficient adaptation to new scenarios.Our framework unifies video crowd counting and image-based counting models under a compact, principled formulation that is lightweight, portable, and efficient. We also establish theoretical foundations for adapting image-based counting models to video crowd counting and achieve state-of-the-art accuracy and efficiency across six benchmarks, including challenging DRONECROWD and VSCROWD.

Abstract:
Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.

Abstract:
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This capacity drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models (<=17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.

Abstract:
Vision-Language Models (VLMs) have shown strong performance on User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4,700 for 2K resolution), which incurs significant computational overhead and dilutes attention. In contrast, humans typically focus on regions of interest when interacting with UIs. In this work, we pioneer the task of efficient UI grounding. We propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) eliminating redundant tokens in visual encoding by constructing patch-level supervision that fuses an instruction-conditioned score with a rule-based UI-graph score, down-weighting large homogeneous regions to select distinct and instruction-relevant visual tokens; and (2) preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. To address this, we introduce a PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a 3.7% performance improvement over GUI-Actor-7B. Even with only 30% visual token retention, the performance of FocusUI-7B drops by just 3.2%, while achieving up to 1.44x faster inference and 17% lower peak GPU memory.

Abstract:
Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling.However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting.Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions.To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.

Abstract:
Unsupervised Anomaly Detection (UAD) and anomaly classification are frequently used in industrial and medical scenarios. Specifically, UAD identifies anomalous regions at the fine-grained pixel-level, while anomaly classification distinguishes anomaly types at the anomaly region level. However, existing approaches typically treat these tasks independently and sequentially, overlooking the benefits of jointly training them to suppress Local Visual Ambiguity (LVA) caused by the similarities of different types of anomalies in local visual patterns. Moreover, a multi-task learning framework cannot be directly applied to jointly train the two tasks, since UAD and anomaly classification exhibit feature preference incompatibility. To address these limitations, we propose the Prototype-Guided Semi-Supervised Feature Disentanglement (PG-SFD) framework, which makes a paradigm shift from implicit feature sharing to explicit feature disentanglement and explicitly constructs normal and category prototypes to eliminate implicit normal-abnormal semantic coupling via a Dual-Prototype Disentanglement Module (DPRM). Moreover, for cross-task feature differential injection and gradient conflict mitigation, the Differential Gated Interaction (DGI) and Geometry-Regularized Optimization (GRO) are proposed to form a cohesive framework with DPRM. PG-SFD demonstrates high effectiveness in both UAD tasks and weakly supervised classification tasks. Meanwhile, it exhibits stable performance across multiple types of datasets, including industrial and medical datasets, indicating its strong generalizability.

Abstract:
We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

Abstract:
Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and feed-forward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.

Abstract:
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their features are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, unified embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.

Abstract:
Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events - directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos such as movies, that deliberately employ storytelling techniques which omit direct depictions of certain events or relations, requiring viewers to infer them. VRR-QA comprises 1K meticulously expert-annotated QA pairs drawn from 1K creative video clips covering 15 genres across 7 decades of content, from both live-action and animated titles. Our extensive evaluations on 14 leading VideoQA models reveals consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Even the best model substantially underperforms human baselines with only 64% accuracy. Performance variations across models further illustrate the complexity and diversity of the challenges presented by VRR-QA. By releasing both dataset and data collection framework, VRR-QA establishes a rigorous, diverse, and reproducible testbed for advancing VideoQA.

Abstract:
Prevailing no-reference 3D point cloud quality assessment methods predominantly treat 2D projections and 3D point clouds as independent modalities and rely on simplistic feature fusion, thereby neglecting fundamental mechanisms underlying human 3D perception. To address this limitation, we introduce R3-PCQA (Ray-Reprojection-Reinforcement 3D Point Cloud Quality Assessor), a novel and principled framework that explicitly encodes perceptual priors into the assessment pipeline: A geometric-aware ray-based reprojection pipeline simulates viewpoint-dependent observation of 3D structure. A reinforcement-learning-based quality-salient subcloud selector adaptively attends to perceptually informative regions. The global view attention module aggregates local quality responses across viewpoints, forming a unified representation that facilitates reliable cross-view inference. Extensive experiments demonstrate that R3-PCQA achieves state-of-the-art performance on SJTU-PCQA, WPC, and WPC2.0.

Abstract:
Large multimodal language models have made rapid progress on vision-language tasks, yet their potential for zero-/few-shot object detection (ZSOD/FSOD) under a closed set of target classes has yet to be fully explored.. ZSOD/FSOD is hampered by data scarcity and catastrophic forgetting. Although vision-language models (VLMs) demonstrated excellent performance on multiple benchmarks, they typically rely on large-scale visual pre-training, which is inconsistent with the goal of FSOD to test generalization to new categories under limited supervision. We introduce AgentDet, a shared-blackboard multi-agent framework that unifies ZSOD and FSOD via pseudo-incremental learning. AgentDet decouples detection into four cooperating roles--Agent-Scout, Agent-Pinner, Agent-Curator, and Agent-Judge--which collaboratively maintain a Shared Blackboard and a Knowledge Base. For efficiency, we only train Agent-Judge by updating its image encoder and LLM-based detection head, which is a lightweight recipe that encourages generalization to previously unseen categories. On PASCAL VOC and MS COCO ZSOD/FSOD datasets, AgentDet achieves strongly competitive performance with state-of-the-art results in several settings.

Abstract:
3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.

Abstract:
Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static. It is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS). This empowers us to design a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align the perceptual saliency with the textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering (VQA) benchmarks. The results show that ours achieves dramatic gains, with average improvements by +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, results reveal that LLMind can retain up to 82%, 92% and 97% of the full-resolution performance with only 1%, 3% and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.

Abstract:
Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model (HandMDM), that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

Abstract:
Model reprogramming adapts pretrained models to downstream tasks by modifying their input and output spaces. Visual reprogramming, as a prominent instance, has been explored in pioneer works on CLIP, which introduces learnable input transformations as visual prompts to repurpose its visual-language alignment for downstream visual tasks. Existing VR methods focus on single-level alignment between prompted images and text descriptions, overlooking inherent structural information in data that facilitates alignment: semantic granularity from label hierarchies and visual granularity from multi-scale representations. To address this gap, we propose Dual Granularity Alignment (DGA) with two key components for multi-level fusion. For visual granularity, we generate multi-scale images and introduce Uncertainty-calibrated Prediction Fusion (UPF), which fuses predictions based on uncertainty estimation to capture hierarchical spatial information. For semantic granularity, we construct category hierarchies via Prototype-guided Label Hierarchization and develop Hierarchical Knowledge Propagation (HKP), which transfers superclass knowledge for coherent multi-level visual prompts alignment. Our DGA collaboratively integrates both granularities to enhance alignment effectiveness. Experiments across 12 downstream datasets demonstrate DGA's superiority over baselines on both ViT-based and ResNet-based CLIP architectures. Specifically, DGA achieves a 4.5% improvement over the previous state-of-the-art method on ViT-16-based CLIP. By explicitly modeling structural granularities, DGA establishes a new paradigm for visual reprogramming.

Abstract:
We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

Abstract:
Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on Meta-World and Distracting Control Suite demonstrate the effectiveness of our approach.

Abstract:
Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem remain limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system oper- ates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art per- formance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous ap- proaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vo- cabularies define an upper bound. Ablations further confirm that advanced prompting techniques and built-in rea- soning mechanisms significantly enhance naming quality. Additionally, we show that carefully engineered prompts enable open-source LMMs to match proprietary counter- parts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code and relevant prompting guidelines will be released.

Abstract:
Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance--a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination.

Abstract:
Most post-training methods for text-to-image samplers focus on the model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior results. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families.Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.

Abstract:
Physical adversarial examples pose a concrete threat to real-world computer vision systems. Existing works mainly generate physical adversarial examples by affixing patches or projecting light onto targets, which are usually visible and can expose the malicious intention. In this work, we reveal a new attack surface that generates invisible adversarial samples by injecting signals into the camera's power supply. We analyze the mechanism of injecting structural stripe patterns into cameras and demonstrate the feasibility of controllable fine-grained injection with signal modulation. We develop a simulation model to emulate the physically injected perturbation, and propose end-to-end optimization methodologies in both white-box and black-box settings to generate the injection signal parameters. We perform a simulated evaluation across seven classification models and carry out physical signal injection experiments with optimized signals. The results show that physical adversarial examples generated through camera power signal injection can disrupt computer vision performance. Our work introduces a new methodology for physical adversarial examples, emphasizing the need for securing computer vision systems in the physical world.

Abstract:
Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like \pi_0 showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism--the vector field dynamics--presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics.We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel \tau-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.

Abstract:
Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains as a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

Abstract:
We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.

Abstract:
Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

Abstract:
Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

Abstract:
Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Even its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.

Abstract:
While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images," i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

Abstract:
We introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates high-fidelity dynamic scenes within 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

Abstract:
Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.

Abstract:
We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.

Abstract:
Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes--a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions.Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.

Abstract:
Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context -- a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.

Abstract:
The performance of multimodal learning systems, particularly in high-stakes domains like automated depression recognition, is fundamentally constrained by the challenge of learning robust visual representations from limited and complex clinical data. To overcome this, we introduce Cross-Modal Guided Visual Synthesis (CMG-VS), a novel training framework that internally enhances the learning process by synthesizing new, task-relevant visual features. At its core, CMG-VS leverages the rich context from audio and text modalities to guide a conditional generative model. This model learns the intricate mapping from speech and language to visual expression, generating a diverse manifold of plausible visual behaviors to enrich the training distribution. Crucially, this synthesis is not a separate pre-processing step. Through a task-guided joint optimization scheme, the generative process is dynamically steered by the downstream multimodal recognizer's performance. This closed-loop feedback mechanism ensures the synthesized visual features are optimized to be maximally discriminative for the recognition task, rather than merely realistic. Comprehensive experiments on the widely-used DAIC-WOZ and E-DAIC benchmark datasets demonstrate that CMG-VS significantly outperforms existing state-of-the-art methods across all standard regression and classification metrics. Ablation studies further validate that our task-guided synthesis is the key driver of this performance gain, proving its effectiveness as a new paradigm for robust multimodal representation learning.

Abstract:
Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect.This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions.To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model.In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model.We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes.Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model.Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

Abstract:
Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.

Abstract:
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.

Abstract:
Current 3D human animation methods fail at photorealism: kinematics-based approaches lack non-rigid dynamics like clothing, while methods reconstructing from generated videos suffer from low-quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion and residual non-rigid motion. Then, we use a pretrained video diffusion model to restore a coarse rendering from the mesh-rigged animation, which provides supervision for the motion field. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, our core technical contribution is self-guided stochastic sampling, which effectively solves the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of a realistic 4D motion field. Ani3DHuman achieves state-of-the-art results, and our ablations validate that both components of our sampler are essential for high-fidelity restoration.

Abstract:
Cross-modal biomedical signals such as pathology and genomics can provide richer and more robust semantic guidance for medical image representation learning. However, the availability of such guidance remains limited, as privacy constraints and acquisition costs severely restrict access to medical images paired with other biomedical data. A further challenge lies in modality discrepancy, which introduces intra-modal statistical bias and cross-modal noise, thereby degrading the quality of medical image representations. To address these challenges, we propose KAMP, a large language model (LLM)-driven multimodal pretraining framework for medical image representation learning. KAMP leverages textual priors as the semantic anchor to enhance medical image representations and align them with multimodal biomedical representations, enabling the learning of rich and robust features even when paired data are scarce. KAMP operates in three stages. First, the LLM generates personalized diagnostic knowledge from patient clinical text and imaging metadata. This knowledge is injected as a prior to enrich medical image representations and serves as a semantic anchor to reduce the representation gap between medical images and other biomedical modalities. Second, the LLM is optimized using Group Relative Policy Optimization (GRPO), with the cross-modal aligner pretrained in the first stage serving as the reward model. Third, the refined knowledge is used to retrain the cross-modal aligner, yielding more robust medical image representations while mitigating bias and noise introduced by other modalities. Comprehensive evaluations on brain, bladder, and liver cancer datasets demonstrate that KAMP outperforms existing methods in most downstream few-shot classification tasks.

Abstract:
Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

Abstract:
In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features--yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l_2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.

Abstract:
Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token "probe-then-select" strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy-efficiency frontier compared to fixed global predictors and caching-only baselines.

Abstract:
Recently, the introduction of Chain-of-Thought (CoT) has largely improved generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the visual context consistency with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.

Abstract:
Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content.This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods.However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persist.To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found.However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model.To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses.Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance.Our experiments demonstrate that TINA successfully regenerates erased concepts from models treated with state-of-the-art unlearning.The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.Code will be released upon acceptance.

Abstract:
Diffusion-based algorithms have emerged as promising techniques for weight generation. However, existing solutions are limited by two challenges: generalizability and missing local supervision targets. The first challenge stems from the inherent lack of cross-task transferability in existing single-level optimization methods, which limits model performance on new tasks. The latter challenge lies in existing research modeling only global optimal weights, neglecting the supervision signals in local target weights. Furthermore, naively assigning local target weights leads to inconsistency between local and global objectives. To address these issues, we propose Mc-Di, which integrates the diffusion algorithm with meta-learning for better generalizability. Additionally, we extend the vanilla diffusion into a local consistency diffusion algorithm. Our theoretical analysis and experimental results demonstrate that the model can learn from local targets while preserving consistency with the global optimum. We validate Mc-Di's superior accuracy and inference efficiency on tasks that require frequent weight updates, including transfer learning, few-shot learning, domain generalization, and language model fine-tuning.

Abstract:
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency--they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.

Abstract:
We introduce StableMaterials, a novel approach for generating photorealistic physically-based rendering (PBR) materials that integrate semi-supervised learning with Latent Diffusion Models (LDMs). Our method employs adversarial training to distill knowledge from existing large-scale image generation models, minimizing the reliance on annotated data and enhancing the diversity in generation. This distillation approach aligns the distribution of the generated materials with that of image textures from an SDXL model, enabling the generation of novel materials that are not present in the initial training dataset. Furthermore, we employ a diffusion-based refiner model to improve the visual quality of the samples and achieve high-resolution generation. Finally, we distill a latent consistency model for fast generation in just four steps and propose a new tileability technique that removes visual artifacts typically associated with fewer diffusion steps. We detail the architecture and training process of StableMaterials, the integration of semi-supervised training within existing LDM frameworks. Comparative evaluations with state-of-the-art methods show the effectiveness of StableMaterials, highlighting its potential applications in computer graphics and beyond. StableMaterials is publicly available at https://gvecchio.com/stablematerials.

Abstract:
Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized, specialist models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. While specialist models are limited, recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities. while recent general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or preserve identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.

Abstract:
Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.

Abstract:
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.

Abstract:
Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.

Abstract:
Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-channel sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks--text-to-material generation, image-to-material generation, and intrinsic decomposition--within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024x1024 synthesis that substantially surpasses existing approaches in both quality and diversity.

Abstract:
Text-to-image models are commercially valuable assets often distributed under restrictive licenses, but such licenses are enforceable only when violations can be detected. Existing methods require pre-deployment watermarking or internal model access, which are unavailable in commercial API deployments. We present Compositional Semantic Fingerprinting (CSF), the first black-box method for attributing fine-tuned text-to-image models to protected lineages using only query access. CSF treats models as semantic category generators and probes them with compositional underspecified prompts that remain rare under fine-tuning. This gives IP owners an asymmetric advantage: new prompt compositions can be generated after deployment, while attackers must anticipate and suppress a much broader space of fingerprints. Across 6 model families (FLUX, Kandinsky, SD1.5/2.1/3.0/XL) and 13 fine-tuned variants, our Bayesian attribution framework enables controlled-risk lineage decisions, with all variants satisfying the dominance criterion.

Abstract:
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling yields higher-quality outputs on challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt guides subsequent image generation to mitigate the blueprint deficiency. GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing a 14.5% gain in semantic preservation on PIE-Bench over larger-N baselines.

Abstract:
One-dimensional (1D) visual tokenizers offer notable semantic compactness by discarding local spatial priors, and have become increasingly popular for image reconstruction and generation tasks. However, such global and sequential representations struggle to preserve fine-grained visual content; simply increasing network size or token count offers only superficial mitigation. To address this, we introduce VLTok, a novel 1D hybrid tokenizer that unifies visual and language representations in a shared token space through a self-prompted training paradigm. During training, VLTok simultaneously generates 1D visual and textual tokens from images, aligning the textual tokens with embeddings from a pre-trained language model. This cross-modal alignment infuses implicit linguistic cues into the tokenizer, enhancing fine-grained image encoding. At inference, the self-prompted paradigm eliminates the need for external text, maintaining the simplicity of the image-only framework while benefiting from multi-modal guidance. Extensive experiments on the ImageNet benchmark demonstrate that VLTok achieves state-of-the-art performance in both image reconstruction and image generation. For example, under the same model parameter budget, our method yields relative reduction of 11.1% in rFID and 18.7% in gFID compared to GigaTok.

Abstract:
Stochastic human motion prediction aims to forecast future motion distributions. Although recent studies have achieved strong performance in terms of accuracy and diversity, they often overlook plausibility (e.g., resulting in physically unrealistic predictions) and uncertainty quantification, which is essential for real-world applications and downstream tasks. To address these issues, we propose a latent flow-based model equipped with a data-driven Gaussian mixture prior that more effectively disentangles diverse human behaviors than conventional single-modal priors. This prior is derived from patterns in the training data without requiring additional annotations. Furthermore, the fully invertible nature of our model enables natural uncertainty quantification through tractable likelihood computation. Experiments on the Human3.6M and AMASS datasets demonstrate that our approach achieves state-of-the-art performance in both accuracy and plausibility, while also providing reliable uncertainty estimates.

Abstract:
Unaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty estimation to adaptively regulate the fusion of RGB and thermal features, enhancing salient cues in reliable regions while suppressing noisy or inconsistent information. Extensive experiments on five unaligned and three aligned RGB-T SOD benchmarks demonstrate that UMFNet achieves state-of-the-art performance across diverse alignment conditions.

Abstract:
In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Models (MLLM)-guided Reasoning Module that leverages MLLM's semantic and spatial priors to enhance 3D localization and reasoning. To further improve spatial alignment and computational efficiency, we introduce a GeometryAware Frame Selector (GAFS), which adaptively selects the most informative views based on Gaussian and textural cues. Extensive cross-task evaluations (including 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding) demonstrate state-of-the-art performances and strong generalization capability of GenSplat. We will release the codes.

Abstract:
Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward; for example, 3D instance identification often struggles with small, interactable, functional parts (i.e., knobs, handles, etc.). In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach against other methods. Our code and dataset will be made publicly available.

Abstract:
Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information. To address these challenges, we propose Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task.

Abstract:
The boundary representation (B-Rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-Rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-Rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods.

Abstract:
Flow Matching has limited ability in achieving one-step generation due to its reliance on learned curved trajectories. Previous studies have attempted to address this limitation by either modifying the coupling distribution to prevent interpolant intersections or introducing consistency and mean-velocity modeling to promote straight trajectory learning. However, these approaches often suffer from discrete approximation errors, training instability, and convergence difficulties. To tackle these issues, in the present work, we propose Straight Variational Flow Matching (S-VFM), which integrates a variational latent code representing the "generation overview" into the Flow Matching framework. S-VFM explicitly enforces trajectory straightness, ideally producing linear generation paths. The proposed method achieves competitive performance across three challenge benchmarks and demonstrates advantages in both training and inference efficiency compared with existing methods.

Abstract:
Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.

Abstract:
A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations -- capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

Abstract:
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models.

Abstract:
Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent gradient vanishing, we incorporate both positional and appearance error to enhance densification effectiveness. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.

Abstract:
Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light-shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.

Abstract:
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception task. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets.Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.

Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has revealed the limitations of existing benchmarks in evaluating complex reasoning over multiple images. To address this gap, we introduce MIRACLE, a novel benchmark for Multi-Image complex Reasoning And Comprehension Logic Evaluation, featuring 4,000 questions across diverse reasoning types such as visual comparison, temporal sequencing, and spatial relations. MIRACLE emphasizes strong inter-image dependencies through a systematic data collection process, followed by delicate instance grouping and question design that enforce cross-image reasoning.Evaluation on leading MLLMs shows that even top-performing models like Gemini-2.5-Pro achieve only 55.91% points, highlighting the significant challenges of multi-image reasoning. Moreover, in scenarios characterized by high visual information density, such as puzzle tasks and ultra multi-image input conditions, all models exhibit a significant drop in performance, which highlights the limitations of MLLMs in handling complex structural relations and collaborative reasoning, revealing deficiencies in their cognitive capabilities under high-load visual reasoning settings. We hope MIRACLE will inspire the community to push the boundaries of multi-image reasoning. MIRACLE is available at https://huggingface.co/datasets/queyuecanyang/MIRACLE.

Abstract:
Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a Context-aware framework for Robust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structural consistency under substantial view deviation, we incorporate a progressive graph expansion strategy that adaptively grows and prunes Gaussians, leading to more coherent and complete scene reconstructions. Extensive experiments demonstrate that CoRoGS outperforms state-of-the-art 3DGS-based methods, producing higher-quality results. We highlight that CoRoGS robustly handles a wide range of view shifts, including lateral deviations (e.g., lane-level offsets) and cross-level transitions such as from ground-level driving views to elevated perspectives.

Abstract:
Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer which compresses a 3x32^3 LUT vector to 64 discrete tokens with \Delta \text E <2 fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading. Project Page: github.com/martian422/AceTone.

Abstract:
Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

Abstract:
Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision--single-view, multi-view, and moving-camera videos--forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.

Abstract:
The "think-with-image" paradigm has recently gained traction for complex visual reasoning tasks. However, existing approaches often struggle with inference inefficiency due to a fixed number of redundant reasoning steps, as well as training instability. This challenge primarily arises from the direct use of standard reinforcement learning policies, which do not incorporate improvements for the think-with-image multi-turn conversational scenario. To address this challenge, we propose VisionLeaf, an entropy-guided, tree-based reasoning framework. Unlike conventional GRPO, where all nodes expand from the root and each leaf has only a single branch, our method grows the reasoning tree from the leaf nodes and selects the most valuable nodes based on entropy for thorough rollout exploration. This leaf-first expansion naturally aligns with the hierarchical nature of multi-step image analysis. Without modifying any model or training data, our VisionLeaf achieves a 4.2% performance improvement on benchmarks such as VSTAR and HRBench, while reducing the number of inference rounds by nearly half--demonstrating significant gains in both accuracy and speed. All our code will be released.

Abstract:
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input.EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation.Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity.Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.

Abstract:
Multimodal large language models suffer from substantial inference overhead since multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV cache compression methods mostly rely on attention score to reduce cache size, which makes them are incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV cache compression from the perspective of the KV matrices' distribution. First, we observe that frequency-domain energy of multimodal KV matrices is predominantly concentrated in low-frequency and extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy leads to a pronounced performance drop, which we define as Outlier KVs. Considering Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimoal KV compression methods, achieving up to 1.69x faster decoding with 80% lower KV memory usage while maintaining task performance.

Abstract:
3D reconstruction of transparent objects from multiple views has been a long-standing challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortions, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel Network to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refractive visual distortions. To deal with dual-layered transparent and opaque objects, we devise a two-stage hierarchical reconstruction strategy that decouples outer and inner geometry, combined with alpha-blending for transparency-aware surface separation. Experiments show that Opti-NeuS achieves practical effectiveness and outperforms prior works.

Abstract:
The synergistic framework of multimodal large language models (MLLMs) and vision foundation models demonstrates exceptional performance in image understanding tasks, yet encounters severe temporal inconsistency challenges in video segmentation scenarios. Existing methods predominantly rely on MLLMs trained on static images to generate per-frame segmentation prompts, neglecting the physical continuity of video motion. This paper posits that performance limitations in video understanding tasks from inadequate constraints on model output behavior. Consequently, we propose a spatiotemporal co-optimization mechanism that achieves temporally consistent video segmentation solely by constraining MLLM output behavior, eliminating the need for large-scale video pretraining or complex architectural modifications. Our method features two complementary mechanisms: a Brownian bridge loss that models object trajectories as endpoint-constrained Gaussian processes to ensure temporal smoothness, and a geometry-aware prompt quality loss that enforces spatial consistency with target structures. Experiments on referring expression video segmentation and reasoning video segmentation tasks demonstrate that our method significantly surpasses state-of-the-art techniques on the Ref-YouTube-VOS, Ref-DAVIS-2017, MeVIS, A2d-Sentences, JHMDB-Sentences and ReVOS benchmarks. This work establishes that explicit modeling of physical world constraints can unlock the full potential of statically trained foundation models in dynamic visual understanding tasks.

Abstract:
Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on obtaining prototypes from limited normal images, they neglect to systematically incorporate statistics of query image to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process during inference: (1) characteristic transfer from query features to the enhanced prototypes, and (2) anomaly suppression by aligning the enhanced prototypes with their normal counterparts. The characteristic transfer is achieved through linear reconstruction of query features from prototypes with an optimizable transport matrix, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable in characteristic transfer. Therefore, we employ optimal transport to measure while minimize the gap between prototypes and their enhanced counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with three competitive prototype-based FS-IAD methods: PatchCore, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and efficiency of our approach under 1/2/4-shots.

Abstract:
Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.

Abstract:
Robust depth estimation aims to maintain high-quality depths across diverse conditions. However, most existing methods estimate depth without taking into account the object-level information. As a result, the predicted depth may easily deviate within objects and become blurred under adverse conditions. To overcome this weakness, we propose RoSAMDepth, a novel framework that can assist robust self-supervised depth estimation in leveraging rich and diverse object-level priors from the Segment Anything Model (SAM). We focus on incorporating object-level information across three key aspects: a segment-guided representation contrasting method that injects object-level awareness into the feature representation space; an adaptive regional outlier masking strategy combined with a regional Gaussian likelihood loss that enforces regional depth smoothness; and an object-level reliability estimation strategy that mitigates the influence of unreliable supervision. Extensive experiments across multiple datasets and diverse weather conditions demonstrate that our method produces sharper, more accurate depth predictions, consistently outperforming state-of-the-art methods.

Abstract:
Federated Learning (FL) enables collaborative model training without sharing raw data, but client data are often Non-Independent and Identically Distributed (Non-IID), which often slow convergence and degrade global performance. Meanwhile, privacy preservation is also a critical concern in FL. To address these two issues, we propose FedAlign, a differentially private framework that aligns local data distributions via client-side statistical moment alignment. Clients upload perturbed distribution statistics, which the server aggregates to infer global distribution characteristics and guide local alignment, thereby reducing inter-client discrepancies. Experiments and theoretical analysis show that FedAlign accelerates convergence and improves accuracy under Non-IID settings while preserving rigorous privacy guarantees.

Abstract:
In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.

Abstract:
Generating high-fidelity, seamless textures directly on 3D surfaces, a process we term 3D-native texturing, is a fundamental open challenge, promising to overcome the limitations of traditional UV-based and multi-view projection methods. While promising, existing native approaches are bottlenecked by the lack of a powerful and versatile latent representation, severely limiting the fidelity of generated results. In this work, we identify this representation gap as the central roadblock to progress. We introduce Lafite, a framework that resolves this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, Lafite leverages a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space and then decoded into a continuous color field. This novel representation achieves unprecedented fidelity, i.e., yielding a >10 dB PSNR improvement in reconstruction over state-of-the-art methods, by effectively disentangling texture from mesh topology and UV parameterizations. Building on this superior representation, a conditional rectified flow model synthesizes textures with state-of-the-art quality and coherence. Extensive experiments show that Lafite not only sets a new standard for native texturing but also enables flexible downstream applications like editing and material synthesis, paving the way for the next generation of 3D content creation workflows.

Abstract:
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification.However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types, apply different rewriting and verification schemes, respectively. When applied for RFT, we converted 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

Abstract:
Controllable, high-fidelity mesh editing remains a significant challenge in the domain of 3D content creation. Existing generative methods often struggle with complex geometries and fail to preserve fine-scale details. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation based on Poisson Seamless Fusion. We decompose mesh editing into a pipeline that leverages the strengths of 2D image editing and 3D generation models: we first edit a 2D reference image, then generate a 3D mesh corresponding to the edited region, and fuse it seamlessly into the original mesh through a Geometry and Texture Fusion method. We introduce two core techniques: Poisson Geometric Fusion, which utilizes a hybrid SDF/Mesh representation with normal blending to achieve harmonious geometric integration, and Poisson Texture Harmonization for visually consistent texture blending. We demonstrate state-of-the-art structural consistency, geometric fidelity, and texture quality in challenging editing scenarios.

Abstract:
Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.

Abstract:
Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.

Abstract:
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP's text encoder is more suitable for cross-domain tasks, however, we find that removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL, which we call the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant.Based on this understanding, unlike current works that simply remove these layers, we propose a method to teachs the model to re-utilize information in these lost layersat both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. We will release the code.

Abstract:
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity--particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts.To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation.We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.

Abstract:
Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap. Moreover, these approaches reduce the 3D-based problem to 2D domain, neglecting the 3D priors. To address these, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, the RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistency supervision across branches can be applied. Experimental results demonstrate that our method is effective across robot types, suggesting new possibilities of alleviating the data bottleneck in robotics.

Abstract:
Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is significantly slower in practice and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for both memory and compute-efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.

Abstract:
End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of robust driving logic, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task by predicting the residual from a deterministic inertial reference. This inertial reference serves as a stable physical prior, encouraging the model to focus on necessary, context-driven deviations (e.g., traffic rules and obstacles) rather than simple pattern matching. To mitigate the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. This re-weights the objective to prevent large errors at distant, uncertain waypoints from dominating the learning signal. On the NAVSIM v1 and v2 benchmarks, ResAD achieves state-of-the-art results of 88.8 PDMS and 85.5 EPDMS with only two denoising steps, showing that it simplifies learning and improves planning performance.

Abstract:
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at the training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the training performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation, validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x512.

Abstract:
Despite the inherent advantages of thermal infrared(TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, demonstrating improved zero-shot translation performance up to 33% increase averaged across all metrics.

Abstract:
We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

Abstract:
Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.

Abstract:
Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision- language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

Abstract:
Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style-Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.

Abstract:
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic.Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning.VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions.By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.

Abstract:
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating synthetic training data to improve fine-grained classification performance remains challenging. Fine-tuning the T2I model with a few real examples can help generate more appropriate synthetic training data; however, this fine-tuning may also introduce overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (Beyond OBjects) for mitigating these concerns. Given a small set of real examples, we first describe them using class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, thus preserving the T2I model's generative prior and reducing estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets demonstrate state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with 5 real images augmented with 100 synthetic images). Additionally, in three of the four datasets, the fine-tuning downstream models with synthetic data generated from BOB and five real images achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with over 2% accuracy improvements in 14 of these settings.

Abstract:
Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teacher. However, distilling knowledge from large teacher to small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking teacher and reinforcing student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks and non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher from mask to gradually increase the teacher capacity during training. This strategy allows the student to learn richer representations of teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of their responses' transferability from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling the students to achieve strong performance without requiring the think-answer process. Extensive experiments across diverse VLM benchmarks demonstrate that Masters outperforms existing compact VLMs and partially surpasses large ones, while being far more efficient. Moreover, gradually increasing the teacher sizes during distillation (e.g., from 14B to 38B) yields smoother convergence and stronger generalization than one-shot distillation (e.g., 38B), revealing a scalable path toward efficient and deployable VLMs.

Abstract:
Machine unlearning (MU) aims to remove sensitive or undesired content from pre-trained models. Existing MU methods are commonly characterized as gradually degrading model performance on undesired data to realize approximate forgetting. Despite their successes, the effectiveness in multimodal unlearning tasks remains largely unexplored. In this paper, we first conduct an in-depth analysis and reveal that traditional MU methods tend to disrupt cross-modal alignment, leading to incomplete forgetting in multimodal scenarios. To tackle this challenge, we propose VL-Eraser, a novel unlearning paradigm for VLM unlearning. VL-Eraser reformulates unlearning in VLMs as a two-stage process: distillation and deletion. Specifically, VL-Eraser first introduces a vacuum distillation that disentangles undesired knowledge from the intricate parameters of VLMs and transfers it into low-rank adapters (LoRA). After distillation, unlearning is efficiently achieved by deleting the LoRA parameters from the original model. Extensive experiments across multiple benchmarks demonstrate that VL-Eraser achieves superior unlearning performance while preserving utility compared to the state-of-the-art baselines.

Abstract:
Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel vector-wise sparse attention framework that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized vector sparse attention kernel. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65xspeedup over full attention and a 1.83xspeedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention.

Abstract:
The rapid rise of highly realistic AI-generated images necessitates reliable and generalizable detection methods. However, existing methods are constrained by their discriminative nature: by learning a single static decision boundary, they tend to memorize generator-specific artifacts and consequently fail to generalize to the unseen distributions of new generative models. To overcome this limitation, we propose PPM-CLIP, a new framework that shifts from static classification to conditional generative modeling based on the CLIP vision-language model. Instead of learning a fixed decision boundary, a Probabilistic Prompt Modeling (PPM) module is used as a generator that produces an adaptive distribution of prompts according to the input image. This allows the model to flexibly capture novel artifacts, rather than matching them against fixed templates. In addition, to enhance the visual encoder's sensitivity to subtle artifacts, a Patch-Wise Contrastive Learning (PWCL) strategy is introduced. Extensive experiments on Ojha, GenImage, and DRCT benchmarks demonstrate that our generative paradigm significantly outperforms state-of-the-art methods, especially in cross-domain detection.

Abstract:
We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks the improvement loop and reverts to one-shot training. Arcadia delivers consistent gains on navigation and manipulation benchmarks and transfers robustly to physical robots, indicating that a tightly coupled lifecycle: continuous real-world data acquisition, generative simulation update, and shared-representation learning, supports lifelong improvement and end-to-end generalization. We release standardized interfaces enabling reproducible evaluation and cross-model comparison in reusable environments, positioning Arcadia as a scalable foundation for general-purpose embodied agents.

Abstract:
Partial label learning (PLL) is a weakly supervised learning, where each instance is assigned a set of candidate labels and only one is true. However, due to potentially inaccurate annotations, existing PLL algorithms disambiguate labeling by minimizing the prediction loss, which leaves the model unaware of its prediction credibility. To address this issue, this paper proposes the evidential deep partial label learning (ED-PLL) to quantify disambiguation uncertainty, aiming to achieve candidate label disambiguation and reliability prediction. Firstly, we extend the evidence modeling mechanism to PLL, treating the candidate label set as the source of evidence for the label hypothesis, and using belief and credibility to model classification uncertainty, thereby guiding a more reliable disambiguation process. Meanwhile, we propose the expectation calculation under the Dirichlet distribution of non-candidate labels, which suppresses the output of non-candidate labels by using consistency regularization to further improve the accuracy of disambiguation. Furthermore, a conflict-aware regularization is proposed to evaluate the degree of conflict, which measures the consistency between instances within the class by combining the differences in the distribution of prediction results and model uncertainty, and thus improves the robustness of the model. In addition, this paper theoretically analyzes our method from the perspective of the Expectation-Maximization (EM) algorithm, and the ED-PLL is compatible with any deep network or stochastic optimizer. Experiments on benchmark and real datasets verify the effectiveness of the proposed algorithm.

Abstract:
Decoding visual stimuli from electroencephalography (EEG) signals is a crucial step toward practical brain-computer interfaces (BCIs). However, this task requires large-scale and high-quality EEG-image paired datasets. Compared with abundant image data, the limited EEG recordings restrict the decoding models' performance. To address this challenge, we propose EEGiT, a framework that converts sequential EEG signals into image-like EEG patches and enables the direct use of a pretrained Vision Transformer (ViT) as the EEG encoder. To preserve the spatial topology of brain regions and minimize distributional differences across channels, we group EEG electrodes according to anatomical structures and apply linear interpolation along the spatial dimension. We then resample the EEG signals to align the structure of EEG patches with that of image patches in ViT. This design encourages effective transfer of visual priors learned from large-scale image datasets to EEG representation learning. Experiments on the THINGS-EEG and EEG-3D datasets show that fine-tuning pretrained ViTs improves EEG-to-image retrieval and EEG-based visual classification, while maintaining robustness and strong cross-subject generalization. These results demonstrate a promising direction for leveraging powerful vision models to mitigate data scarcity in EEG decoding.

Abstract:
The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets with paired EM signals and descriptive text annotations used for MLLMs pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN is state-of-the-art in the EM-Bench and exhibits remarkable robustness in low-SNR settings.

Abstract:
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.

Abstract:
Understanding the global structure of neural network loss landscapes is important for gaining insight into model merging, hyperparameter selection, generalization, and the relationships between distinct solutions. Visualizing the global structure of loss landscapes is very challenging because of the high dimensionality of the parameter space of neural networks. Prior work has primarily focused on visualizing the loss landscape around one single basin, missing how different minima or solutions relate to each other. We introduce Globscope, a framework that provides a global view of the loss landscape across multiple solutions or basins. Globscope learns a low-dimensional non-linear manifold of model parameters using an autoencoder, enabling both latent-space visualization and reconstruction of full model weights. Then it summarizes the relations and connections among minima on this manifold through topological data analysis. Our framework produces continuous, interpretable visualizations that reveal global connectivity patterns in the landscape. We compare Globscope with kernel-based methods and demonstrate its superior performance in preserving the global structure across diverse solutions. We further show how Globscope can be used to analyze two applications: revealing global low-loss solution pathways between distinct solutions using mode connectivity algorithms, and visualizing permutation symmetries of different solutions using re-basin approaches.

Abstract:
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

Abstract:
Current referring video segmentation methods typically operate in an offline manner, where sparse frames are first selected for image-level referring segmentation, and the resulting masks are then propagated across the video. Although video sampling captures global context, its isolated processing steps not only complicate optimization but also restrict applicability to real-world streaming scenarios. In this paper, we propose a simple but efficient MLLM-based framework StreamingRVOS, which can extend image-level segmentation to video-level via a streaming pipeline without introducing extra parameters. Specifically, we employ a Semantic Embedding Recycling (SER) method to propagate temporal context across frames, enabling the model to perceive semantic representation in the video. Then, we propose an Online Mask Consistency Perception (OMCP) strategy to adaptively invoke the MLLM to re-perceive the current scene and regenerate the semantic embedding. We conduct extensive experiments on multiple downstream datasets to prove the effectiveness of StreamingRVOS. Compared to previous methods, our method achieves excellent performance in referring video segmentation (1B variant improves upon Sa2VA by 19.2 on the MeViS dataset), while operating at an average speed of 7 FPS under streaming inference on 1 xA800 GPU.

Abstract:
Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization.However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precision and trustworthy reasoning.To address these limitations, we propose GeoCoT, a manifold-driven mixture-of-experts (MoE) system with Chain-of-Thought (CoT) reasoning. GeoCoT introduces Mani-MoE, a sparse expert architecture grounded in local manifold mapping. It projects high-dimensional tokens onto low-rank subspaces adaptively to eliminate redundancy and uncover intrinsic structure, and then routes them through a sparse expert pathway, where gating decisions are guided by the manifold structure of the input.To optimize this architecture, we adopt a CoT-driven multi-stage training strategy. It leverages a cold-start phase for domain adaptation, followed by our RS Vision Group Relative Policy Optimization (RSV-GRPO) to systematically strengthen structured reasoning from global to objectives. Furthermore, we innovatively build RS-CoT-20k dataset for task-specific supervision.Extensive experiments on multi-task datasets demonstrate that GeoCoT outperforms prior approaches, achieving 5.27 % higher average accuracy than the state-of-the-art method. Our code will be available.

Abstract:
Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly developed for single-crop prediction, they often struggle to generalize across diverse crop types, without addressing the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling their responses with temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. And the CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

Abstract:
Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world, we introduce Long Goal Bench(LGBench), a 2000-task suite (1000 T2I, 1000 I2I) whose average instruction contains 18---22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find even state-of-the-art commercial APIs satisfy fewer than 72% of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free, vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling plus semantic verification/rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 vs. 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (+7% overall), and ImgEdit (+0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing. The code, benchmark, and evaluation scripts will be released.

Abstract:
We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.

Abstract:
Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

Abstract:
Despite remarkable advancements in multimodal large language models (MLLMs), their fine-grained visual understanding is constrained by a primary reliance on sparse textual supervision. Existing efforts to introduce visual supervision typically do so during post-training, when visual representations have already been largely fixed, causing such signals to act mainly as auxiliary constraints rather than as a primary force for shaping perceptual features. In this paper, we aim to fundamentally reshape the model's perceptual backbone by incorporating vision supervision directly into the pre-training stage. We observe that pixel-level image patches and textual tokens naturally coexist in a shared, raw high-dimensional space characterized by an inherent input symmetry. Leveraging this insight, we propose UVU, a novel vision-language unified autoregressive framework that eschews vector quantization. It uniquely employs continuous visual encoding for lossless representation of visual inputs and proposes a large-scale iterative hierarchical clustering algorithm to construct a pixel-level visual codebook, thereby extending the vocabulary for unified supervision and enabling autoregressive generation of pixel-level image tokens alongside textual tokens. UVU effectively synergizes pixel-level visual perception with semantic-level visual understanding, internalizing visual reconstruction capabilities and unlocking the facilitative role of visual supervision in enhancing understanding in the pre-training stage. Extensive experiments across multiple tasks demonstrate that MLLMs are capable of achieving superior multimodal understanding performance under the supervised learning paradigm of UVU.

Abstract:
Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow (BiFlow), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.

Abstract:
Inverse tone mapping (ITM) becomes significantly harder when the SDR input is produced by local tone mapping, which jointly applies global radiometric compression and spatially varying adaptations that distort dynamic range, contrast, and channel-wise color ratios. Existing ITM methods ignore this degradation structure and either regress HDR values directly or rely on a single-channel gain map, which scale luminance only and cannot restore the compressed dynamic range and wide color gamut.We introduce FastGaMer, a structured and resolution-agnostic ITM framework that explicitly mirrors this degradation process. Instead of regressing HDR values, we reconstruct a color gain map, which preserves per-channel amplification, simplifies learning, and enables proper gamut extension. Local and global degradations are inverted separately using dynamic bilateral grids and learnable 3D LUTs, followed by a lightweight neural modulator for global refinement and coherence. All high-resolution operations are network-free, yielding exceptional efficiency.To support color-GM supervision under realistic local TMO degradations, we create a dataset of over 8,000 4K SDR-GM pairs with an additional real-captured test set. FastGaMer outperforms prior lightweight ITM methods by +1.4 dB PQ-PSNR, reduces runtime by 70%, and processes 4K images in only 6.2 ms, achieving both high accuracy and real-time performance.

Abstract:
Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components -- grasping style and affordance -- and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance.Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.

Abstract:
Human-like multimodal reaction generation is essential for natural group interactions between humans and intelligent embodied AI. However, existing approaches are often limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and the complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score.To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multimodal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

Abstract:
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with average improvements of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.

Abstract:
Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

Abstract:
The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to advance the field. Experiments demonstrate that our 4-step VDOT outperforms or matches the performance of other baselines with 50 denoising steps. Code is available at https://vdot-page.github.io.

Abstract:
Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown.As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data.While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain.Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind.To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model.This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks.

Abstract:
Ensuring fairness in medical vision-language models (VLMs) is essential for equitable healthcare, yet existing models amplify biases across demographic subgroups such as race and gender. Traditional fairness mitigation approaches relying on broad distribution alignment, fall short in addressing these nuanced intersectional disparities. We propose fairness-aware relational prompting (FRP), a novel framework that reformulates prompt generation as a dynamic, fairness-aware reasoning process. FRP constructs a relational graph to capture fine-grained, sample-level similarities and employs a hyperbolic graph layer to explicitly model the hierarchical structure of intersectional identities. Leveraging hyperbolic geometry enables reasoning over complex attribute combinations, effectively reducing entrenched biases. Evaluations on the FairVLMed and Harvard-GF datasets demonstrate that FRP achieves state-of-the-art diagnostic performance, with an area under the curve of 77.50% and 85.94% respectively, while substantially improving the demographic parity difference and equalized odds difference.

Abstract:
Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs a adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling. Code will be released soon.

Abstract:
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. Current VLAs face two challenges: (1) they require extensive datasets annotated with reasoning traces, and (2) these traces greatly increase token counts, inflating training and inference costs. We propose NoRD (No Reasoning for Driving), a data- and inference-efficient VLA that addresses both. Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on atleast <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. Our approach applies Reinforcement Learning (RL) to fine-tune a Supervised fine-tuning (SFT) policy trained on a small, reasoning-free dataset. However, we observe that the standard RL algorithm, Group Relative Policy Optimization (GRPO), fails to yield significant improvements over this data-efficient SFT policy. We find that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this limitation by incorporating Dr.GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM without large datasets, reasoning or additional inputs, enabling scalable, data-efficient training, and fast inference.

Abstract:
Automatic medical report generation from multimodal longitudinal imaging is crucial for clinical diagnosis but remains challenging due to privacy constraints and evolving disease dynamics. While federated learning (FL) enables decentralized model training without data sharing, its extension to longitudinal medical modeling remains underexplored. Existing FL approaches overlook temporal non-stationarity across visits and patient-specific heterogeneity, causing unstable optimization and degraded report quality.We introduce Federated Temporal Adaptation (FTA), a new FL setting for longitudinal medical report generation, and propose FedTAR, a framework combining parameter-efficient personalization and meta-learned temporal aggregation. FedTAR employs a metadata-conditioned LoRA module that generates patient-specific adapters from Gaussian-mixture embeddings and a residual temporal aggregation scheme that adaptively weights client updates via first-order MAML, ensuring stable and efficient optimization under temporal heterogeneity.Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust, privacy-preserving paradigm for federated multimodal longitudinal modeling.

Abstract:
Image abstraction, a fundamental component of non-photorealistic rendering (NPR), aims to simplify photographs into stylized depictions while preserving perceptually important structures. A central difficulty is selectivity: removing fine textures while preserving semantically meaningful boundaries. Existing approaches often expose only a few entangled controls, so the amount of smoothing and the set of preserved structures cannot be adjusted in a clear and intuitive manner. We propose the Semantic Scale Space (SSS), a two-dimensional abstraction framework parameterized by smoothing strength and semantic granularity. SSS externalizes the stopping set by using a controllable semantic boundary detector to specify which structures act as barriers to smoothing, independently of how far simplification proceeds within homogeneous regions. We further introduce Adaptive Granularity Scheduling Smoothing (AGSS), a concrete traversal policy in this space that combines a donor-gated diffusion operator with a fine-to-coarse granularity schedule. To evaluate selectivity fairly, we also introduce an effect-matched protocol based on a Region Homogeneity Index, which compares methods under matched smoothing levels. On the SBD benchmark, AGSS achieves higher boundary preservation and lower geometric drift than strong baselines at the same degree of smoothing. On DIV2K, a user study further shows that downstream NPR results built on AGSS-based abstractions are consistently preferred. These results demonstrate that SSS and AGSS provide a practical framework and method for controllable image abstraction in modern creative applications.

Abstract:
Accurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has not been fully utilized. In this work, we propose Radar Prior Guided Fusion(RPGFusion), a practical 4D radar-camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling while preventing the uneven BEV feature distribution (near-dense, far-sparse) caused by Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in point clouds, we adopt a hybrid robust encoding and sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconcile geometric and semantic differences between modalities, yielding more consistent and complementary BEV representations. Extensive experiments on the public View-of-Delft and TJ4DRadSet show that RPGFusion outperforms prior radar-camera fusion methods, achieving SOTA performance. Our work not only uses 4D radar signals to guide image BEV queries, but also enables robust radar feature encoding and densification for 3D perception, demonstrating the strong potential of 4D radar.

Abstract:
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we aim to delve into this phenomenon for an interpretation and a solution.By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss (L_ vlm ) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL.However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce L_ vlm without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance.Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. We will release the code.

Abstract:
3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

Abstract:
Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.

Abstract:
Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality.Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms.Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network , before an RL policy adaptively fuses the full (p, v, q) state.Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy-efficiency-memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to 1.77xfaster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.

Abstract:
Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods.

Abstract:
Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other's goal-directed action--for instance, pushing the iPad to the table's edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm's subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.

Abstract:
Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.

Abstract:
A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate MSPT on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.

Abstract:
Event-based motion deblurring has attracted increasing attention, as the high temporal resolution of event cameras provides motion cues unavailable to conventional RGB sensors, thereby enabling more effective deblurring. In real-world scenes, motion blur is often complex and nonlinear, with different regions exhibiting diverse motion speeds and directions. However, most existing approaches rely on handcrafted event representations that overlook such spatiotemporal motion heterogeneity, resulting in suboptimal deblurring performance. To address this limitation, we propose a learnable 3D Gaussian event representation module that adaptively selects key spatiotemporal coordinates for deblurring based on the blurred image content and the event density distribution. The event stream is then aggregated with 3D Gaussian weighting kernels to extract local motion features that are sensitive to motion direction and speed. Furthermore, to fully exploit the motion information encoded in the event representation, we adopt a two-stage fusion strategy. In the first stage, local motion features are used to enhance fine detail restoration. In the second stage, a bidirectional attention fusion module leverages one-dimensional Gaussian-weighted event frames to correct global spatial misalignment, leading to more accurate structural alignment. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method and show that it consistently outperforms state-of-the-art approaches.

Abstract:
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities, yet their expanded modality space introduces new compositional safety risks that emerge from complex text-image interactions.Such cross-modal couplings can produce unsafe semantics even when individual inputs are benign, exposing the fragile safety awareness of current MLLMs.While recent works enhance safety by guiding models to reason about potential risks, unregulated reasoning traces may compromise alignment; although Group Relative Policy Optimization (GRPO) offers self-rewarded refinement without human supervision, it lacks verifiable signals for reasoning safety.To address this, we propose SafeGRPO a self-rewarded multimodal safety alignment framework that integrates rule-governed reward construction into GRPO, enabling interpretable and verifiable optimization of reasoning safety. Built upon the constructed SafeTag-VL-3K dataset with explicit visual, textual, and combined safety tags, SafeGRPO performs step-guided safety thinking to enforce structured reasoning and behavior alignment, substantially improving multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.

Abstract:
Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization.This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

Abstract:
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

Abstract:
We present Grid Distillation, a generative dataset distillation framework that compresses large-scale datasets into a compact set of informative synthetic samples. Our method constructs high-resolution compositional grids via spectral submodular optimization, which injects world knowledge from CLIP representations to maximize semantic coverage and diversity. These grids are then downsampled into low-resolution distilled images optimized for diversity and representational efficiency. During training, a single-step diffusion reconstruction (based on Stable Diffusion Turbo) restores fine-grained spatial details from diffusion priors, bridging the gap between compact representations and natural image statistics. A grid-aware cropping strategy further enhances discriminability by probabilistically aligning crops with grid boundaries, maintaining compatibility with standard 224 x 224 inference inputs. Experiments on ImageWoof, ImageNette, ImageIDC, and ImageNet-1K demonstrate consistent improvements over existing dataset distillation methods across multiple IPC settings.

Abstract:
Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training.The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget.The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. In experiments on the EgoBody3M benchmark, EgoPoseFormer v2 outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%, respectively. Furthermore, our auto-labeling system additionally improves the wrist MPJPE by 13.1%.

Abstract:
Large Vision-Language Models (VLMs) are increasingly deployed across diverse applications, such as AI agents, yet remain vulnerable to targeted adversarial attacks. However, the practical robustness of such attacks often remains unclear with limited evaluation under defenses. Diffusion-based purification (DBP), a widely adopted black-box defense, effectively blocks current attacks by removing adversarial perturbations via generative diffusion. Prior DBP evasion attacks target white-box image classifiers and are ill-suited to VLMs, incurring high computational costs and gradient instability when adapted. In this paper, we present PureProof, a black-box targeted attack on VLMs resilient to DBP. It consists of three core components. Stochastic Reverse Alignment (SRA) guides adversarial optimization via single-step reverse prediction, avoiding costly full-trajectory backpropagation. Adaptive Re-noising Augmentation (ARA) mitigates diffusion stochasticity through timestep-adaptive re-noising. Self-Consistency Regularization (SCR) stabilizes optimization by promoting local temporal coherence. Extensive experiments on open-source and commercial VLMs show that PureProof consistently outperforms prior attacks against DBP and achieves strong noise resilience, revealing critical vulnerabilities in VLMs and offering broader safety implications for real-world VLM deployments, including emerging agentic settings.

Abstract:
Reconstructing a physiologically plausible cardiac video from a single image remains a fundamental challenge in generative modeling, owing to the complex and nonlinear periodic dynamics of echocardiography. Previous image-to-video (I2V) approaches primarily focus on temporal continuity, yet often struggle to capture the intrinsic periodicity of cardiac motion, leading to limited temporal coherence and semantic consistency. We present EchoVDiff, a novel phase-aware diffusion model that reconstructs a full cardiac cycle from any single frame. Instead of direct pixel synthesis, EchoVDiff integrates physiological priors into a diffusion paradigm, learning interpretable mappings between cardiac phase, anatomy, and motion. By jointly modeling temporal rhythm and spatial semantics within a disentangled latent space, it achieves controllable and physiologically consistent generation. Extensive experiments on EchoNet-Dynamic and EchoNet-Pediatric demonstrate that EchoVDiff consistently surpasses state-of-the-art methods in both fidelity and temporal coherence. Remarkably, it enables accurate reconstruction of complete cardiac cycles from arbitrary phases, marking the first demonstration of single-frame-driven echocardiographic video generation.

Abstract:
This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Unlike existing works which mainly tend to generate physically unrealistic motions due to limited contact modeling, typically restricted to hands, in this paper, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics. Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.

Abstract:
Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient multi-domain adaptation by communicating compact class prototypes, but directly sharing them poses privacy risks. A common defense involves per-example l_2 clipping before prototype computation to bound sensitivity, followed by isotropic Gaussian noise to enforce Local Differential Privacy (LDP). However, Isotropic Gaussian Prototype Perturbation (IGPP) typically over-perturbs discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. In this paper, we propose VPDR, a client-side privacy plug-in that seamlessly integrates into existing ProtoPFLs. Motivated by the observation that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which allocates less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further develop Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise mechanism provides privacy guarantees no weaker than the isotropic baseline under the same privacy constraints. Extensive experiments on multi-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning without sacrificing robustness against realistic attacks.

Abstract:
Event keypoint detection has garnered significant attention due to its ability to establish spatial associations through matching, which is fundamental for various computer vision tasks. However, achieving robust event keypoint detection remains challenging due to the difficulty in balancing the exploitation of event data and compatibility with established algorithms. Moreover, the limited use of co-visible information often results in excessive keypoint detection in non-matching regions, leading to incorrect matches. To address these challenges, we propose a novel Co-visible Focused 3D-guided 2D Event Keypoint Detection Network (EV-CGNet), which mainly consists of a 3D-guided 2D feature prototype learning (G2PL) module and a co-visible region-focused detector and descriptor learning (CDDL) module. The proposed method enjoys several merits. First, the proposed G2PL module can leverage fine-grained spatio-temporal cues from event points to guide the learning of event frame feature prototypes. Second, the proposed CDDL module can focus keypoint detection on co-visible regions, ensuring accurate matches. Comprehensive experimental evaluations on six challenging benchmarks show that our method significantly outperforms state-of-the-art event keypoint detection methods.

Abstract:
A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.

Abstract:
Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose CAGS, a Confidence-Guided Multi-Scale Aggregation that reconstructs scenes through a coarse-to-fine hierarchical optimization process. Our approach employs a matching-based weighting aggregation strategy to anchor high-resolution reconstructions to stabilize structural priors and filtering noise through cross-scale consistency, and a multi-scale pseudo-view regularization to refine local details without amplifying noise. Extensive experiments on the LLFF and Mip-NeRF360 datasets demonstrate that CAGS significantly outperforms existing methods, particularly under demanding high-resolution conditions. Moreover, our paradigm can be seamlessly integrated into other 3DGS-based pipelines, thereby extending the field from low-resolution reconstructions to high-fidelity outputs under real-world sparse-view constraints.

Abstract:
We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.

Abstract:
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences.Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

Abstract:
We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a programmable narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed pixel-wise exposure, thereby enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame's exposure, Lumosaic actively synchronizes illumination and sub-frame exposures to improve photon utilization and preserve spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic benchmarks and real-world dynamic scenes demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.

Abstract:
Online 3D instance segmentation is a critical capability for embodied agents navigating in dynamic environments. However, a fundamental challenge remains in adapting powerful 2D foundation models, like SAM, to 3D online segmentation. Naively lifting SAM's 2D masks to 3D results in severe spatial fragmentation, where a single object is shattered into multiple disconnected parts, especially under occlusion. Subsequent attempts to link these fragments over time via conventional 3D IoU-based tracking prove highly fragile: they struggle to handle occlusions or topological changes, ultimately causing catastrophic identity drift. Departing from such post-processing approaches, we reframe online segmentation as a learnable composition problem. We introduce SAMosaic3D, a differentiable framework that treats SAM-derived masks as "mosaic tiles" and learns to assemble them into temporally consistent 3D instances. SAMosaic3D comprises two key components: Fragment-to-Instance Adaptive Assembly that aggregates fragments through soft-gated attention, and Instance-to-Scene Online Merging that employs cascaded semantic-geometric matching to preserve object identities through learned short-term association and geometric verification for long-term re-identification. Evaluations on ScanNet, ScanNet200, SceneNN and 3RScan datasets demonstrate state-of-the-art performance and zero-shot cross-dataset generalization. Extensive ablation studies validate the effectiveness of the designed modules.

Abstract:
Despite significant progress in 4D generation, rig and motion--the core structural and dynamic components of animation--are typically modeled as separate problems. Existing pipelines rely on ground-truth skeletons and skinning weights for motion generation and treat auto-rigging as an independent process, undermining scalability and interpretability. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without any human-provided rig annotations. RigMo encodes per-vertex deformations into two compact latent spaces: a rig latent that decodes into explicit Gaussian bones and skinning weights, and a motion latent that produces time-varying SE(3) transformations. Together, these outputs define an animatable mesh with explicit structure and coherent motion, enabling feed-forward rig and motion inference for deformable objects. Beyond unified rig-motion discovery, we introduce a Motion-DiT model operating in RigMo's latent space and demonstrate that these structure-aware latents can naturally support downstream motion generation tasks. Experiments on DeformingThings4D, Objaverse-XL, and TrueBones demonstrate that RigMo learns smooth, interpretable, and physically plausible rigs, while achieving superior reconstruction and category-level generalization compared to existing auto-rigging and deformation baselines. RigMo establishes a new paradigm for unified, structure-aware, and scalable dynamic 3D modeling

Abstract:
Hand mesh reconstruction has attracted growing attention in recent years.Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency.In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference.Our method represents a 3D hand model using M discrete tokens, each describing a specific sub-structure of the hand.This compositional representation enables efficient modeling with minimal reconstruction error.Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task.Specifically, a classifier predicts the categories of the M tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing.Extensive experiments demonstrate that TokenHand achieves comparable or superior performance to existing methods across standard benchmarks, while maintaining high efficiency in practical scenarios.

Abstract:
Rehearsal-based methods are the cornerstone of modern online class-incremental learning (OCIL), yet they face a fundamental challenge: the gradient of the current task often conflicts with that of the rehearsal data from the memory buffer, leading to catastrophic forgetting. Recent works have implicitly addressed this by using hypergradients, but the underlying mechanism has remained poorly understood. In this paper, we provide a formal analysis revealing that hypergradients mitigate forgetting by aligning task-specific gradients towards a common meta-objective, thereby reducing their conflict. However, we argue that this conflict-reducing alignment is inherently myopic--it only considers the immediate gradient directions, failing to account for the loss landscape geometry one step ahead. To overcome this limitation, we introduce a novel framework: Lookahead Optimization for Rehearsal (LOR). LOR explores a set of future model states by taking lookahead steps along different directions that balance plasticity and stability and optimizes a first-order Log-Sum-Exp (LSE) surrogate to emphasize the worst-performing sampled lookahead directions. Theoretical analysis from both optimization and statistical perspectives corroborates the robustness of our approach. Extensive experiments on Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet demonstrate that LOR significantly outperforms state-of-the-art methods, establishing a new and more robust paradigm for rehearsal-based OCIL.

Abstract:
Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98% single-edit success, remains over 95% after 1,000 sequential edits, lowers hallucination by 3-5%, and achieves the best backward transfer (BWT) scores on continual instruction-tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.

Abstract:
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize for synthesis quality under reduced compute, yet they often ignore the model's latent discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while markedly improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision without retraining. Across DiT- and U-Net-based backbones and multiple datasets of ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our proposed operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining, sometimes even enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging ratio on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while also improving FID for generation by 0.34 (1.85%). Our comprehensive analyses indicate that balanced spectral retention, preserving high-frequency detail alongside low/mid-frequency semantic content is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment of dual-purpose generative systems.

Abstract:
Document understanding is a long-standing practical task. Vision-Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving SOTA results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.

Abstract:
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications.Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens.Existing researches usually assume that all attention heads contribute equally to the visual interpretation.However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing.In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens.By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones.The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs.Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models.

Abstract:
Federated Graph Learning (FGL) offers a privacy-preserving paradigm for collaborative training on graph data, yet significant topological heterogeneity poses a critical threat to generalization fairness, often yielding a global model dominated by a subset of clients. This introduces two critical issues: at the global level, aggregation bias disproportionately amplifies the influence of dominant clients, while at the local level, blind optimization results in inefficient and inequitable training processes. To address these challenges, we propose FedSST, an adaptive fairness framework. FedSST introduces a fair, structure-based signal to quantify client contributions, which in turn guides fair aggregation and adaptive local training. Extensive experiments across diverse cross-domain and cross-dataset settings demonstrate that FedSST enhances generalization fairness and overall model performance, outperforming various state-of-the-art methods.

Abstract:
Blind 360deg image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360deg content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view-then-rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL-ScanIQA, a reinforcement-learned framework for blind 360deg IQA. RL-ScanIQA optimize a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross-dataset robustness using distortion-space augmentation together with rank-consistent losses that preserve intra-image and inter-image quality orderings. Extensive experiments on three benchmarks show that RL-ScanIQA achieves superior in-dataset performance and cross-dataset generalization.

Abstract:
The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that there exists a crucial issue of textual ambiguity in OVRSIS task, limiting final segmentation performance. Therefore, we propose a plug-and-play yet effective Test-time Multi-Prompt Adaptation (TMPA) method to mitigate textual ambiguity in OVRSIS. Specifically, TMPA first generates diverse, context-aware descriptions for each category instead of the naive class name by executing a large language model with a task-driven prompt, which can effectively avoid some textual ambiguity, i.e., background class has different meanings in various tasks. Furthermore, TMPA develops a visual-guided test-time adaptation strategy for the generated multi-prompts, which adaptively refines the prompt representations of each category with high-confidence visual features for the uncertain predictions with high entropy, making TMPA better applicable to different scenarios. Particularly, a pixel-level loss with entropy minimization is proposed to optimize the text prompt with a bias during inference, where prompt bias is constructed based on a weighted combination of high-confidence visual features. Our TMPA can be flexibly integrated into existing methods for boosting their performance. Extensive experiments are conducted on 17 remote sensing datasets, and the results show our TMPA can significantly improve its counterparts, while achieving state-of-the-art performance.

Abstract:
In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13x to 5.68x and 7.09x. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

Abstract:
Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

Abstract:
Natural videos exhibit heterogeneous temporal dynamics, with certain segments undergoing high-dynamic scene transitions and others dominated by low-dynamic visual changes. However, treating all frames identically, a common practice in most MLLMs, leads to redundant visual encoding, which results in significant computational overhead. The recent state-of-the-art model, i.e., Qwen2.5-VL, adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it encounters a visual confusion problem under high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics leveraging visual variation. FlexiVideo first employs an adaptive temporal segmentation module to estimate inter-frame differences, grouping consecutive frames into scene segments with subtle visual changes. Subsequently, a dynamical spatio-temporal embedding module adjusts the temporal window for scene-level encoding. By restructuring scene-level visual representations within a structured temporal organization, our approach models dynamics more effectively and reduces the encoding burden while preserving fine-grained visual variations. Extensive experiments show that FlexiVideo-3B consistently outperforms Qwen2.5-VL-3B across 6 general video benchmarks. Notably, when evaluated on MotionBench at 10 FPS, FlexiVideo-3B reduces visual tokens by 43.5% compared with Qwen2.5-VL-3B while achieving a 1.3% performance gain, striking a significantly better balance between efficiency and effectiveness. Code and checkpoints are available at github.com/AI9Stars/FlexiVideo.

Abstract:
Generalized Category Discovery (GCD) aims to transfer knowledge from known categories to automatically discover new, unseen ones while preserving recognition of the known classes. Despite recent progress, existing GCD approaches typically assume that all data are drawn from the same distribution, which is rarely valid in real-world scenarios. In practice, data often experience simultaneous domain shifts and novel category emergence, causing severe performance degradation of existing systems. To address this challenge, we propose CausalGCD, a causality-inspired framework designed to mitigate domain-shift bias in category discovery. Specifically, we first analyze the causal graph to uncover the relationships among key variables in cross-domain GCD. We then introduce the concept of causal dependency risk and propose a Causal Dependency Risk Estimator to capture causal semantics, further deriving a theoretically computable upper bound to optimize this risk under cross-domain GCD settings. Furthermore, we propose a Causal Geometric Manifold Constraint that enforces invariant manifold-level associations between known and unknown categories across domains, thereby facilitating robust discovery of novel classes. Extensive experiments on the SSB-C and DomainNet benchmarks demonstrate the effectiveness of CausalGCD and highlight the significance of causal reasoning in open-world category discovery.

Abstract:
Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes--their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.

Abstract:
Text-to-image diffusion models have achieved remarkable progress, generating visually realistic and semantically coherent images from textual prompts. However, natural language alone lacks the precision required for design-centric applications that demand strict spatial and structural fidelity--particularly when representing complex concepts that integrate multi-level information, such as product or scene design. To address this limitation, controllable diffusion frameworks introduce auxiliary conditions (e.g., depth, edge, or reference images) to guide the generative process. Models like ControlNet and IP-Adapter effectively inject such priors, improving structural or appearance alignment. Yet, real-world design tasks rarely depend on a single type of condition. They often require simultaneous integration of multiple heterogeneous cues--for instance, preserving spatial layout from depth maps, structural outlines from edge maps, and stylistic attributes from reference images. Current approaches either handle only one condition or naively stack multiple ones, resulting in computational inefficiency and conflicting guidance that degrade generation quality. This multi-condition inconsistency forms a critical bottleneck for applying diffusion models to real-world design workflows, motivating our proposed framework. We propose a data-driven adaptive condition fusion mechanism for multi-conditional diffusion. Our method introduces a novel condition adaptation module that dynamically selects and fuses subsets of conditions based on the diffusion timestep, task characteristics, and feature injection position. This adaptive strategy harmonizes diverse structural and appearance priors, achieving controllable yet flexible generation in complex design scenarios. Experiments demonstrate significant improvements in fidelity, consistency, and controllability across multi-condition tasks, establishing a new direction for practical, detail-preserving diffusion-based design generation.

Abstract:
Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

Abstract:
Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features.To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance. Our codes will be released.

Abstract:
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has--the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling(RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.

Abstract:
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization.

Abstract:
Humans intuitively rely on text and symbols inscribed on objects (e.g. "PULL", "Squeeze and Turn") to perform tasks safely and correctly. In contrast, vision-language-action models excel at following external language commands, but remain largely unaware of this object-centric information. This capability is essential for reliable robotic operation, yet progress remains unmeasured due to the absence of standardized benchmarks. To address this gap, we introduce INSIGHT Bench, a benchmark that formalizes the task of "in-situ guide grounding". INSIGHT Bench provides a comprehensive taxonomy that evaluates how agents utilize diverse guide information, including action-direction cues and procedural instructions. It also includes a scalable simulation framework that procedurally generates tasks and programmatically links each visual guide to its corresponding physical constraint. We release both the benchmark and the resulting trajectory dataset to support future research. Our evaluation of recent VLA models reveals a critical limitation: their ability to ground in-situ guides is inconsistent and strongly dependent on the type of information. While models succeed on some guide categories, they frequently fail on others. However, performance improves substantially when the same information is provided as language instructions, indicating that in-situ guides could contribute to manipulation performance if VLAs were capable of interpreting them. These findings underscore the need for further research on understanding and grounding in-situ guides.

Abstract:
How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.

Abstract:
Mixture-of-Experts (MoE) models expand capacity via sparse expert activation, but routing logits can misalign with expert structure (unstable routing, underutilization) and load imbalance can create stragglers. Auxiliary load-balancing losses reduce disparity but often weaken specialization and downstream accuracy. We propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and routes with an Eigenbasis Score (cosine similarity between token features and an expert basis) instead of learned gating logits. By tying assignments to each expert's representation space, ERMoE stabilizes utilization, improves interpretability, and removes explicit balancing losses and their gradient interference. ERMoE reaches state-of-the-art results on ImageNet and image-text retrieval (COCO, Flickr30K) with flatter expert loads. A 3D MRI variant (ERMoE-ba) improves brain age prediction by over 7% and yields anatomically interpretable expert specializations.

Abstract:
Recent research has investigated the use of large language models (LLMs) to generate traffic scenarios for autonomous driving. However, pretrained LLMs often fail to align with real-world traffic distributions. In this work, we present TrafficAlign, an automated framework that synthesizes traffic scenarios based on real-world driving videos, performs data validation, and aligns LLMs with the synthesized scenarios. The evaluation shows that traffic scenarios generated by TrafficAlign are highly effective, revealing up to 10.8% more collisions on average across three autonomous driving models than state-of-the-art methods. Furthermore, fine-tuning these driving models with TrafficAlign-generated scenarios significantly reduced collision rates by 36.1% compared with the original models. A qualitative study using traffic datasets from six geographically diverse regions shows that TrafficAlign-generated scenarios exhibit strong alignment with corresponding traffic distributions in these regions.

Abstract:
Modern industrial quality control heavily relies on automated anomaly detection. While few-shot anomaly detection addresses the challenge of limited labeled data, real-world inspection faces a vast diversity of anomaly types, sizes, and shapes. We identify the primary cause for the anomaly detection difficulty as the progressive loss of detect cues as they pass through deep feature extraction pipelines. To counteract the defect cue fading, we propose a Defect Cue-Preserved Structural Feature Refinement model, referred to as DCP-SFR. Recognizing that early-stage cues are paramount, we design a conditional anomaly cue amplification module to produce an initial anomaly score map, which is then enhanced to increase the contrast between anomalous and normal regions. The amplified cues is subsequently used for reconstruction-based anomaly localization, by anchoring attention on true anomaly regions to preserve spatial integrity and prevent drift. Further, we incorporate a structure-aware segmentation refinement stage to improve anomaly segmentation in terms of edge alignment, thereby significantly improve boundary accuracy. On the MVTec AD and VisA benchmarks, DCP-SFR achieves state-of-the-art performance, with an image-level AUROC of 97.3% and a pixel-level AUROC of 98.2%, demonstrating strong cross-domain generalization performance.

Abstract:
Although current watermarking techniques for 3D Gaussian Splatting (3DGS) are promising in protecting the copyrights of both 3DGS models and their rendered images, they greatly suffer from low watermark robustness and poor rendering quality when applying quantization to large 3DGS models to accommodate resource-limited devices. To address these problems, this paper introduces a novel two-stage quantization-aware 3DGS watermarking approach called Robust3DGSW. By properly embedding watermarks into the mid-frequency bands of both the 3D Gaussian parameters and 2D rendered images, the first stage of Robust3DGSW can effectively counteract the quantization-induced signal loss and mitigate the adverse effects of watermarks on rendered images. In the second stage, Robust3DGSW trains both 2D and 3D decoders using our proposed multi-scale adversarial perturbation approach, alongside a gradual quantization process, which enables robust watermark extraction even under excessive quantization. Comprehensive experimental results obtained from the well-known Blender, LLFF, and MipNeRF-360 datasets demonstrate that, when compared to leading 3DGS watermarking techniques, Robust3DGSW not only mitigates the negative effects of quantization on watermarks but also enables fast rendering with high quality.

Abstract:
Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure designed to mitigate partial observability and improve closed-loop consistency in sim-to-real RL. Trained entirely on synthetic simulation data, the resulting policy achieves robust zero-shot performance across diverse articulated objects--including multiple door types--and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation from pure RGB perception.

Abstract:
Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.

Abstract:
Long-Tail Class Incremental Learning (LTCIL) combines two fundamental challenges: catastrophic forgetting of past tasks and severe class imbalance. Existing approaches mitigate one challenge at a time, through rehearsal, reweighting, or classifier alignment, but they typically assume static priors and rely on multi-stage training. In contrast, we propose AdaPrior, a simple Bayesian framework that treats LTCIL as a problem of dynamic prior misalignment. Our key idea is to estimate model-induced priors online via an exponential moving average and use them for (i) debiasing during training (AdaPrior Loss), and (ii) lightweight post-hoc correction at inference. The combined approach unifies loss-level and inference-level debiasing without additional stages or heavy computation. We provide theoretical analysis showing that AdaPrior's prior estimator converges to the true model prior and that its logit adjustment yields well calibrated posteriors under mild assumptions. Extensive experiments on CIFAR100-LT, Food-101-LT, ImageNet-LT-subset, and iNaturalist18-subset demonstrate consistent gains over recent LTCIL baselines. Beyond accuracy, AdaPrior improves calibration, and forgetting curves, making it a practical and scalable solution for long-tail continual learning.

Abstract:
As AI generative models evolve at unprecedented speed, image attribution has become a moving target. New diffusion, adversarial and autoregressive generators appear almost monthly, making existing watermark, classifier and inversion methods obsolete upon release. The core problem lies not in model recognition, but in the inability to adapt attribution itself.We introduce IncreFA, a framework that redefines attribution as a structured incremental learning problem, allowing the system to learn continuously as new generative models emerge. IncreFA departs from conventional incremental learning by exploiting the hierarchical relationships among generative architectures and coupling them with continual adaptation. It integrates two mutually reinforcing mechanisms: (1) Hierarchical Constraints, which encode architectural hierarchies through learnable orthogonal priors to disentangle family-level invariants from model-specific idiosyncrasies; and (2) a Latent Memory Bank, which replays compact latent exemplars and mixes them to generate pseudo-unseen samples, stabilising representation drift and enhancing open-set awareness.On the newly constructed Incremental Attribution Benchmark (IABench) covering 28 generative models released between 2022 and 2025, IncreFA achieves state-of-the-art attribution accuracy and 98.9% unseen detection under a temporally ordered open-set protocol.

Abstract:
Scene Graph Generation (SGG) aims to structurally represent visual scenes by detecting objects and their pairwise relationships. Despite significant progress, current models encode visual knowledge with ambiguous visual context and logically inferred implicit relations due to their purely neural, pipeline-based nature. This limitation underscores the need to advance beyond identifying what relations exist to explaining why they exist and how they can be compositionally reasoned about through logical rule chaining. To address these challenges, we introduce NeuroRule, the first Neurally-Guided Rule Induction Network that integrates Mask2Former pixel-precise visual understanding with a differentiable rule induction engine. Our proposed method enables automatic learning of compositional logical rules directly from visual data while providing transparent explanations for relational predictions. NeuroRule introduces three key innovations: (1) a neural-symbolic bridge that maps visual features to probabilistic symbolic representations; (2) a differentiable rule-learning mechanism that automatically discovers interpretable first-order logic rules without manual engineering; and (3) a compositional chain rule system that enables complex inference while propagating confidence scores through an end-to-end trainable pipeline. Extensive experiments on the benchmark datasets, including Visual Genome (VG), Panoptic Scene Graph (PSG), and OpenPSG, demonstrate that NeuroRule achieves state-of-the-art performance. Our method significantly improves few-shot relation extraction while maintaining full interpretability in its rule-based explanations.

Abstract:
Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.

Abstract:
Deep Neural Networks (DNNs) have revolutionized numerous industries, yet their decision-making processes remain largely opaque. Most existing explanation methods visualize the importance of image regions that influence a classifier's decisions, but they predominantly focus on identifying regions with positive contributions, often overlooking those with negative impacts. In this paper, we introduce a novel black-box explanation method, the Metropolis-Hastings Explainer (MHE), designed to provide confidence-faithful explanations. MHE enhances the fidelity of explanations by ensuring that the explained regions closely align with the original confidence score, sampling instances that best match the classifier's confidence. Furthermore, MHE improves sampling efficiency by utilizing existing valid samples to explore more potential valid ones, reducing computational overhead. To enhance the clarity of explanations, MHE prioritizes valid samples with smaller areas when other factors are equal, thereby reducing the explanation area. Building upon the MHE framework, we propose two extensions: MHE-e, which focuses exclusively on regions with positive contributions, and MHE-pro, which refines explanation quality by integrating multi-scale information. MHE-pro progressively regions, optimizing both sampling efficiency and explanation quality. Experimental results demonstrate that MHE delivers superior and stable explanation quality across various models, including ResNet50, VGG16, ViT, DINO, and CLIP, on datasets such as ImageNet, CUB-200-2011, and VOC2012, providing explanations that closely approximate the original classification confidence.

Abstract:
Open-vocabulary semantic segmentation extends pixel-level recognition to arbitrary text-described categories. Despite strong global semantic understanding, vision-language models such as CLIP exhibit limited spatial precision and semantic ambiguity across large vocabularies, constraining their effectiveness for dense prediction. We present S2C2Seg, a training-free framework that integrates with existing methods through Category Subset Selection (CSS) and Consistent Semantic Guidance (CSG). CSS employs three complementary scoring functions to filter category candidates: CLIP-based global semantic similarity, spatial presence from dense prediction models, and multi-view consistency via alignment and conditional entropy. This joint exploitation of semantic, spatial, and consistency cues reduces category redundancy and semantic ambiguity. CSG adaptively fuses CLIP global features with local spatial predictions through category-specific confidence weighting, applying stronger regularization to high-similarity categories for correcting prediction biases while preserving spatial precision for low-confidence categories. Extensive experiments across eight benchmarks demonstrate broad applicability: when integrated with SCLIP, ProxyCLIP, and CorrCLIP, S2C2Seg achieves consistent improvements of 3.4 to 9.7 percentage points in mIoU, establishing a new state-of-the-art of 51.2% average mIoU.

Abstract:
Single-image reflection removal (SIRR) is a highly ill-posed and computationally demanding problem. Existing CNN or Transformer-based methods often rely on large receptive fields and heavy computation, limiting their deployment on resource-constrained devices. To address this, we propose LightRR, a lightweight yet effective reflection removal network that unifies a wavelet-based mechanism and State Space Modeling (SSM).Specifically, we introduce an Asymmetric Frequency Mamba Block (AFM), which leverages the Discrete Wavelet Transform (DWT) to decompose features into low- and high-frequency components. This allows for targeted modeling of frequency-specific dependencies via Mamba-based state space dynamics.This design not only captures long-range context efficiently but also reduces spatial resolution and computation while preserving critical details.Furthermore, a knowledge distillation-enhanced encoder allows the network to inherit the representational power of large pre-trained models during training, enabling lightweight inference.Extensive experiments on multiple real-world benchmarks demonstrate that LightRR achieves performance comparable to state-of-the-art methods, while using only 3.01% of the parameters and 5.22% of the FLOPs (vs. RDNet), highlighting its superior balance between accuracy and efficiency.

Abstract:
Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis -- small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement -- short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieved memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and readily attaches to different streaming MLLM backbones. Experiments across multiple benchmarks and backbones show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with bounded token cost and low request delay.

Abstract:
Zero-Shot Temporal Action Detection (ZSTAD) aims to localize and recognize action instances from unseen action categories in untrimmed videos. Although existing methods have shown effectiveness by advancing architectural text-video alignment, they still struggle with capturing semantic distinctions between action classes, resulting in text-irrelevant predictions.To address this issue, we propose a Text-Foreground Concentrated Alignment for zero-shot temporal action DEtector (TF-CADE) that explicitly aligns textual information with action-relevant foreground regions.Specifically, we introduce Action Concentrate Aggregation (ACA), which extracts action concentrate scores to aggregate temporally informative video segments into a foreground-weighted video embedding.This foreground concentrated alignment enhances the semantic consistency between text and video features and improves inter-class discriminability.In addition, a Certainty-based Confidence Re-weighting (CCR) strategy refines per-snippet confidence scores by leveraging foreground-aware similarity, effectively suppressing irrelevant action classes during inference.Extensive evaluations show that our TF-CADE not only achieves state-of-the-art performance under in-distribution settings but also excels in cross-dataset generalization to unseen action classes.

Abstract:
Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images--irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.

Abstract:
3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material maps and the environment map. The third stage utilizes factorized inverse path tracing (FIPT) on the G-buffer rendered by the PGSR, and visibility and indirect illumination are evaluated by hardware-accelerated ray tracing on the mesh with the material maps and the environment map reconstructed in the second stage. Experiments demonstrate that the relighting performance of GHPT outperforms the baselines, and our method can perform real-time relighting and composition of Gaussian splatting.

Abstract:
We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is then decomposed into individual entities, which are regenerated at higher resolution, synthesizing additional details with guidance from the image generator. We ablate key design choices and compare qualitatively against existing scene generators, showing that our design addresses their gaps.

Abstract:
The training of diffusion models is computationally intensive, making effective pre-training essential. However, real-world deployments often demand models of variable sizes due to diverse memory and computational constraints, posing challenges when corresponding pre-trained versions are unavailable.To address this, we propose FINE, a novel pre-training method whose resulting model can flexibly factorize its knowledge into fundamental components, termed learngenes, enabling direct initialization of models of various sizes and eliminating the need for repeated pre-training.Rather than optimizing a conventional full-parameter model, FINE represents each layer's weights as the product of U_ \star , \Sigma_ \star ^ (l) , and V_ \star ^\top, where U_ \star and V_ \star serve as size-agnostic learngenes shared across layers, while \Sigma_ \star ^ (l) remains layer-specific.By jointly training these components, FINE forms a decomposable and transferable knowledge structure that allows efficient initialization through flexible recombination of learngenes, requiring only light retraining of \Sigma_ \star ^ (l) on limited data.Extensive experiments demonstrate the efficiency of FINE, achieving state-of-the-art performance in initializing variable-sized models across diverse resource-constrained deployments. Furthermore, models initialized by FINE effectively adapt to diverse tasks, showcasing the task-agnostic versatility of learngenes.

Abstract:
Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity.We present PEARL, \underline P rocrust\underline e s \underline a lignment with text-awa\underline r e \underline L aplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. In this work, our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. Our approach, PEARL, sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.

Abstract:
Cancer survival prediction through multimodal learning that combines histopathology images with genomic data represents a promising research direction. However, current approaches still suffer from two key limitations. First, most methods operate in a Euclidean feature space, which makes it difficult to capture the intrinsic hierarchies in histopathology, where information is organized from patches to whole-slide images to patients, and in genomics, where it progresses from genes to pathways to patients. Second, they typically discretize survival times into coarse risk intervals, neglecting fine-grained ordinal relationships among samples within the same interval and thus failing to capture the continuous ranking characteristics of survival outcomes.To address these issues, we propose \ourmethod, a hyperbolic hierarchical multimodal learning framework for survival prediction. H2-SurvNet first employs a hyperbolic hierarchical information modeling (H2IM) module that maps multimodal features into a shared hyperbolic space and explicitly encodes intra-modal and inter-modal hierarchies across patches, WSIs, patients, genes, and pathways. On top of this representation, we design a Temporal Ordinal Contrastive learning (TOCL) module that models the temporal progression of survival outcomes by enforcing ordinal risk ordering through contrastive objectives, thereby promoting continuity in the learned risk scores.Extensive experiments on heterogeneous cohorts from TCGA, CPTAC, and NLST demonstrate that H2-SurvNet consistently outperforms state-of-the-art multimodal survival prediction methods and exhibits strong robustness and generalization across diverse data distributions. Source code will be released upon acceptance.

Abstract:
Reconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting human poses to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure coherent surface coverage. To ensure the stability of this analytical mapping, we introduce a crucial equilateral regularization term that preserves mesh integrity. Extensive experiments demonstrate that Tavatar achieves superior animation robustness on challenging out-of-distribution poses, reducing normal error by 13.8% on X-Avatar and 17.9% on PeopleSnapshot against the best baseline, while maintaining competitive rendering quality.

Abstract:
Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

Abstract:
Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice.In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is provably recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.

Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses.While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs.To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs.Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source.Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance.

Abstract:
Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart, achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.

Abstract:
Part-level pointing is important for fine-grained interaction and reasoning, yet existing Multimodal Large Language Models (MLLMs) remain limited to instance-level pointing. Part-level pointing presents unique challenges: annotation is costly, parts are long-tail distributed, and many are difficult to specify precisely in language. We introduce POinting at Parts (POP), a training-free, plug-and-play approach that addresses these challenges under a few-shot setup. POP fuses textual and visual attention maps with self-supervised visual correspondences from query image and few-shot examples. On average across the three evaluated datasets, POP achieves accuracy gains of up to 8.9 points in the one-shot setting and 16.4 points in the three-shot setting for the pointing-capable MLLMs--Qwen2.5-VL, Ovis2.5, and Molmo. Notably, even MLLMs without pointing capability benefit significantly from the proposed approach. These results establish a simple yet effective path toward fine-grained spatial grounding in MLLMs.

Abstract:
While recent video generation models achieve impressive visual quality, generating physically plausible videos remains challenging, especially for fluid dynamics and rigid-body motions. To address this, we present NS-Diff, a physics-guided reinforcement learning framework for video diffusion. First, we design a noise-robust physical dynamics detector that distinguishes rigid and fluid regions by analyzing motion in noisy latent frames. Second, we introduce a Physics-Conditioned Latent Injection module, which encodes velocity fields, deformation gradients, and material masks, and injects them into the DiT denoiser via cross-attention. Third, we introduce a reinforcement learning optimization module that enforces simplified Navier-Stokes constraints on fluid dynamics and minimum-jerk principles on rigid bodies through policy gradients. Experiments on PhysVideoBench, UCF, and MSR-VTT show that our approach reduces jerk errors by 43%, decreases fluid divergence by 33%, and improves FVD by 22.7%, achieving higher physical plausibility and visual quality.

Abstract:
Generating dynamic and interactive 3D trees has wide applications in virtual reality, games, and world simulation. However, existing methods still face various challenges in generating structurally consistent and realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive 3D motion for 3DGS reconstructions of real trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing about 8.5k animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.

Abstract:
Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets--including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks--MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.

Abstract:
Recent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coefficients to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing dropout methods with negligible computational overhead and can be readily integrated into various 3DGS variants to enhance their performances.

Abstract:
Cryo-electron microscopy (cryo-EM) is an indispensable technique for determining the 3D structures of dynamic biomolecular complexes. While typically applied to image a single molecular species, cryo-EM holds great potential for structure determination of many targets simultaneously in a high-throughput fashion. However, existing methods typically focus on modeling conformational heterogeneity within a single or very few structures and are not designed to resolve compositional heterogeneity arising from mixtures of many distinct molecular species. To address this challenge, we propose CryoHype, a transformer-based hypernetwork for cryo-EM reconstruction that dynamically adjusts the weights of an implicit neural representation based on the input structure. Using CryoHype, we successfully reconstruct 1,000 distinct structures from cryo-EM imaging in the fixed-pose setting without any pre-existing knowledge of the structures present, which is beyond the capabilities of any existing algorithm.

Abstract:
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.

Abstract:
Recent advances in vision-language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision--language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-T_retrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.

Abstract:
Few-shot Semantic Segmentation (FSS) aims to segment objects of novel categories given only a handful of labeled examples. However, existing methods often rely on complex category-specific modeling, resulting in high computational cost and limited generalization under low-data regimes. To address these challenges, we propose a Bayesian Probabilistic Network (BPNet) that reformulates FSS as a composition of three interpretable components: a prior, a likelihood, and a class-consistency term. Specifically, an efficient Segment Anything Model (SAM) is employed to generate fragmented prior regions for the query image, while both the likelihood and the consistency terms are estimated by a lightweight Class-Agnostic Localization Model (CALM). CALM simultaneously predicts the class consistency between support-query pairs through a binary classification head and estimates the likelihood by localizing the target region in the support image. By evaluating SAM-generated regions in parallel, CALM can efficiently identify the core region, thereby transforming the segmentation problem into a simple binary classification task. Furthermore, to mitigate the semantic incompleteness of SAM proposals, we introduce an attention-based Semantic Completion Module (SCM), which leverages local and global context cues to integrate fragmented regions into semantically complete masks. Extensive experiments demonstrate that BPNet achieves state-of-the-art performance while maintaining high efficiency.

Abstract:
Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses closed models such as GPT-4o and larger open-source models.

Abstract:
Multi-Exposure Fusion (MEF) seeks to generate a single high-quality image from multiple inputs captured at different exposure levels. Despite substantial progress, most existing approaches depend on statistical metrics that poorly reflect human perceptual preferences. Electroencephalography (EEG) provides a direct physiological window into human cognition, yet its use in low-level vision remains limited due to scarce paired data and the absence of bio-signals during inference. We address these challenges through two key contributions. First, we introduce Cog-Expo, the first dataset capturing human cognitive responses to multi-exposure stimuli, establishing a bridge between neuroscience and computational photography. Second, we propose a bi-level coupled learning framework that leverages this cognitive information without requiring it during inference. A Mental Integrated Transformer serves as the Teacher, incorporating cognitive priors to guide visual feature learning, while a lightweight Student is trained to approximate these cues using only image inputs. Through bi-level optimization, the Teacher learns inherently distillable representations, enabling the Student to emulate cognitive guidance efficiently. Extensive experiments confirm that our method achieves state-of-the-art fusion performance and aligns more closely with human perception.

Abstract:
Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework GeRaF 2.0 that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, GeRaF 2.0 achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state-of-the-art in RF-based geometry reconstruction.

Abstract:
Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space (\mathcal H -space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the \mathcal H -space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.

Abstract:
3D single object tracking (SOT) in point clouds is essential for real-world 3D perception, yet it remains challenging due to data sparsity and large variations in scale and structure across diverse object categories. Most existing methods rely on a category-specific paradigm that trains separate models for each class, severely limiting scalability and generalization in real deployment. Extending these methods to a single model capable of tracking diverse object categories proves inadequate, as the significant variations across categories make it difficult to establish reliable geometric correspondences without category-specific priors. To overcome these limitations, we propose a Unified Structural KeyPoint Tracker (UniKPT), a novel structure-aware and generalizable framework for category-unified 3D point cloud tracking. UniKPT comprises three key modules: (1) an adaptive keypoint extractor that produces scale-aware and semantically meaningful keypoints; (2) a progressive correspondence aligner that establishes robust cross-frame geometric associations; and (3) a confidence-aware structural localization module that suppresses unreliable matches and leverages fine-grained structural relationships for precise 3D localization. Extensive experiments on the nuScenes and KITTI benchmarks show that UniKPT achieves new state-of-the-art performance in category-unified 3D SOT. On the challenging nuScenes dataset, our unified model further surpasses category-specific state-of-the-art trackers by 4.37% in Success and 5.16% in Precision.

Abstract:
Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (\modelname), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. \modelname achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human-annotations: 6xto 65xfewer than existing VLM-based approaches.

Abstract:
Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods, such as 3D reconstruction from sketches, often produce non-editable, approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models. The CADExpert dataset is publicly available at https://modelscope.cn/datasets/zhuofanChen/CADExpert.

Abstract:
Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets.

Abstract:
Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable Lightroom adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

Abstract:
Text-to-motion generation aims to synthesize realistic and semantically aligned 3D human motions from natural language descriptions. Existing diffusion-based methods often rely on isotropic latent priors and shallow cross-modal supervision, which lead to semantic entanglement, limited controllability, and poor interpretability. We propose HESP, a unified diffusion framework that hierarchically enhances semantic priors for disentangled text-driven motion generation. At its core, HESP introduces an Adaptive Gaussian Variational Autoencoder (AG-VAE) that structures the latent motion manifold into multiple semantically coherent submanifolds, enabling interpretable and controllable motion representations. To further bridge linguistic and kinematic semantics, we design a Dynamic Cross-Modal Memory (DCMM) module for adaptive semantic fusion and a Hierarchical Cross-Modal Attention (HCA) mechanism to capture multi-level text-motion correspondences. Extensive experiments on HumanML3D and KIT-ML demonstrate that HESP consistently outperforms state-of-the-art baselines such as SALAD, MoMask, and MDM, achieving improvement while maintaining higher diversity and physical plausibility. Moreover, the structured latent space of HESP provides interpretable clusters that reveal clear semantic boundaries among different motion categories.

Abstract:
Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results.

Abstract:
In this paper, we explore video anomaly detection (VAD) from a fine-grained perspective, which aims not only to detect anomalous events but also to identify their specific categories. Due to the limited number of examples per category, existing methods either fail to handle intra-class variation across diverse contexts or struggle with inter-class confusion caused by shared visual primitives. To address these challenges, we propose a progressive cross-granularity learning paradigm that leverages coarse- and fine-grained labels in a complementary manner to progressively refine representations from generic anomaly patterns to category-specific semantics. Building on this paradigm, we develop Fine-VAD, a progressive alignment framework that aligns video features with supervision signals at multiple granularities. Extensive experiments on two benchmark datasets demonstrate that Fine-VAD achieves up to 47.7% relative improvement in fine-grained anomaly classification. Notably, our paradigm generalizes well across diverse model architectures, offering an adaptable and effective solution for real-world fine-grained VAD.

Abstract:
Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advancement in video-text-to-audio (VT2A), the current formulation faces three key limitations: (1) an imbalance between visual and textual conditioning that leads to visual dominance; (2) the absence of a concrete definition for fine-grained controllable generation; (3) weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley (Event-Centric Hierarchical cOntrol Foley), a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets and 42,000 fine-grained sounding event annotations.Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

Abstract:
Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill- posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the first time. Event cameras offer much higher HDR (120dB vs. 60dB) and microsecond latency, therefore they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion mod- els to reconstruct clear images from hazy inputs by effec- tively transferring HDR information from events. Specifi- cally, we design an event-guided module that maps sparse HDR event features, e.g., edges, corners, into the diffusion latent space. This clear conditioning provides precise struc- tural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.

Abstract:
Diffusion-based motion generation has advanced rapidly, but current methods still struggle with long-horizon consistency, style control, and multi-condition guidance. A major reason is the fused-conditioning design, where semantic, stylistic, and temporal signals share a single pathway, causing interference and limiting controllability.We propose MoCoDiff, a controlable autoregressive diffusion framework that introduces Injection Modulation Controllers (IMC). IMC is a lightweight, modality-specific linear modulation modules that inject text, style, and history signals through separate conditioning paths. IMC preserves the simplicity of a frozen backbone while avoiding the entanglement inherent to fused conditioning, enabling more stable and interpretable multi-condition control.To further enhance long-range synthesis, we develop a controllable autoregressive diffusion model equipped with Temporal IMC (TIMC), which applies history as a timestep-dependent corrective signal. This controllable formulation actively suppresses drift, enforces smooth transitions across motion segments, and significantly improves temporal coherence over extended sequences.Experiments show that MoCoDiff achieves state-of-the-art style fidelity, transition quality, and efficiency, while supporting flexible and interpretable multi-condition motion synthesis without retraining.

Abstract:
Text-to-Image Person Retrieval (TIPR) aims to retrieve pedestrian images with a given natural language description. It remains highly challenging due to the inherent ambiguity in cross-modal alignment: existing models often struggle to capture fine-grained correspondences, and their understanding of detailed pedestrian attributes is typically confined to partial or coarse cues, leading to mismatched or erroneous retrieval results. To overcome this challenge, we propose CECA, a Conversation-Enhanced Cross-modal Alignment framework. CECA strengthens the attribute correspondence between textual and visual modalities through multimodal large language models (MLLMs)-guided dialogue, enhances token-level alignment via a Bidirectional Cross-attention Mixer (BCM), and stabilizes optimization with a Confidence-Aware Weighting Loss (CAWL) that reduces the impact of low-quality conversational responses. Extensive experiments on three public benchmarks demonstrate the superior performance and strong generalization ability of our approach.

Abstract:
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

Abstract:
Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.

Abstract:
Immunofluorescence (IF) images reveal detailed information about structures and functions at the subcellular level. However, unlike RGB images, IF datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Although existing approaches build channel-adaptive models for training, they do not perform evaluations across IF datasets with unseen channel configurations. To address this, we first introduce a biologically informed view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference for the concept channels in the image. We leverage this view to propose Channel Conditioned Cell Representations (C3R), a framework that learns representations that transfers well to both in-distribution (ID) and out-of-distribution (OOD) datasets which contain same and different channel configurations, respectively. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while yielding state-of-the-art results on frozen encoder evaluation on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IF datasets, with no need for retraining on unseen channel configurations.

Abstract:
As efficient alternatives to softmax Attention, linear state space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented settings. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA is inspired by the Kalman Filter and solves ridge regression problems online at test time, with constant memory and linear time in the sequence length. An insight is that standard Kalman filter equations are numerically unstable in low-precision environments (e.g., bfloat16) and difficult to parallelize on modern GPUs. We address both challenges via two innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the problem, ensuring numerical stability and balancing memory retention; (2) the use of Chebyshev Iteration instead of conventional iterative solvers, which we show to be more stable in low-precision settings. To improve scalability, we implement Chebyshev Iteration in a hardware-aware, chunk-wise manner, along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. On short-context tasks, GKA shows strong language understanding capabilities and outperforms existing SSMs (e.g., Mamba2, and Gated DeltaNet). On long-context tasks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens with more than 10% relative improvement over baselines. Finally, we show GKA outperforms Mamba when extended for ImageNet classification.

Abstract:
Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

Abstract:
Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains, Natural, Document, E-commerce, and Embodied AI, each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains.

Abstract:
The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success -- specifically, whether they encode classical statistical signal priors or more complex features -- remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic 1/|f^\alpha| spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data.

Abstract:
Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices.To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality.Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead.Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8xlarger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.

Abstract:
Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI.Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools.In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation.Specifically, we introduce a conditional image branch into the pre-trained video generation model.Identity control is achieved through restricted self-attentions with conditional position mapping.Thanks to these designs, which greatly preserve the pretrained prior of the video generation model, our approach is able to outperform other full-parameter training methods in video quality and identity preservation, even with just 1% additional parameters and only 2000 training pairs.Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.Code and dataset will be available to the community.

Abstract:
Caching-based acceleration methods have recently driven significant progress in efficient video generation with diffusion models. However, we identify a critical limitation when directly applying these acceleration techniques to auto-regressive video diffusion models, which generate long videos by sequentially synthesizing segments conditioned on historical context. In such settings, any approximation errors introduced by acceleration tend to propagate and accumulate over time, resulting in severe error accumulation and progressive degradation of video quality. To address this challenge, we propose ARCache, the first training-free caching-based acceleration framework specifically designed for auto-regressive video diffusion models. ARCache improves both the timing and quality of caching through two key components. First, History-Guided Cache (HGC) leverages historical information to adaptively schedule caching for each segment, enabling more accurate and efficient cache utilization. Second, Enhanced Residual Correction (ERC) adaptively refines the residual trajectory for subsequent segments, effectively mitigating error accumulation while introducing minimal computational overhead. Extensive experiments on FramePack-F1, SkyReels-V2, and auto-regressive world model Matrix-Game demonstrate that ARCache achieves state-of-the-art acceleration and visual fidelity.

Abstract:
Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it significantly yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structural faithful visual text generation.

Abstract:
Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, ie, logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.

Abstract:
Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging especially when human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we for the first time, consider to introduce novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sample 3D vertex features from both views and introduce a bidirectional cross-view Transformer to integrate multi-view vertex representations for accurate contact estimation. Finally, the predicted contact maps are leveraged to refine human-object meshes, yielding geometrically consistent and physically plausible reconstructions.Experiments on BEHAVE and InterCap show that our proposed CrossHOI surpasses state-of-the-art methods in both reconstruction accuracy and contact prediction, especially under severe occlusions.

Abstract:
Human-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar reconstructed from a sparse set of key views that are transmitted only once and driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model. The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling. Experiments across multiple human-centric datasets demonstrate superior rate-distortion performance, particularly for long and densely captured sequences, and naturally enable semantic editing.

Abstract:
Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To achieve the challenging task, we propose an approach that allows to infer a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is applicable to any plant that can be represented as binary axial tree. We quantitatively evaluate our approach on Chenopodium Album plants on reconstruction, segmentation and skeletonization, which are important problems in plant phenotyping. In addition to carrying out several tasks at once, our method achieves results on-par with strong baselines for each task. We apply our method, trained exclusively on synthetic data, to 3D scans and show that it generalizes well.

Abstract:
Outlier rejection for correspondence-based point cloud registration confronts two fundamental challenges in real-world scenarios. First, low-overlap regions yield sparse and fragmented inlier distributions that are difficult to discover using conventional one-step global search strategies. Second, large-scale scenes present dense correspondence inputs that impose stringent requirements on the accuracy-efficiency trade-off of search algorithms. To this end, we propose a hierarchical multi-hop graph search framework that progressively refines correspondences to address these challenges. Our method constructs a compatibility graph with transformation-invariant embeddings to predict correspondence confidence, establishing the foundation for cluster-balanced seed sampling that ensures comprehensive coverage across fragmented regions. These strategically selected seeds subsequently drive hierarchical multi-hop expansion, progressively discovering inliers through multi-resolution graph layers while circumventing the high complexity of exhaustive global search. Finally, distribution-aware ranking jointly evaluates geometric consistency and spatial coverage to select well-distributed transformations from multiple hypotheses. Experiments on 3DMatch, 3DLoMatch, and KITTI demonstrate that our method achieves highly competitive performance compared to existing methods in both low-overlap and large-scale scenarios.

Abstract:
Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.

Abstract:
Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unreliable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent unreliability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps produced by cascading warp operations in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively boosts depth probability estimates and updates the depth candidates, the depth map is gradually refined, resulting in accurate Gaussian means. We conduct experiments on RealEstate10K, ACID and DL3DV. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, our IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.

Abstract:
Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework.

Abstract:
Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance (GED)-based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty, confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.

Abstract:
We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of 12 million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: github.com/kerner-lab/MOMO.

Abstract:
Transferring pre-trained knowledge from a source model to a target model of a different architectural size is a key challenge for flexible and efficient model scaling. However, current parameter-space methods treat Small-to-Large (S2L) and Large-to-Small (L2S) scaling as separate, incompatible problems, focusing on parameter synthesis and selection, respectively.This fragmented perspective has resulted in specialized tools, hindering a unified, bidirectional framework.In this paper, we propose BoT (Bidirectional knowledge Transfer), the first size-agnostic framework to unify S2L and L2S scaling.Our core insight is to treat model weights as continuous signals, where models of different sizes represent distinct discretizations of the transferable knowledge.This multi-resolution perspective directly casts S2L and L2S scaling as the signal processing operations of upsampling and downsampling, naturally leading to the adoption of the Discrete Wavelet Transform (DWT) and its Inverse (IDWT).BoT leverages the recursive nature of wavelets, using the decomposition level as a dynamic scaling factor to bridge disparate model sizes in a parameter-free and computationally efficient manner. Extensive experiments on DeiT, BERT, and GPT demonstrate significant pre-training FLOPs savings (up to 67.1% for S2L, 52.8% for L2S) and state-of-the-art performance on benchmarks like GLUE and SQuAD.

Abstract:
Ray-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, they are usually lacking in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays.In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes.For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing.For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.

Abstract:
Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.

Abstract:
Combining multiple object detection datasets offers a path to improved model generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features and address pseudo-label noise, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual re-annotation. Across multiple benchmarks, LAT demonstrates consistent improvements in detection performance, achieving gains of up to +8.4 AP over baseline methods.

Abstract:
Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, limiting robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier-Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution, justifying the use of a Temporal Convolutional Network (TCN). Based on this principle, we design the PHASE-Net, a lightweight model with three key components: 1) Zero-FLOPs Axial Swapper module to swap or transpose a few spatial channels to mix distant facial regions, boosting cross-region feature interaction without changing temporal order; 2) Adaptive Spatial Filter to learn a soft spatial mask per frame to highlight signal-rich areas and suppress noise for cleaner feature maps; and 3) Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance and strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.

Abstract:
Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve objects consistency, identities and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with less omitted objects and visually appealing results.

Abstract:
Statistically consistent methods based on the noise transition matrix (T) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating T. The common assumption is that, given a perfect T, noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a T-estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.

Abstract:
Open-vocabulary 3D object affordance grounding aims to identify functional regions of objects given arbitrary semantic descriptions. However, existing methods often rely on fixed training categories and geometric priors, lacking geometric invariance and analogical reasoning capabilities. Since there exists a significant domain gap when transferring affordance knowledge learned from 2D images to 3D point clouds, existing methods struggle to generalize well to objects with diverse shapes or unseen categories, and fail to perform effective category reasoning.To address these challenges, we propose QueryMe, a Query-driven framework that learns from Multimodal evidence spaces to achieve open-vocabulary 3D affordance grounding.The proposed approach is to project human-object interaction images into 3D space, employ an Adaptive Spatial Attention module to focus on key interaction regions, and introduce a multimodal query structure to retrieve available geometrically consistent functional parts within the point cloud, effectively fusing visual, linguistic, and geometric cues.Leveraging attention-based query mechanisms, our method adaptively localizes affordance regions and performs analogy reasoning through geometric similarity, thereby exhibiting strong generalization to unseen scenes and objects. Experimental results demonstrate that QueryMe consistently outperforms state-of-the-art approaches, with the AUC improving by 4.19% compared to previous work for unseen affordance grounding tasks.

Abstract:
Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models.We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?).To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions.For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency.For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition.Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.

Abstract:
Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Quality Evaluator (QE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The QE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

Abstract:
Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

Abstract:
Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localization-classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5%-4.2% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.

Abstract:
Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.

Abstract:
Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations:(a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features.Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real world materials, and the integration of priors from vision-language models significantly enhances the final performance.

Abstract:
Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these finding, HorizonSuite establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation. achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method.

Abstract:
This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, which are integrated with MTL encoder features for multi-task prediction. This module is architecture-agnostic and can be applied to both single- and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.

Abstract:
Small object detection (SOD) plays a vital role in applications such as anti-UAV tasks, yet conventional image-based methods struggle in high-speed scenarios due to the limited frame rate. Event cameras offer a promising alternative by capturing spatiotemporal event streams with microsecond-level temporal resolution. To address the inherent sparsity of small objects in event data, existing methods typically formulate the detection task as semantic segmentation on spatiotemporal point clouds to leverage long-term contextual information. However, these methods often fail to enforce effective spatiotemporal consistency constraints, resulting in fragmented object trajectories. To mitigate these problems, we propose a topology-constrained sparse convolutional network (SpTopoNet), which models the topological structure of moving object trajectories in event point clouds. Our network comprises two key components: a Topology Learning Module (TLM) that discriminates local structures to separate genuine targets from noise, and a Spatial Consistency Module (SCM) that captures long-range spatiotemporal dependencies to enhance trajectory continuity. Additionally, we introduce an event topology-aware loss function that leverages topological correlations to guide the network to maintain structural integrity of target event patterns.Experiments on the benchmark dataset demonstrate the superiority of our method in both detection performance and trajectory completeness.

Abstract:
Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning with verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically depends on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale the training of MRMs. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable reinforcement learning for MRMs with limited multimodal data. MSRL redefines the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., 68.5%-74.8% on VLReward Bench, 69.2%-75.4% on GenAI-Bench), without requiring additional multimodal preference annotations.

Abstract:
Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.

Abstract:
Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ the deterministic textual feature adapter to refine each category textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. Obviously, the traditional deterministic adapter model cannot adequately capture this varied semantic information. Also, it is desirable to exploit the inter-class relationships in VLM adapter. To address these issues, we propose to exploit random graph model into VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that, VRGAdapter provides a more general adapter solution that encompasses traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.

Abstract:
Recent advances in vision-language foundation models have enabled text-driven evaluation of image aesthetics and visual quality. However, existing models are typically optimized for fixed prompts or specific datasets, limiting their adaptability to diverse evaluation criteria. This paper presents Probabilistic Prompt Adaptation (PPA), a unified probabilistic framework that flexibly predicts aesthetic and quality scores conditioned on arbitrary text prompts. PPA formulates score prediction as a mixture over prompts, dynamically estimating prompt suitability based on both image content and task context. By marginalizing over prompts pre-sampled from a large language model (LLM), it enables annotation-free training using only triplets of task, image, and score. Experiments across multiple IAA and IQA benchmarks demonstrate that PPA achieves consistent and perceptually aligned prompt-based scoring, allowing fine-grained control over evaluation semantics.

Abstract:
Temporal forgery localization aims to temporally identify manipulated segments in videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To overcome this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at https://activityforensics.github.io.

Abstract:
Given an image and a descriptive prompt, image inversion seeks to identify the initial noise that, when denoised, accurately reconstructs the original image. This is crucial for applications like image editing, which can be achieved by denoising the inverted noise with an edited prompt. Existing methods often rely on approximations. They often require many steps, leading to inaccuracies, slow processing, and artifacts due to the inherent intractability in the inversion process. To overcome these issues, in this work, we propose FlashIn, a novel algorithm for faster and more accurate image inversion, enabling high-quality, real-time editing. FlashIn offers two main contributions: i) A learnable neural network directly maps an image to its corresponding noise. Trained with a cycle-consistent strategy using generated data and seed noise, this approach yields a more efficient and precise inversion model. ii) Adversarial training aligns noise-reconstructed images with real ones, enhancing inversion accuracy and editing quality. These strategies enable a fast, accurate inversion process in a single step, with further improvements possible through additional steps. Integrated with few-step diffusion models such as Flux.1-Schnell, our method achieves high-quality image editing within one second on a single GPU, facilitating real-time, interactive editing. Extensive experiments demonstrate that FlashIn delivers state-of-the-art inversion precision and impressive editing results across various scenarios and applications.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to allocate precise Gaussian updates directly from immediate reward feedback, iteratively. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-level distortions. To achieve efficiency, W2M operates on a structured 3DGS backbone organized around learnable anchors and applies policy-guided per-anchor gradient scaling. Extensive experiments across the Blender, LLFF, and Mip-NeRF 360 datasets demonstrate that W2M achieves state-of-the-art bit accuracy, strong perceptual fidelity, and structural consistency under both standard and adversarial conditions.

Abstract:
Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, restricting the application scenarios of the model. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single sample quality as the reward signal, driving the model to converge toward a few high reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality-diversity trade-off.Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves an 13%~18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.

Abstract:
3D medical image classification is essential to modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. It allows efficient scaling to new tasks by adding only lightweight plugins ( 1M parameters per task) to a single frozen backbone. Besides, this versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities and systematically evaluate state-of-the-art 3D classification techniques. Our analysis reveals several key insights: (1) effective adaptation is critical to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (e.g., 1st place in the challenge), eliminating the need for separate task-specific 3D models.

Abstract:
Understanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols. MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted) and Dam reunion (restricted and unrestricted). Recordings are densely annotated with 23 fine-grained behaviors, 39 body keypoints across 157 test sessions, 4 spatial zones, and 43 subjects, describing interactions among subjects, objects, humans, and other cattle. We establish three benchmarks on MooCap: (1) dense temporal action segmentation over 1200-1500-second sequences; (2) pose-based behavior and interaction recognition from keypoint trajectories; and (3) longitudinal behavioral classification linking adult behaviors with rearing conditions. Benchmarking results reveal that state-of-the-art temporal segmentation models achieve only 66.4% frame accuracy and 30.6% F1@0.5, with performance degrading further during interaction-heavy segments. Overall, MooCap bridges multi-view pose estimation, multi-entity tracking, and structured behavioral protocols to enable interaction-aware models for animal behavior analysis.

Abstract:
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning, a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a credit assignment strategy that emphasizes early (planning) and late (synthesis) reasoning phases. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes. Furthermore, it surpasses standard GRPO by +3.7 on OmniBench and +1.9 on Video-Holmes, while demonstrating 5 times sample efficiency, requiring 80 fewer generated completions to reach target performance.

Abstract:
Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently capture dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a geometry-driven video world model that generates dynamic, realistic videos from a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state as a static background point cloud and per-object 3D Gaussian trajectories. This representation captures each object's motion path and probabilistic 3D occupancy over time, providing a flexible, category-agnostic alternative to rigid bounding boxes and parametric models. We render 4D Geometric Control into 4D control maps for a pretrained video diffusion model, enabling high-fidelity, view-consistent video generation that faithfully follows the specified dynamics. To enable training at scale, we develop an automatic data engine and construct VerseControl4D, a real-world dataset of 35K training samples with automatically derived prompts and rendered 4D control maps. Extensive experiments show that VerseCrafter achieves superior visual quality and more accurate control over camera and multi-object motion than prior methods.

Abstract:
Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder.Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints.PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.

Abstract:
Long-video understanding remains challenging due to extreme temporal redundancy, sparse yet decisive events, and the instability of long-horizon reasoning in visual-language models (VLMs). Existing agent-based methods invoke external micro-tools but remain static, repeatedly rebuilding long chains of fine-grained operations for each task without acquiring reusable multi-step perceptual skills.We propose META, the first training-free agent capable of self-evolving its tool-augmented reasoning. META operates through dual Solving and Evolving loops: it analyzes its own tool trajectories, abstracts recurring multi-step patterns into reusable macro-tools, and distills failed executions into structured failure priors that refine tool usage. Through symbolic consolidation and pruning, META progressively shortens reasoning paths and acquires more general perceptual and temporal abilities--without any parameter updates. META achieves state-of-the-art performance on long-video benchmarks, demonstrating a scalable, model-agnostic paradigm for long-video understanding that can continually evolve without additional training.

Abstract:
With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.

Abstract:
The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we proposed a simple yet effective strategy Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34 which achieves a large margin between all of these methods. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.

Abstract:
Test-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods--whether closed-set or open-vocabulary--focus exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this critical gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the enhanced semantic context to produce more accurate box coordinates and classification logits. ViTPrompt requires no backpropagation, parameter updates, or external memory, making it highly efficient for real-time deployment. Experiments on multiple out-of-distribution benchmarks demonstrate that ViTPrompt achieves state-of-the-art performance, delivering consistent improvements in both localization accuracy and classification fidelity , and establishing itself as a holistic solution for open-vocabulary TTAOD.

Abstract:
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded/partially visible objects.In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts(STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability.To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSound-Director and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable.

Abstract:
Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.

Abstract:
Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

Abstract:
Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces com- putational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks ad- dress this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing. The high-frequency branch preserves full-resolution structural details, while the low-frequency branch processes downsampled inputs through dual sub-branches capturing short- and long-range dependencies. A Hybrid-Frequency Fusion mod- ule integrates these observations, guided by two novel objectives: Cross-Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross-Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state-of-the-art performance with mIoU of 80.22 and 83.39, respectively.

Abstract:
Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available here.

Abstract:
Fine-tuning has become the default way to adapt powerful foundation models, but this also enables low-cost repurposing for harmful objectives. Existing immunization methods try to optimize local geometry or simulate short attacker horizons, and penalize observed loss drops. However, in practice, downstream tuners run thousands of updates and overcome these short-horizon defenses.In this paper, we propose CLAMP (Contractive Long-horizon Attacker Mitigation via Progress-bounding), an immunization method that traps harmful fine-tuning by shaping the attacker's optimization dynamics rather than only the initial landscape. Our key idea is to make harmful training locally contractive, making each update smaller than the last. This yields a closed-form bound on the attacker's training beyond the attacker's simulated training steps. We also introduce a Hessian-free directional curvature penalty, to create adversarial landscapes along harmful descent directions. Our bi-level objective minimizes the attacker's predicted improvement from train step zero to infinity. Experiments show our method withstands long-horizon fine-tuning across classification, generative, and autoregressive settings, substantially reduces harmful task adaptation, while preserving benign utility and fine-tuneability.

Abstract:
Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plucker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.

Abstract:
Computer-Aided Design (CAD) modeling underpins a wide range of industrial applications. During the conceptual design phase, designers often refine initial solutions iteratively to achieve desired results. A key goal of AI-assisted CAD is to support the full modeling workflow from initial generation to iterative refinement. However, most existing approaches treat generation and editing as separate tasks, hindering coherence and adaptability in real-world scenarios. To address this limitation, we propose CAD-Refiner, a unified framework that supports free-form multimodal inputs and enables iterative refinement over previously generated results. Specifically, we design an agent named CAD Insighter that interprets multimodal inputs into topological structure graphs, which explicitly represent the fundamental elements and their relationships within CAD objects. We then propose a carefully designed decoder architecture and a Sequence Injection Strategy (SIS) to enable multiple applications within a unified modeling framework. Furthermore, we propose CAD Checker, an error-aware feedback module that performs geometry-based reward shaping during optimization, enhancing modeling quality and geometric validity. Additionally, we introduce MMCAD, a multimodal extension of DeepCAD tailored for CAD generation and editing. Extensive experiments demonstrate the effectiveness of CAD-Refiner across multiple tasks.

Abstract:
Multi-agent 3D reconstruction, as a key technology for large-scale VR/AR, robot swarms, and digital twins, has attracted growing attention. Recent end-to-end 3D reconstruction methods achieve strong performance in single-agent scenarios, but they are difficult to directly extend to multi-agent collaborative settings, where they often suffer from unstable tracking, excessive memory consumption, and frequent loop-closure failures, thus failing to meet real-time and large-scale deployment requirements. To address these issues, we propose TOPOMA, a real-time end-to-end 3D reconstruction framework tailored for multi-agent collaboration. TOPOMA explicitly models the spatial topological structure of the scene and tightly couples it with end-to-end representation learning, thereby jointly solving core challenges such as inter-agent spatial alignment and submap fusion. Concretely, we introduce topology skeleton modeling and optimization, decentralized loop closure, and topology-guided residual transport, and build upon them a fully distributed inference architecture in which each agent can independently store, reconstruct, and incrementally optimize its map while collaborating through lightweight topological information. Extensive experiments demonstrate that, compared with existing methods, TOPOMA achieves consistently higher trajectory accuracy, reconstruction quality, robustness, and topological consistency, showing superior adaptability and scalability.

Abstract:
Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360^\circ environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. The code will be released if accepted.

Abstract:
Large vision foundation models, such as SAM2, have achieved remarkable performance in video object segmentation and tracking (VOST). However, their effectiveness is hindered by significant computational overhead. While model pruning is a widely used strategy to address this issue, traditional static and input-agnostic pruning approaches fall short in managing the diverse and complex nature of video data effectively. A promising alternative is dynamic networks, yet they often struggle to translate theoretical computational reductions into actual acceleration. Furthermore, both static and dynamic approaches typically focus on visual features of individual frames while neglecting the temporal correlations between them, limiting their performance in handling complex video streams.To address these challenges, we propose Recurrent Dynamic Submodel (RDS), a dynamic architecture that adaptively selects submodel blocks for each frame. Specifically, it has a lightweight Prediction-aware Router (PAR), which leverages both the segmentation mask from the previous frame and the visual features of the current frame to make routing decisions, enabling the submodel to explicitly capture the temporal nature of video data. Additionally, to reduce the cost of adapting the dynamic submodel, we introduce an Importance-aware LoRA (I-LoRA), tuning parameters only in the most critical blocks. Extensive experiments show that our approach achieves an optimal trade-off between performance and speed, requiring minimal trainable parameters and training data.

Abstract:
Humans perceive and reason about their surroundings in four dimensions - three spatial and one temporal axis - by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning through an iterative retrieval-reasoning loop. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in structured 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.

Abstract:
Stereo image compression is essential for a wide range of 3D vision. Recent methods have demonstrated strong capabilities in eliminating inter-view redundancy and enabling compact entropy coding via spatial-domain stereo transformation and advanced autoregressive entropy models. However, these approaches often suffer from high-frequency information loss and incur considerable coding latency. To overcome these limitations, we propose a novel frequency stereo context transfer (FSCT) module. Unlike spatial-domain methods, the FSCT module separately captures inter-view redundancy in high- and low-frequency components and dynamically balances their contributions to preserve reconstruction quality. In addition, we replace the conventional autoregressive framework with a checkerboard strategy and integrate the FSCT module to model inter-view priors, enabling faster and more efficient entropy coding. Extensive experiments demonstrate that our method achieves state-of-the-art rate-distortion performance among existing stereo image compression approaches, while also attaining the lowest coding latency.

Abstract:
Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, masking them, and evaluating relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI when combined with GroundedSAM, automatically categorizes predictions as foreground or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we conduct a comprehensive evaluation of eighteen CLIP variants, providing both methodological advances and empirical evidence that chart a path toward more robust vision-language models.

Abstract:
Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%.

Abstract:
3D LiDAR-based gait recognition has gained increasing attention due to its robustness to illumination, privacy preservation, and capability for long-range and non-contact identity verification. However, existing point cloud-based methods suffer from two critical limitations: they fail to model semantically distant correlations across spatial scales and employ simplistic temporal aggregation that cannot handle gait's inherent heterogeneity. To address these limitations, we propose MS^2Gait, a multi-scale spatio-temporal framework tailored for raw point cloud gait recognition. Our Hierarchical Spatial Feature Extraction module introduces four complementary interaction strategies to explicitly capture long-range semantic dependencies and recover structural information under blockage. Additionally, a Similarity-based Temporal Enhancement Transformer strategy leverages multi-scale aggregation to dynamically weight frames based on motion coherence, effectively handling temporal heterogeneity without explicit supervision. Extensive evaluations on SUSTech1K and FreeGait demonstrate that MS^2Gait achieves 93.5% and 83.1% in Rank-1 accuracy, respectively, outperforming prior state-of-the-art methods, while exhibiting significant robustness against non-gait nuisance factors.

Abstract:
Accurate global localization is critical for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to facilitate horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to mitigate vertical misalignment, explicitly bridging the viewpoint gap. To further strengthen view invariance, we introduce a view-reconstruction loss that encourages the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.

Abstract:
Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arises from keeping the two modules disjoint, (e.g. during image and/or text decoding). Extensive experiments show that our approach consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.

Abstract:
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.

Abstract:
Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency--constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift.We argue that 360deg video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360deg Video Diffusion, a controllable 360deg video generation framework that synthesizes high-fidelity videos from sparse 360deg inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency.Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360deg scene generation for downstream simulation and digital-twin applications.

Abstract:
Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic self-representation alignment framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.

Abstract:
Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.

Abstract:
Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding.We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR.ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations.While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes.Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation.This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data.We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation.On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.

Abstract:
Automatic album organization has been studied extensively over the past decades due to significant progress in digital photography. Recent vision-language models (VLMs) have shown strong performance on multi-image understanding, making them natural candidates for automating album organization workflows. While VLMs' abilities in multi-image understanding have been widely studied, their performance on album organization remains underexplored. To bridge this gap, we introduce AlbumBench, the first comprehensive benchmark for automatic album organization. Specifically, we (1) define album organization tasks as photo selection for album-specific user objectives, photo rating according to how well user intents are fulfilled, and album-specific photo grouping given a user query which requires contextual understanding of the album; (2) establish AlbumBench, a benchmark dataset containing 27051 images across 641 albums with 5 annotations per image; and (3) evaluate mainstream open-source and proprietary VLMs on AlbumBench. We show that AlbumBench presents unique challenges compared to traditional multi-image understanding benchmarks due to its requirement for understanding album context and user intent. Our findings reveal a significant performance gap between open-source and proprietary VLMs on album organization tasks. Despite this gap, even the best-performing proprietary models sometimes struggle with tasks that humans find relatively easy. We hope that AlbumBench can serve as a foundation for unifying album organization research and motivate improvements in VLMs' performance on these tasks.

Abstract:
Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in CPT? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image-text pair a utility score that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware and Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias-variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the least performance drop under all retention ratios.

Abstract:
Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose.To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure.We validate DICArt on both synthetic and real-world datasets with multi-hinged articulated objects. Experimental results demonstrate its superior performance and robustness over state-of-the-art baselines. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments. Codewill be publicly available upon acceptance.

Abstract:
Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible, multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

Abstract:
With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor--temporal imbalance--as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose the Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.

Abstract:
3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)--where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

Abstract:
All-in-one image restoration aims to recover clean images from heterogeneous degradations with a single model, but joint training on multiple degradations with a shared backbone often induces cross-task interference and unstable optimization, making it hard to maintain strong performance across all tasks. To address this, we propose Retrieve-to-Restore (R2R), a lightweight framework that decouples degradation adaptation from backbone computation through a retrieval-based degradation bank. Specifically, R2R externalizes degradation knowledge as unified, degradation-specific priors stored in a compact degradation bank. A degradation amalgamator aggregates intra-class features guided by ground truth into task-level clean priors during training, while degradation matching retrieves the most relevant prior at inference to modulate backbone features for restoration. This retrieval-guided design explicitly separates degradation cues from shared reconstruction capacity, enabling stable multi-degradation training and straightforward scaling to additional degradation types. Extensive comparisons on benchmarks with one, three, and five degradations show that R2R achieves PSNR on par with state-of-the-art all-in-one methods while using about 91% fewer MACs.

Abstract:
Abstract visual reasoning benchmarks such as ARC-AGI evaluate the ability to infer generalizable transformation rules from few graphical demonstrations, a capability where current deep learning models severely underperform. Mainstream LLMs achieve only 15.8% (DeepSeek-R1) and 34.5% (o3-mini-high) accuracy. The core reason lies in their static processing of task examples: unlike humans, who iteratively refine their understanding of examples while solving problems, these models lack mechanisms for dynamically aligning understanding and solving. We address this gap with the Understanding and Solving Reasoning Loop (USRL) framework. The architecture comprises two explicitly interacting modules: an Understanding Module (UM) that encodes and refines rule representations of examples, and a Solving Module (SM) that generates a draft solution informed by these evolving rule representations. Through recurrent interaction, the model iteratively aligns its draft solution with its understanding about task examples continuously. Furthermore, we introduce an adaptive reasoning halting mechanism that autonomously terminates the reasoning loop based on the consistency between the generated draft solution and the examples. With 7M parameters, our model achieves 47.2% accuracy on ARC-AGI-1.

Abstract:
Large Vision Language Models (LVLMs) with retrieval-augmented generation (RAG) are emerging as a main paradigm for processing vision-language medical tasks due to their promising achievements. However, existing approaches exhibit two significant limitations in both retrieval and generation stage: First, during the retrieval stage, most methods typically rely on a single similarity signal to estimate document relevance, ignoring the rich information available in multimodal data, which may fail to accurately retrieve matching content. Second, in the generation stage, retrieved documents are integrated directly and uniformly into the input for LVLMs, without taking into account their varying relevance to the question, which may result in the dilution of crucial information and exacerbate the negative impact of irrelevant content. To address these limitations, we propose MR-RAG, a dual-stage RAG enhancement framework by considering multimodal relevance in both retrieval and generation phases. Specifically, a Multimodal Cooperative Retrieval (MCR) module leverages intra- and cross-modal signals for precise document alignment. Subsequently, an Importance-Aware Information Flow Augmentation (IFA) mechanism adjusts attention paths based on fused relevance to refine answer generation. By bridging retrieval and generation via multimodal signals, MR-RAG significantly improves factual accuracy. Experiments on three medical datasets show our method outperforms state-of-the-art baselines, achieving up to a 6.4% accuracy gain.

Abstract:
Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256x256) and decreases peak memory consumption by 83.8% (1024x1024). We believe Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks. Project Page.

Abstract:
Generalizable gaze estimation methods have garnered increasing attention due to their critical importance in real-world applications and have achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects of label noise on generalization in gaze estimation. Further, we propose a novel solution, called See-Through-Noise (SeeTN) framework, which improves generalization from a novel perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against label noise. Extensive experiments demonstrate that our SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising the source-domain accuracy, and highlight the importance of explicitly handling noise in generalized gaze estimation.

Abstract:
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.

Abstract:
Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.

Abstract:
Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image.Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity.Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection.However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors.Therefore, neither parallel nor sequential prediction is optimal.In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs.A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty.By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.

Abstract:
Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

Abstract:
X-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruction. Our study not only identifies how pose perturbations propagate through the differentiable projection operator but also reveals why splat-based CT is particularly vulnerable to geometric misalignment. The resulting formulation remains lightweight and easily integrable into existing pipelines while substantially improving reconstruction fidelity under real-world sparse-view conditions.

Abstract:
Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.

Abstract:
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores.Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.

Abstract:
Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from background pure generation process. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency.

Abstract:
Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code and checkpoints will be released after acceptance.

Abstract:
The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates "bias-free" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.

Abstract:
High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce RecEdit-Drive, a framework that integrates Spatial Feature Warping and Spatiotemporal Collaborative Modeling to effectively control 3D object variations and enhance video consistency. The spatial feature warping enhances precise control over the edited foreground 3D objects, enhancing spatial consistency in the generated videos; and the spatiotemporal collaborative modeling seamlessly integrates edited foreground objects into the background, yielding realistic and consistent edited videos. Besides, we design an inference strategy to reconstruct an accurate background structure through noise manipulation, providing a reliable reference for foreground instance editing at early denoising stages. We perform extensive qualitative and quantitative comparisons regarding general video editing and downstream tasks on the public datasets, demonstrating the state-of-the-art performance of our proposed method.

Abstract:
Learning with noisy labels (LNL) has received growing attention, with most prior work following the paradigm of clean-sample reliance (e.g., sample selection). However, this reliance also imposes intrinsic limitations, as overfitting to even a few noisy samples is inevitable, creating a major bottleneck for further improvement. This limitation motivates us to go beyond mere clean-sample reliance and explore how to actively forget corrupted knowledge already internalized by models while suppressing further noise assimilation. To this end, we propose FINE, a fundamentally novel perspective for LNL that unifies active ForgettIng via machine unlearning (MU) and Noise supprEssion via negative learning (NL) within a cohesive framework. Specifically, we first reveal two key stages of noise fitting: early-stage generalized learning and later-stage noise overfitting. To actively forget early-stage noise accumulation, we introduce an MU-based module that employs a negative cross-entropy loss to erase corrupted knowledge, while an NL-based module leveraging complementary labels suppresses later-stage overfitting and mitigates reliance on noisy supervision. These modules act synergistically as plug-and-play regularizers, seamlessly integrating into existing baselines. Finally, extensive experiments on both synthetic and real-world noisy benchmarks demonstrate that our FINE consistently boosts robustness and generalization.

Abstract:
Cross-modal alignment aims to learn semantically consistent latent representations across diverse modalities. Prevailing methods rely on a text-guided aggregation paradigm to achieve fine-grained alignment, while they suffer from redundant patch-word correlations and high computational costs. To address these issues, we propose CoV-Align, an effective and efficient fine-grained cross-modal alignment framework with cohesive visual semantics priority. Through a semantically convergent attention mechanism, it progressively aggregates meaningful visual patches in a text-free manner. We design a coarse visual semantic feature extractor that integrates deformable attention and consistent assign attention to group patches with semantic consistency. A cohesive and discriminative feature optimization is presented to enhance intra-semantic cohesion and inter-semantic discriminability of visual region features, resulting in explicit improvements in cross-modal alignment. Extensive experiments demonstrate that CoV-Align achieves state-of-the-art performance on the Flickr30K and MS-COCO benchmarks. Notably, it delivers a 3-5xcomputational speedup compared to pioneer approaches, offering compelling advantages for large-scale multi-modal tasks.

Abstract:
We present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods. We believe this is an important area of research as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain, enabling 4D world creation and sharing.

Abstract:
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

Abstract:
Facial expression recognition (FER) in the wild is severely hampered by label noise and annotation ambiguity. Existing methods, including sample selection, label ensembling, and consistency regularization, primarily rely on ordinary label supervision and offer limited control over non-target predictions, leading to spurious activations and overfitting to noisy labels. To address this limitation, we propose a novel learning framework, named Complementary Label Exchange Learning (CLEX), to enhance robustness by exchanging knowledge from non-target predictions across augmented views. Specifically, CLEX comprises three synergistic components. First, Stochastic Non-Target Logit Exchange randomly swaps a subset of non-target logits between original and augmented views to couple error-prone predictions, creating robust consistency constraints. Second, Scale-Invariant Logit Normalization eliminates magnitude artifacts through Lp-norm normalization, ensuring that regularization operates over geometrically meaningful directions rather than being dominated by arbitrary scales. Third, Complementary Suppression Loss selectively penalizes spurious activations over a randomly retained subset of non-target classes, avoiding the uniform shrinkage that hampers discriminative learning. To further stabilize training, we incorporate attention consistency regularization that enforces spatial alignment between augmented views, while retaining auxiliary cross-entropy to preserve semantic localization capability. Extensive experiments across multiple benchmark FER datasets (RAF-DB, FERPlus, and AffectNet) demonstrate that CLEX consistently outperforms existing robust FER learning approaches.

Abstract:
Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

Abstract:
Diffusion models have recently achieved remarkable success in real-world image super-resolution (ISR), typically balancing a trade-off between fidelity (i.e., similarity to HR images) and realism (i.e., perceptual naturalness). To better account for subjective preferences in image quality, controllable diffusion-based methods have been explored, allowing personalized adjustment of this trade-off via tunable parameters. While existing controllable methods have shown effective control, they typically operate in the latent space and require repeated network inference during adjustment, eventually limiting their practicality. In this paper, we propose IFCSR, a simple yet practical approach for one-step diffusion-based real-world ISR that enables inference-free control between fidelity and realism. The key idea is to design a controllable model that adjusts the fidelity-realism trade-off in the image space, rather than in the latent space. Such an image-space control allows users to seamlessly adjust the trade-off without extra inference after an initial inference of fidelity- and realism-specific images. We further introduce a two-stage training scheme and specialized losses that encourage the controllable space to span a broad spectrum of fidelity and realism. Our method achieves quality competitive with state-of-the-art models while providing a practical advantage through inference-free control.

Abstract:
Visual anomaly detection is vital for quality control applications by identifying deviations from normal patterns. Previous structural or logical anomaly detection methods mainly focus on pixel-level deviations like texture defects and reconstruction errors, ignoring the object-level structural and contextual inconsistencies. These overlooked layout anomalies remain critical yet underexplored, e.g., factually defective hallucinations appeared in generative text-to-image models. Based on the above observation, in this paper, we introduce scene layout anomaly detection, a new task that predicts an object-level anomaly map from the input image to reveal the semantic plausibility and geometric consistency of each object in the scene. Specifically, we propose LayoutAD, an unsupervised learning framework that constructs semantic and geometric graphs to jointly reason over semantic-geometric misalignment among objects. Under this formulation, we are able to detect diverse layout deviations, including object attribute implausibilities and relationship mismatches. Extensive experiments show that LayoutAD outperforms baselines qualitatively and quantitatively across various scenarios, benefiting scene understanding and generation applications like video anomaly detection and self-corrected image generation.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency.To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.

Abstract:
We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) poses a critical challenge in machine learning, requiring models to continuously integrate novel classes with limited samples while preserving knowledge of previously seen classes. While existing FSCIL approaches have demonstrated promising results, they still suffer from catastrophic forgetting and few-shot overfitting due to the challenge of balancing old knowledge retention with new knowledge acquisition. To address these challenges, we propose an innovative Semantic-Guided Global-Local Collaborative Prompt Learning (SGLC) framework. Built upon powerful pre-trained Vision-Language Models (VLMs), the framework first introduces a dual-alignment mechanism: globally aligning visual features with visual-textual prototypes and locally aligning multi-view visual features with local textual attribute features, which facilitates effective knowledge learning while preserving existing knowledge via frozen prototypes of previous classes. Furthermore, to alleviate overfitting, we incorporate Large Language Models (LLMs) to generate semantically rich textual descriptions, which simultaneously guide both global and local prompt learning through knowledge distillation. Extensive experiments on the miniImageNet, CIFAR-100, and CUB200 datasets demonstrate that SGLC performs favorably against the state-of-the-art methods.

Abstract:
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves > 5xfaster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparVAR can reduce the generation time of an 8B model producing 1024x1024 high-resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a 1.57xspeed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a 2.28xacceleration, while maintaining competitive visual generation quality.

Abstract:
Discrete Diffusion Models (DDMs) have shown great potential in image generation, especially when equipped with reinforcement learning (RL) techniques.However, a fundamental yet overlooked limitation is revealed in our experiments: severe imbalance of credit assignment across timesteps during training. As a result, early generation timesteps, which carry higher exploration potential and determine the global structure, provide a smaller contribution to policy optimization.To conquer this, we propose a simple-yet-effective approach, to Re-balance timestep credit of Reinforcement Learning (RebRL) for better exploration-exploitation trade-off and more efficient training of DDMs.RebRL is plug-and-play\textemdash simply replacing uniform temporal policy with strategic rebalancing along masking stages. RebRL is analytically plausible\textemdashderivation and analysis show that it enjoys a uniform token-level policy gradient, which benefits policy optimization.Experiments on text-to-image generation benchmarks show that RebRL achieves state-of-the-art performance on GenEval and improves human preference score by up to 3.40 while effectively reducing training steps by ~40%.

Abstract:
Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.

Abstract:
Recent unified multimodal models (UMMs) have achieved remarkable progress in visual comprehension and generation. However, existing datasets and benchmarks focus predominantly on single-turn interactions, overlooking the multi-turn, context-dependent nature of real-world image creation and editing. To bridge this gap, we introduce W E A V E, the first comprehensive suite for in-context interleaved cross-modality comprehension and generation, comprising two complementary components. W E A V E-100k is a large-scale dataset of 100,000 interleaved samples spanning over 370,000 dialogue turns and 500,000 images, encompassing comprehension, editing, and generation tasks that demand reasoning over prior context. W E A V E-Bench is a human-annotated benchmark of 100 items with 480 images, equipped with a hybrid VLM judge evaluation framework that jointly leverages reference images and original-image-instruction pairs to assess multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments show that training on W E A V E-100k substantially improves vision comprehension, image editing, and comprehension-generation collaboration, while further enabling the emergence of visual-memory capabilities in UMMs. Extensive evaluations on W E A V E-Bench reveal persistent limitations of current approaches in multi-turn, context-aware image generation and editing. We hope W E A V E provides both a perspective and a foundation for advancing in-context interleaved comprehension and generation in the multimodal community.

Abstract:
Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilised. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts.Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuned; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.

Abstract:
Event cameras hold excellent dynamic properties, showing great potential for monocular depth estimation (MDE). However, existing methods mainly improve performance by optimizing contextual features, but still struggle with the ill-posed and nonlinear nature of direct full-depth regression. In this paper, we propose HypoDepth, the first event-image monocular depth iterative refinement framework. By introducing a discrete Depth Hypothesis Volume (DHV), we transform the depth regression problem into a constrained depth search task. Specifically, we construct a 3D cost volume between the DHV features and contextual features and perform a multi-scale correlation search to guide stable residual optimization. This lightweight cost volume enables efficient global-to-local refinement across multi-resolution. Our method outperforms existing approaches on DSEC and MVSEC with state-of-the-art results and strong zero-shot generalization. Meanwhile, our tiny model achieves an excellent balance between accuracy and efficiency, enabling real-time performance on resource-limited devices.

Abstract:
Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts.

Abstract:
All-in-one Medical Image Restoration (MedIR) models offer a promising path towards generalized medical imaging intelligence but face two critical spatiotemporal challenges: 1) Spatial modality interference, where conflicting gradients from diverse modalities (e.g., MRI, CT, PET) degrade performance; and 2) a temporal static-world assumption that ignores the continual data streams in real-world clinical settings, leading to catastrophic forgetting. To address this dual challenge, we propose a novel lifelong learning framework (called Resilient On-the-fly Medical Enhancement, ROME) governed by a "disentangle-optimize-consolidate" paradigm. ROME first resolves the foundational modality conflict via the Modality-Invariant Disentanglement via Adversarial Balancing (MIDAB) module. It establishes a strategic "adversarial balance" between a "content preservation force" and a "modality erasure force" to optimize a disentangled, unified feature manifold. Building on this stable foundation, the Adaptive Feature Consolidation (AFC) module combats forgetting. AFC dynamically locates an optimal feature consolidation point via a prediction network, enforced by a novel diversity loss to ensure robust continuous learning. Experiments demonstrate that ROME not only achieves SOTA performance in static settings but also exhibits superior resilience in rigorous domain-incremental benchmarks, reducing the average catastrophic performance degradation by over 10%.

Abstract:
Generating high-quality 360deg panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting <= 1K resolution native generation and relying on suboptimal post-hoc super-resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360deg videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing peak memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360deg video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios.

Abstract:
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness where it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective--our Streaming Order Perception enhancement--that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into exsiting Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware stream Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available.

Abstract:
Multimodal Web Search Agents demonstrate a practically valuable capability by fusing information from diverse modalities (e.g., text and vision), retrieved iteratively from the internet, to address complex user queries. However, the visual modality is prone to information overload, and the noise contained within it--such as irrelevant background details or complex structures--can disrupt the model's attention, misdirecting its operational focus toward an erroneous path. To address the aforementioned challenge, we propose ReFAct (Reasoning, Focusing, and Acting), a novel framework that empowers the agent to actively manage its cross-modal context. This allows the agent to adjust its operational focus, thereby mitigating the impact of noise on multimodal Web Search Agents. Specifically, ReFAct employs a Grounding tool for active visual perception to dynamically filter information. We also design external memory-based Defocus/Refocus operations for selective information retention, further modulating information density within the multimodal context. Ultimately, this ensures the agent maintains focus during problem-solving. To evaluate and enhance agent capabilities in complex and noisy multimodal contexts, we first propose a pipeline for constructing datasets with flexible complexity. We introduce a new open-source benchmark: GroundedVQA. Finally, we experimentally demonstrate the effectiveness of our proposed method on GroundedVQA and other widely-used benchmarks.

Abstract:
Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitive's motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.

Abstract:
Recent advances in CLIP-based continual learning have shown the potential of leveraging pre-trained vision-language models for sequential tasks. However, existing methods overlook a key problem we call Asymmetric Drift. In unimodal CLIP-based continual learning, the visual branch undergoes stronger adaptation because the visual distribution shifts significantly, whereas the text branch remains relatively stable due to the low variance of textual prompts. This imbalance increases the modality distance and degrades cross-modal alignment over time. To address this issue, we propose CCA-CL, a framework that accumulates visual-textual covariance statistics across tasks and solves Canonical Correlation Analysis to compute a shared subspace. In this subspace, the distance between visual and textual features is minimized, enabling better alignment without modifying CLIP parameters. This also makes our method naturally compatible with exemplar-free CL settings. To further capture nonlinear relationships that linear Canonical Correlation Analysis is hard to model, we introduce Random Fourier Projection as an extension. Experimental results demonstrate that CCA-CL effectively mitigates the asymmetric drift problem and achieves state-of-the-art performance on several benchmarks.

Abstract:
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model trained on SpaceMind-700K for spatial reasoning from RGB inputs. The model adopts a dual-encoder architecture integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

Abstract:
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (about 500ms), achieving 6.8x speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.

Abstract:
Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortions and texture blurriness, due to the lack of 3D awareness within the generated frames. To address this, we introduce a novel 4D remeshing module that explicitly disentangles the learning of the globally shared canonical mesh and transient variations by tracking per-vertex deformations under different viewpoints. The topological consistency of the deformed meshes inherently enables the optimization of a unified UV representation that effectively integrates appearance attributes across frames. Both qualitative and quantitative experimental results demonstrate the superiority of our method over prior works in terms of appearance realism, geometric fineness, and generalization diversity. We also showcase the applicability of our reconstructed avatars for downstream applications including animation and 3D editing.

Abstract:
Despite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidelity. As a result, our method provides continuous and alias-free rendering across resolutions while maintaining practical scalability and memory efficiency. Extensive experiments across diverse resolution ranges demonstrate that our approach achieves an optimal balance between fidelity and memory, enabling practical arbitrary-resolution view synthesis even in resource-constrained settings.

Abstract:
While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing researches. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28xspeedup, 76.6% memory peak reduction and 80.9% global memory requests reduction over natvie ConvGRUs under 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320x640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative based models, and our generalization ability and accuracy far exceeds that of existing real-time methods.

Abstract:
Direct Preference Optimization has achieved remarkable success in aligning diffusion models with human feedback. However, existing methods heavily rely on image-level preferences, which suffer from sparse rewards in the spatial dimension. This creates a fundamental misalignment: while an image may be globally preferred, it can contain locally inferior instances. Applying the same positive preference to these areas thus unfairly credits distracting regions while penalizing informative ones, leading to suboptimal performance and inefficient learning. To resolve this issue, we propose IAPO, an Instance-Aware Preference Optimization that introduces instance-level credit assignment to advance alignment from image-level to instance-level. We first construct a high-quality instance-level preference dataset by automatically identifying and relabeling corresponding instances in image pairs using vision-language models and object detection models. Leveraging this fine-grained dataset, we design a novel instance alignment loss with a dynamic reweighting mask that modulates instance-level loss within annotated bounding boxes, suppressing distractors to enforce fine-grained human preference alignment. Extensive experiments demonstrate that our method not only achieves state-of-the-art performance in multiple benchmarks but also attains higher training efficiency due to fine-grained instance-level preference labels.

Abstract:
Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse set of key views as a lightweight motion cue to guide per-Gaussian motion toward the scene's true dynamics. To compensate for the sparsity and view-dependence of flow, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

Abstract:
We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter both the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This simple yet crucial strategy enables the model to learn temporal control, directly producing the observed space-time disentanglement effects.To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic Space and Time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior arts.

Abstract:
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception.In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights.

Abstract:
In this paper we focus on the viewing graph, which is used to represent cameras (as nodes) and their pairwise relationships (as edges) in the context of Structure from Motion. By analyzing this graph, it is possible to establish if the available pairwise relationships (e.g., fundamental matrices in the uncalibrated case) are theoretically enough to uniquely determine the cameras, in which case the graph is termed "solvable". Previous results considered calibrated and uncalibrated settings, whereas other camera models have not been explored in the context of viewing graph solvability: this work represents the first study under the affine camera model. We provide a characterization of the problem in terms of a linear system, from which we derive a practical method to check affine solvability. We complement this by some theoretical results providing sufficient/necessary conditions for affine solvability, in order to give further insights on the problem. Thanks to our experiments, we analyze synthetic graphs and real graphs coming from structure-from-motion datasets, where we focus on understanding the differences among different camera models (calibrated, uncalibrated and affine) in terms of solvability. In this context, we also raise an open research question and conjecture a possible answer, which is supported by empirical evidence.

Abstract:
We present MotionCrafter, a framework that leverages video generators to jointly reconstruct 4D geometry and estimate dense motion from a monocular video. The key idea is a joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a 4D VAE tailored to learn this representation effectively. Unlike prior work that strictly aligns 3D values and latents with RGB VAE latents--despite their fundamentally different distributions--we show that such alignment is unnecessary and can hurt performance. Instead, we propose a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments on multiple datasets show that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization.

Abstract:
Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject appearance and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.

Abstract:
Thinking with images has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning--high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates.To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning.

Abstract:
We study zero-shot 3D alignment of two given meshes from a short text prompt describing their spatial relation---an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time---updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer---without training a new model.Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage controlled surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, camera control concentrates views on the interaction region, and randomized restarts improve robustness.To enable evaluation, we curate a benchmark of 50 \ mesh pair, prompt\ cases spanning diverse categories and relations, and compare against baselines. Across the benchmark, our method yields semantically faithful and physically plausible alignments, improving CLIP similarity while reducing intersection volume.

Abstract:
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code are available on the project website.

Abstract:
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256x256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.

Abstract:
Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking.

Abstract:
Machine Unlearning (MU) focuses on removing the influence of training samples from pre-trained models without retraining the model entirely. Existing MU methods have made several efforts to enable complete forgetting while preserving the model's performance on remaining data. However, they typically apply equal weights across different data, overlooking the ambiguous decision boundaries between similar samples or approximate classes. This leads to unnecessary consumption of shallowly memorized samples and significant performance degradation for approximate retention classes. Additionally, the inherent inconsistency between forgetting and retention objectives results in gradient conflict and domination problems during training, hindering model convergence and degrading overall performance. To address these, we introduce a novel adaptive gradient reweighting that assigns importance weights to individual forget samples or vulnerable retention classes, thereby enabling more efficient unlearning and preserving the performance of approximate classes. Subsequently, we propose a multi-stage objective optimization strategy, which comprises three optimization stages: Direction Rectification, Temporal Stabilization, and Adaptive Objective Combination. This strategy rectifies the direction of conflicting gradients and prevents one task (forgetting or retention) from dominating the model update. Comprehensive analyses and extensive experiments on multiple public datasets demonstrate that our method achieves considerable performance improvements in various tasks and scenarios.

Abstract:
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

Abstract:
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality (preferred 91% of the time in human evaluations vs. a comparable text-to-video model) and controllability (19 point improvement in fine-grained instruction following accuracy vs. a "think-then-act" approach). TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using VLMs. Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

Abstract:
Person De-Reidentification (De-ReID) is an emerging and safety-critical task that aims to selectively forget specific individuals in surveillance systems while preserving the recognition capability for others. Existing methods typically learn both forgetting and retaining objectives within a unified feature space, which leads to conflicting optimization goals and may cause unexpected performance degradation on novel or retained identities. We provide a new perspective to handle De-ReID through feature space decoupling. Although it is a promising solution, discriminating which feature space should be used for the given novel query remain unsolved. To alleviate these challenges, we propose SANER, advancing De-ReID with a Switchable Adapter (SA) and a test-time Non-parametric Enhanced Routing (NER) algorithm. SA decouples the pre-trained feature space into two task-specific subspaces with a forgetting adapter and a retaining adapter. The former suppresses identity-specific semantics for de-identification, while the latter preserves discriminative cues for accurate re-ID. In addition, SA is further enhanced with NER to adaptively analyze optimal feature space routing for the given query at test-time by comparing the query with pre-computed prototypes in the original feature space. Extensive experiments on multiple De-ReID benchmarks demonstrate the effectiveness of SANER, achieving new state-of-the-art De-ReID performance.

Abstract:
Extreme low-light Raw image restoration remains challenging due to overwhelming noise and severe detail loss.In this paper, we exploit the potential of the dual-exposure setting for this severely ill-posed problem.Existing methods suffer from unreliable cross-exposure alignment, resulting in degraded detail recovery and compromised color fidelity. To address these challenges, we propose RawMetaDiff, a novel generative diffusion framework that restores a high-fidelity Raw image from a short-exposure input, conditioned on a potentially misaligned long-exposure reference under the guidance of Raw metadata.At its core, we propsed two complementary mechanisms: the Meta-Assistant Color Transfer (MACT) enforces color consistency by aligning global color statistics along the channel dimension,while the Meta-Normed Cross Attention (MNCA) leverages Raw metadata to establish robust cross-exposure spatial correspondences and inject shadow details.To support robust diffusion training, we first collect a 1K real-world, dual-exposure Raw dataset, namely DERaw, and then design a realistic degradation model to synthesize data that closely approximates real-world conditions.Extensive experiments on both synthetic and real-world datasets demonstrate that RawMetaDiff significantly outperforms existing methods, justifying an effective new solution for extreme low-light Raw image restoration from the generative perspective.

Abstract:
In this paper, we study unsupervised multi-modal anomaly detection under challenging few-shot conditions, where only a few normal training samples are available for each class. To prevent the trivial reconstruction of anomalies, recent methods often rely on discrete prototypes to establish an information bottleneck. However, such discrete prototypes fail to capture continuous and anisotropic variations of normal features. We therefore propose GPFlow, a probability-flow-inspired framework that models normality using learnable Gaussian prototypes. The key component of GPFlow is an analytical Posterior-Mean Path (PMP) router, which reconstructs features through the posterior mean of a noise-smoothed Gaussian mixture. This yields anisotropic shrinkage as a covariance-aware information bottleneck: PMP selectively preserves normal variations aligned with the covariance structure of Gaussian prototypes while strictly suppressing deviations inconsistent with the prototypes. To exploit complementary knowledge across modalities, GPFlow further combines intra-modal and cross-modal reconstruction, and applies a lightweight instance-aware prior calibration to alleviate the distribution mismatch between sparse training data and diverse test samples. Experiments on MVTec-3D-AD and Eyecandies show that GPFlow achieves significant performance improvement with only a few normal training samples while remaining computationally efficient.

Abstract:
We introduce a new template-free tracking paradigm based solely on natural language, capable of tracking an arbitrary object and seamlessly switching to a new target without box initialization.Our key idea is to localize an object via vision-language (VL) correlation.However, using the correlation alone is brittle under large search regions due to spatial uncertainty and ambiguous VL saliency. To resolve these, we propose MVLM, a memory-based vision-language margin confidence that integrates vision-language correlation, encoder prediction, and temporal memory.MVLM dynamically gates the search region--switching between compact ROI (Region of Interest) search and global re-localization--to reduce spatial uncertainty. Theoretically, we derive bounds that connect the MVLM score to tracking probability, characterizing mis-localization within ROI and ROI-exclusion probabilities.Through extensive evaluation, we validate our theorems and achieve state-of-the-art performance on several benchmarks (TNL2K, LaSOT, OTB99 and MGIT) using only language guidance.

Abstract:
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery--the internal construction and manipulation of visual cues--we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed as "Mirage", which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. Begin by supervising the latent tokens through distillation from ground-truth image embeddings, we then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that \Model unlocks stronger multimodal reasoning without explicit image generation.

Abstract:
Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.

Abstract:
Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. While Multimodal Large Language Models have shown promise, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness.To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement (FBR) annotation pipeline for high-quality labels and DeepSTG, a systematic evaluation framework quantifying dataset quality beyond superficial statistics. Evaluations reveal performance average drops of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.

Abstract:
Traditional classifiers treat all class labels as mutually independent, thereby considering all negative classes to be equally incorrect. This approach fails severely in many real-world scenarios, where a known semantic hierarchy defines a partial order of preferences over negative classes. While hierarchy-aware feature representations have shown promise in mitigating this problem, their performance is typically assessed using metrics like Mistake Severity (MS) and Average Hierarchical Distance (AHD). In this paper, we highlight important shortcomings in existing hierarchical evaluation metrics, demonstrating that they are often incapable of measuring true hierarchical performance. Our analysis reveals that existing methods learn sub-optimal hierarchical representations, despite competitive MS and AHD scores. To counter these issues, we introduce Hierarchical Composition of Orthogonal Subspaces (Hier-COS), a novel framework for unified 'hierarchy-aware fine-grained' and 'hierarchical multi-label' classification. We show that Hier-COS is theoretically guaranteed to be consistent with the given hierarchy tree. Furthermore, our framework implicitly adapts the learning capacity for different classes based on their position within the hierarchy tree -- a vital property absent in existing methods. Finally, to address the limitations of evaluation metrics, we propose Hierarchically Ordered Preference Score (HOPS), a ranking-based metric that demonstrably overcomes the deficiencies of current evaluation standards. We benchmark Hier-COS on four challenging datasets, including the deep and imbalanced tieredImageNet-H (12-level) and iNaturalist-19 (7-level). Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art performance across all hierarchical metrics for every dataset, while simultaneously beating the top-1 accuracy in all but one case. Lastly, we show that Hier-COS can effectively learn to transform the frozen features extracted from a pretrained backbone (ViT) to be hierarchy-aware, yielding substantial benefits for hierarchical classification performance.

Abstract:
Accurate depth estimation is crucial for 3D reconstruction and precise navigation in posterior segment ophthalmic surgery. However, acquiring annotated data remains challenging due to the impracticality of depth sensors under surgical microscopes. To overcome this limitation, we introduce RetinalDepth, a novel synthetic dataset comprising 44,800 temporal stereo image pairs across 896 diverse scenes for posterior segment surgery, developed via a Real2Sim2Real pipeline. In the Real-to-Sim phase, we model anatomically accurate eyes and instruments in Blender, integrating ultra-wide-field retinal textures, refractive aqueous humor, and dynamic trajectories, enhanced by post-processing for realism. RetinalDepth provides synchronized left and right RGB frames with pixel-perfect depth maps, surface normals, instrument segmentation masks, and camera parameters, enabling robust training of monocular, stereo, and video depth estimation models. In the Sim-to-Real phase, we introduce temporal depth variance, a novel metric for quantifying the stability of frame-to-frame depth estimation. Fine-tuning on RetinalDepth significantly boosts model performance on both synthetic and real surgical videos, enhancing generalization, surgical precision, and novice training. As the first synthetic benchmark for posterior segment surgery, RetinalDepth bridges the sim-to-real gap in ophthalmic computer vision.

Abstract:
Continual Test-Time Adaptation (CTTA) for semantic segmentation is vital for deploying vision models in dynamic environments with persistent domain shifts. Existing methods often degrade over time as self-supervised updates amplify early prediction errors. We attribute this fragility to a geometric limitation: Euclidean feature spaces, with polynomial volume growth, lead to distorted semantic representations and crowded, unstable decision boundaries. We propose HyperProtoSeg, a hyperbolic prototypical segmentation network that learns geometrically optimal class prototypes in the Poincare ball. Leveraging the exponential expansion of hyperbolic space, it enforces large and uniform inter-class margins with low distortion, yielding well separated and curvature-stable embeddings. For robust online adaptation, we introduce Hyperbolic Boundary Consistency Adaptation (HBCA), which partitions pixels by cross-view consistency into confident "core" and uncertain "boundary" sets. HBCA applies geodesic distance minimization for confident regions and a novel Hyperbolic Directional Consistency Loss for uncertain ones, preventing error amplification. Experiments on challenging synthetic- to-real benchmarks (Cityscapes - ACDC, IDD - IDD-AW, SHIFT) show that HyperProtoSeg + HBCA achieves an average improvement of (1.94%,4.02%,1.24%) over state-of-the-art CTTA methods under severe structural shifts.

Abstract:
Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that \model is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.

Abstract:
Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on `3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation. See szymanowiczs.github.io/lagernvs for code, models, and examples.

Abstract:
The core challenge in Text-based Person Retrieval (TPR) lies in establishing fine-grained, many-to-many semantic alignment between textual words and visual regions. Existing methods predominantly rely on pointwise similarity or attention mechanisms, implicitly assuming matches are independent and balanced. Consequently, under conditions of attribute overlap and substantial background noise, these methods often misallocate matching weights to non-discriminative regions or words, resulting in ambiguous matching outcomes. To address this, we propose QC-Align, a quota-calibrated fine-grained alignment framework guided by context-aware marginals. Specifically, we propose a Context-Aware Marginal Estimator (CAME) that dynamically assigns "matching quotas" to each word and visual region, and subsequently employs a Quota-Calibrated Transport (QCT) objective to explicitly constrain the matching quality each word and region can carry, thereby jointly optimizing the many-to-many correspondence between text and vision under these constraints. Notably, QC-Align is a parameter-free, plug-and-play training regularizer that requires no fine-grained annotations and incurs no inference overhead. Experiments on multiple mainstream person retrieval benchmarks demonstrate that QC-Align consistently improves baseline model performance, with greater gains and better interpretability in few-shot and cross-domain scenarios.

Abstract:
Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part^ 2 GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part^ 2 GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part^ 2 GS consistently outperforms state-of-the-art methods by up to 10xin Chamfer Distance for movable parts.

Abstract:
Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes --- a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.

Abstract:
Backdoor attacks pose potential threats to object detection models, highlighting the importance of studying their security. However, existing backdoor attacks mainly rely on trigger-specific intrinsic features, which limits their practicality in real-world scenarios. In this paper, we propose a novel backdoor attack that leverages dynamic object interactions in realistic scenarios to activate malicious behavior. By hijacking the Non-Maximum Suppression (NMS) process in object detectors, this attack demonstrates robust effectiveness, including misclassification, mislocalization, and object appearance/disappearance, while maintaining the model's normal performance on clean inputs. Experimental results demonstrate that our attack exhibits significant attack performance across various object detectors and datasets, and remains effective both in physical environments and under existing defense mechanisms. These findings highlight the urgent need to develop efficient and robust defense strategies against backdoor attacks.

Abstract:
In this paper, we propose a test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement without relying on gradient-based updates or additional training for LLM agent operating under partial observability. Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between posterior belief and ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines such as prompt-augmented or retrieval-enhanced LLMs, in aligning with latent world states with significantly lower integration overhead.

Abstract:
Understanding and reasoning over structured knowledge is a fundamental capability for intelligent systems. While Large Language Models (LLMs) have leveraged textual knowledge graphs for relational reasoning, linearizing graph structures into text often leads to token inefficiency and loss of higher-order relational cues. Inspired by the advances of Large Multimodal Model to capture higher-order relational structures explicitly novel paradigm of visualized knowledge representation, where knowledge graphs are transformed into graphical visualizations that LMMs can directly perceive and reason over. To systematically evaluate this capability, we introduce VKG-QA, a benchmark for Visual Knowledge Graph-based Question Answering, covering three major categories and fourteen subtasks. VKG-QA is constructed via a semi-automatic pipeline ensuring high-quality, semantically aligned, and visually clear data. We evaluate 19 representative LMMs on VKG-QA and perform extensive quantitative and qualitative analyses. Results reveal that current models struggle with visualized relational understanding, graph-specific comprehension remains challenging, and closed-source models significantly outperform open-source counterparts. VKG-QA thus highlights critical limitations in current LMMs and provides a scalable platform for advancing graph-aware visual reasoning.

Abstract:
Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity, or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity preserving clustering method to query vectors for higher compression, and designs a euclidean-similarity preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 via one A40 GPU demonstrate up to 1.67x-4.31x speedup with negligible quality degradation.

Abstract:
Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers--leading to temporal artifacts--or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the standard VAE in video diffusion backbone with a highly efficient LeanVAE, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100x speedups over iterative video diffusion priors. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.

Abstract:
Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

Abstract:
Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only marginal guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose Conditional Factuality Control (CFC), a post-hoc conformal framework that returns set-valued outputs with conditional coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent "success" score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its efficiency, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most a slack. Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.

Abstract:
Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22% of the Gaussians and maintain reasonable rendering quality with only 1.5% of the Gaussians.

Abstract:
The emergence of Large Language Models (LLMs) has driven rapid progress in multi-modal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms place excessive reliance on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. Moreover, we also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not be able to capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.

Abstract:
Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model's ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.

Abstract:
Hand-object interaction (HOI) involves dynamics where human manipulations produce spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus on either manipulation or effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture HOI dynamics. We introduce HanDyVQA, a fine-grained video QA benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totaling 11.1K multiple-choice QA pairs. Collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best, Gemini-2.5-Pro, achieved only 73% accuracy, well below human performance (97%). Further analysis shows the remaining challenges in spatial relationships, motion, and part-level geometric understanding. We also found that incorporating explicit HOI cues into visual features improves performance, providing insights for future HOI-aware models.

Abstract:
Multi-stage generative models have shown great promise in 3D content creation due to focused generation of structure or texture in different stages, but their outputs often fail to align with human preferences. The key bottleneck to apply alignment methods is the presence of non-differentiable operations between generative stages.This disconnection stops preference signals applied to the final output from being backpropagated to the crucial, early stages of generation, while simple separated stage-wise alignment leads to texture-geometry inconsistency.To address this challenge, we introduce Circular-DPO, which builds a preference feedback loop to align multi-stage 3D generation models to human preference.Our method first applies Direct Preference Optimization (DPO) to refine the final 3D asset.We then construct new preference pairs by sampling and decoding the assets generated by the optimized model.These newly-formed pairs are used to train the preceding generative stage, effectively creating a feedback loop that bridges the non-differentiable gap. Furthermore, to enhance robustness against noisy data, we introduce a quality-aware weighting mechanism that prioritizes reliable preference pairs during training. Experiments demonstrate that our approach improves the alignment of generated 3D content with human preferences by enabling holistic, multi-stage optimization.

Abstract:
Current AI weather forecasting models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety. We introduce AviaSafe, a hierarchical, physics-informed neural forecaster that produces global, six-hourly predictions of these four hydrometeor species for lead times up to 7 days. Our approach addresses the unique challenges of cloud prediction: extreme sparsity, discontinuous distributions, and complex microphysical interactions between species. We integrate the Icing Condition (IC) index from aviation meteorology as a physics-based constraint that identifies regions where supercooled water fuels explosive ice crystal growth. The model employs a hierarchical architecture that first predicts cloud spatial distribution through masked attention, then quantifies species concentrations within identified regions. Training on ERA5 reanalysis data, our model achieves lower RMSE for cloud species compared to baseline and outperforms operational numerical models on certain key variables at 7-day lead times.The ability to forecast individual cloud species enables new applications in aviation route optimization where distinguishing between ice and liquid water determines engine icing risk.

Abstract:
Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning has been emerged as a promising solution, which leverages class-wise feature representations. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a single global prototype per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature-prototype alignment is domain-agnostic, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office-10, and PACS to demonstrate the effectiveness of our proposed framework to address the domain shift challenges.

Abstract:
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces <1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models.

Abstract:
3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS\textasciicircum2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS\textasciicircum2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.

Abstract:
Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.

Abstract:
Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches in both synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: https://tmcDeblur.github.io/

Abstract:
Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR.

Abstract:
Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0^ \circ -360^ \circ ) RGB-T attacks, we construct 3D RGB-T models for human and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.

Abstract:
Pose stylization, which aims to synthesize stylized content aligning with target poses, serves as a fundamental task across 2D, 3D, and video domains. In the 3D realm, prevailing approaches typically rely on a cascade pipeline: first manipulating the image pose via 2D foundation models and subsequently lifting it into 3D representations. However, this paradigm limits the precision and diversity of the 3d pose stylization. To this end, we propose a novel paradigm for 3D pose stylization that unifies pose stylization and 3D generation within a cohesive framework. This integration minimizes the risk of cumulative errors and enhances the model's efficiency and effectiveness. In addition, diverging from previous works that typically utilize 2D skeleton images as guidance, we directly utilize the 3D skeleton because it can provide a more accurate representation of 3D spatial and topological relationships, which significantly enhances the model's capacity to achieve richer and more precise pose stylization. Moreover, we develop a scalable data engine to construct a large-scale dataset of "Image-Skeleton-Mesh" triplets, enabling the model to jointly learn identity preservation and geometric alignment. Extensive experiments demonstrate that PoseMaster significantly outperforms state-of-the-art methods in both qualitative and quantitative metrics. Owing to the strict spatial alignment between the generated 3D meshes and the conditioning skeletons, PoseMaster enables the direct creation of animatable assets when coupled with automated skinning models, highlighting its compelling potential for automated character rigging.

Abstract:
Existing dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically realistic rendering, especially near contacts, and substantially reduces motion-boundary artifacts.

Abstract:
While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to 11.8 times while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method.

Abstract:
Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

Abstract:
In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) is highly dependent on chunking quality. However, existing text-centric chunkers overlook the visual and structural cues present in real-world documents, leading to redundant or ambiguous chunks that impair retrieval and answer accuracy. To address this problem, we propose \ours which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) Multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy. \ours achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5--39.6%, retrieval nDCG by +1.1--15.3%, and QA ANLS by +4.5--15.3%. These results demonstrate that modeling document-level dependencies with Multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.

Abstract:
The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems.

Abstract:
Black-Box Domain Adaptation (BBDA) is a highly practical yet challenging strategy that enables the deployment of pre-trained detectors to new unlabeled target domains without accessing source data or models. Compared to previous domain adaptation studies, BBDA not only provides stronger data privacy protection but also offers greater portability. Despite growing interest, existing BBDA strategies remain difficult to apply directly to object detection, as most prior works focus on classification and segmentation tasks that do not involve bounding box localization and rely on different learning mechanisms. In this paper, inspired by lifelong learning, we propose Retention-Driven Knowledge Compression (RDKC), which applies a brain-inspired continual learning process to BBDA for object detection. Specifically, RDKC consists of two key components: Memory Retention (MR) and Scene Compression (SC). MR is designed specifically for object detection under the BBDA setting, where it performs memorized contrastive learning on partitioned regions to better utilize informative cues from reliable areas while filtering out potential noise from noise prediction labels. SC introduces a contrastive mechanism between near- and far-view regions, which enables the model to better learn from far-view regions under the guidance of near-view cues. Experimental results demonstrate that under the BBDA setting, RDKC outperforms previous SOTA methods across all evaluated benchmarks, achieving superior performance improvements.

Abstract:
We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton--a fundamental representation of biological forms--to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing.

Abstract:
Large Vision Language Models (LVLMs) exhibit strong perceptual and linguistic capabilities yet struggle with complex visual reasoning tasks that require structured, compositional, and adaptive inference. Existing approaches either rely on costly inference-time exploration--such as multi-path or tree-based Chain-of-Thought (CoT) search--or on expensive post-training with large curated CoT datasets. We propose ReaGEN, a lightweight framework for the adaptive generation of structured reasoning chains that enhances reasoning without modifying the underlying vision-language model (VLM). ReaGEN first employs a teacher-guided evolutionary search to collect sample specific CoT structure, leveraging attention-derived stage importance to capture how information flows across reasoning stages. These adaptive CoT structures are then used to train a compact generator (GEN) that learns to refine and improve CoT structures by reflecting on attention feedback from the reasoning process. At inference, the GEN dynamically produces question-adaptive structured CoTs, and can be iteratively invoked to refine them based on the VLM's internal state--achieving the flexibility of deep search with single-path efficiency. Across diverse multimodal reasoning benchmarks, ReaGEN achieves up to +26 accuracy points over test-time scaling methods while reducing the average inference-time token usage by 79%, making it a scalable approach for structured reasoning generation in self-hosted or open-source VLMs with access to internal attention signals.

Abstract:
Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) photos, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights.On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. The code will be made publicly available upon acceptance.

Abstract:
Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.

Abstract:
Flow-based Transformer models have achieved state-of-the-art image generation performance, but often suffer from high inference latency and computational cost due to their large parameter sizes. To improve inference efficiency without compromising quality, we propose Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural demensions. We divide the rectified flow into different stages according to resolution, and use a BridgeFlow module to connect them. Fewer Transformer layers are used at low-resolution stages to generate image layouts and concept contours, and more layers are progressively added as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce Bridged Progressive Rectified Flow Transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 64% for generating 1024x1024 resolution images; (3) We propose a BridgeFlow module to align flows between different stages; (4) We propose the NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and comprehensively assess model effectiveness. The results show that our model is competitive with state-of-the-art models.

Abstract:
MeanFlow is a powerful few-step generative framework that can be trained from scratch, but its performance degrades significantly when the one-step loss uses a large portion of training data. This stems from a temporal scale imbalance: gradients from different stages of generation contribute unevenly, leading to unstable optimization--evident in blurry samples and high FID scores. The core issue is a conflict between two opposing forces: terms that amplify variance over long time spans and strong constraints needed near the start of generation, which a fixed sampling strategy cannot reconcile. To resolve this, we propose Temporal Equilibrium MeanFlow (TEMF), which balances these competing demands through two simple yet effective components: (1) a temporal equilibrium weighting function that equalizes gradient influence across all time scales, and (2) a dynamic boundary scheduler that gradually shifts training focus--from stabilizing early steps to refining the full trajectory as training progresses. Without changing the model architecture, TEMF retains true one-step generation with classifier-free guidance, achieving a state-of-the-art FID of 2.62 on ImageNet 256x256--achieving the best results among diffusion- and flow-based one-step methods.

Abstract:
Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.

Abstract:
Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate large-scale context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis. Code is available at github.com/weigertlab/muvit.

Abstract:
Generative pose priors have recently emerged as a powerful tool for inference under occlusion or noise. Yet today's strongest generative paradigm, flow matching, remains unused for human pose due to two fundamental barriers: the absence of a pre-trained flow prior and the non-Euclidean nature of articulated poses. We overcome both by introducing PoseD-Flow, a novel framework to unify Riemannian Flow Matching (RFM) with training-free guidance for 3D human pose recovery. PoseD-Flow is composed of two contributions: (i) PoseRFM, the first RFM model of human pose, defined directly on the product manifold of joint rotations, and (ii) Riemannian D-Flow, a principled guidance mechanism that, by differentiating through its ODE sampling dynamics, conditions PoseRFM at inference without any task-specific training. Our theoretical analysis shows that the induced dynamics are shaped by data covariance and manifold curvature, yielding a bias toward realistic poses. Across pose completion, denoising, and inverse kinematics, PoseD-Flow establishes new state of the art, particularly under noise, occlusion, and partial observations.

Abstract:
Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e. precisely editing target regions) and invariance (i.e. preserving unedited parts). To handle such challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) motion preservation mechanism that focuses on subtle variations accoding to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. Our source codes and models have been released at: https://github.com/ rocket-ycyer/OmniME.git

Abstract:
Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research (https://visres-bench.github.io).

Abstract:
Generalized Category Discovery (GCD) aims to identify both known and novel classes from unlabeled data with the aid of labeled examples. While promising, most existing GCD methods rely on simultaneous access to labeled and unlabeled datasets--an assumption often impractical in real-world deployments. Continual Category Discovery (CCD) relaxes this requirement by adapting a pre-trained model to streaming unlabeled data, yet it typically assumes domain-consistent data distributions. This places a strong limitation on its applicability. In this work, we study Open Continual Category Discovery (OCCD), where the model must robustly discover previously unseen concepts from real-world data streams that may originate from heterogeneous and shifting domains. To address this, we propose an adaptive framework built on three key ideas. First, we propose a weight-aware separation module, which leverages partial unbalanced optimal transport for instance probability modeling and employs binary response spectrum quantization to generate cues for distinguishing known and unknown categories, enabling automatic sample separation. Second, for known categories, we introduce a cross-domain semantic alignment module that incorporates adversarial learning to perform adaptive prototype matching, thereby enhancing robustness against domain shifts. Finally, for unknown categories, we design a category topology consistency constraint that preserves semantic relationships between known and novel classes during distribution shifts. Experiments show our approach excels at discovering new categories while maintaining strong performance on known ones in evolving domains.

Abstract:
It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network.We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods -- SplitCAM and SplitLRP --improve onstate of the art results on both VGG16 and Resnet18 networks on ImageNet-S across all Quantus saliency metric categories.Second, we exhibit that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.

Abstract:
Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models with improving memory-efficiency during decoding, focusing on addressing the challenges due to the increased high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.

Abstract:
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations--such as translating, rotating, or resizing objects--due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations.Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

Abstract:
Continual Generalized Category Discovery (C-GCD) seeks to incrementally discover new categories from unlabeled data and memorize old categories' knowledge, fostering model adaptability in real-world scenarios. Especially, the unlabeled data is from both old and new classes, requiring the model to recognize previously learned classes while discovering. In response, recent efforts focus on devising specific frameworks and various anti-forgetting strategies, striving for a typical stability-plasticity trade-off. Unlike previous studies, in this work, we first revisit these methods and identify that most of these methods over-protect old classes, hampering the accurate discovery of novel ones. To address this challenge, we introduce the Decouple Your Discovery and Memory (DYDM), a dual-branch architecture that decouples the discovery of new classes and the memorization of old classes. The discovery branch is focused on accurately recognizing new classes, while the memory branch consolidates all identified categories in a recursive manner and functions as the inference branch. Importantly, benefiting from the strong knowledge retention ability of the memory branch, the discovery branch can facilitate the recognition of novel classes from the unlabeled data, achieving a win-win outcome between plasticity and stability. Extensive experiments on various datasets and settings demonstrate the superiority of our approach, achieving leads of up to 9.87%, 7.30%, 3.18%, and 8.25%. Furthermore, our framework can integrate with existing approaches, consistently enhancing their performance.

Abstract:
Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries.The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation.We also propose to utilize the saliency scoresas a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder.By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics.

Abstract:
Generalization remains a critical challenge in deep learning-based point cloud geometry compression. While existing methods perform well on standard benchmarks, their performance collapses in real-world scenarios due to two fundamental limitations: the lack of context models that are robust across diverse data densities, and the inability to efficiently adapt to out-of-distribution (OOD) data. To overcome both challenges, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages coarse-grained spatial priors with fine-grained channel priors to ensure robust context modeling across the entire density spectrum. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. For each instance, it fine-tunes a small subset of network weights and transmits them within the bitstream. The minimal bitrate overhead from these weights is significantly outweighed by the resulting gains in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression while maintaining low complexity. Our code and datasets have been released to encourage reproducible research.

Abstract:
Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos but often suffers from severe performance degradation under domain shifts. Traditional STMap-based methods [??] rely on predefined spatio-temporal representations that offer engineered robustness but discard fine-grained temporal cues. In contrast, end-to-end rPPG models directly learn hierarchical features from raw videos, capturing richer physiological patterns yet remaining highly sensitive to variations in illumination, motion, and camera sensors. To address these challenges, we propose FLOW (Feature-Level Optimal Warping), an Optimal Transport (OT)-driven and plug-and-play framework for multi-source domain generalization in rPPG measurement.FLOW formulates domain shifts as structured Optimal Transport problems and performs feature-level warping to align multiple source domains in a shared latent space. Specifically, a dual-consistency regularization is proposed to enforce both frequency fidelity and mapping invariance, while a shape-adaptive alignment module is designed to enable architecture-agnostic integration without re-training. We further derive a generalization bound based on conditional OT discrepancy, providing theoretical insight into FLOW's robustness under distributional shifts. Extensive experiments across diverse rPPG benchmarks demonstrate that FLOW consistently improves cross-domain generalization while maintaining lightweight and modular deployment.

Abstract:
In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. These tasks are crucial in daily life, such as shaking a bottle or knocking a nail. However, few prior works have explored this task, leading to two main challenges: 1) the imitation methods often fail to complete these tasks within the expected terminal time due to the ineffective utilization of history; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders development of effective solutions in this area. To address these challenges, we first propose the CycleManip framework to achieve cycle-based task manipulation in an end-to-end imitation manner without requiring any extra models, hierarchical structure or significant computational overhead. The core insight is to enhance effective history perception by a cost-aware sampling strategy and to improve historical understanding by multi-task learning. Second, we introduce a cycle-based task manipulation benchmark, which provides diverse cycle-based tasks, and an automatic evaluation method. Extensive experiments conducted in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based task manipulation. The results further show strong adaptability performance in general manipulation, and the plug-and-play ability on imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.

Abstract:
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation.To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities--segmentation masks, human skeletons, DensePose, optical flow, and depth maps--and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints.

Abstract:
While recent foundation models for remote sensing segmentation have shown notable progress, they still fall short in processing diverse multi-modal inputs, synergizing complementary prompt types, and leveraging semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model, which synergistically processes both optical and Synthetic Aperture Radar (SAR) imagery using VIsual, TextuAl, or fused prompts. Based on a novel prompt-and-prediction decoupling strategy, we propose the VITA-Former and VITA-Decoder to decouple multi-modal prompt fusion and prediction process, allowing the model to flexibly support visual-only, textual-only, and fused prompt modes. We train SkySense-VITA with a progressive two-stage strategy: a first stage of Image-Level Alignment Pretraining featuring optical-SAR alignment, and a second stage of Pixel-Level In-context Pretraining using Semantic Granularity Annealing (SGA), a coarse-to-fine curriculum that enables robust hierarchical learning. To support this training, we introduce our new large-scale, multi-modal Sky-VT-300k dataset. Extensive experiments show SkySense-VITA establishes a new state-of-the-art (SOTA) on 18 datasets, with an average performance lead of over 10% mean Intersection over Union (mIoU).

Abstract:
Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.

Abstract:
Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely-labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamic reconstruction, instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improving downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.

Abstract:
As an emerging multi-granularity clustering paradigm, granular-ball computing (GBC) hierarchically represents samples through granular-balls (GBs) to capture compact, multi-scale features. Nevertheless, its effective application to clustering-based segmentation methods (CSMs) remains challenging due to two key issues: representing intrinsic uncertainties and defining a justifiable, semantics-aware quality criterion. To address them, the first segmentation framework based on GBC (SegGBC) is proposed to alleviate the single-granularity limitation of existing CSMs. Concretely, we leverage intuitionistic fuzzy sets (IFS) to explicitly quantify image uncertainty: membership and non-membership encode evidence, and the IFS hesitation degree models residual ambiguity. In addition, a semantic compactness metric criterion (SCM_GB) is designed to characterize semantic information by considering the "stable region" in conjunction with the overall density of GBs. The proposal of "stable region" ensures robust semantics concurrently with high computational efficiency. Extensive experiments demonstrate that the proposed SegGBC achieves promising performance for segmentation. The proposed segmentation GB representation is a plug-and-play front-end, significantly boosting the performance of CSMs by >+3.25% SA and >+3.92% mIoU on standard image and COCO benchmarks. Code is available at supplementary material.

Abstract:
Benefiting from the powerful generative priors of diffusion models, diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance.To achieve efficient Real-ISR, several recent works have designed one-step diffusion-based models.Howerver, unmediatedly feeding LR into a diffusion model creates a distributional gap with the model's original input.A straightforward approach to reduce the distribution gap is to introduce noise to the LR latents. However, directly adding noise inevitably corrupts the content of the LR images.In this study, we propose DNF-SR, a Dual-input and Negative-aware Feature fine-tuning method for Real-ISR.Specifically, we use a dual-input strategy that concatenates the original LR image with the noisy LR input and feeds them into a diffusion-based image editing model, ensuring both high-fidelity one-step super-resolution and improved perceptual and content consistency.Additionally, the noise present in the noisy LR input introduces randomness and diversity into the outputs. We exploit this property and propose a post-training optimization method, Negative-aware Feature Fine-Tuning (NF2T), which guides the model toward producing higher-quality results.NF^2T classifies multiple outputs into positive and negative subsets and then defines implicit policy improvement directions in both the image and feature spaces, thereby further enhancing the stability of the optimization.Extensive experiments show that DNF-SR outperforms other methods.Code will be released.

Abstract:
Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76% IoU and 47% object-F1 on FTW, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model-derived field boundary datasets for five countries.

Abstract:
The robust execution of long-horizon manipulation tasks remains a central challenge in embodied intelligence, necessitating both coherent high-level planning and reliable low-level control. Existing approaches often encounter two critical limitations: the accumulation of prediction errors in subgoal planning, leading to compounding deviations over time; and the planning-execution gap, where high-level abstract plans fail to be effectively grounded in the continuous perception-action space. To address these challenges, we propose a novel unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe). AGiLe introduces a bidirectional latent planning mechanism that jointly optimizes a backward planner and a forward critic. The backward planner generates goal-directed subgoals from the final objective, while the forward critic assesses their reachability, ensuring temporal robustness through sustained consistency in long-horizon planning. Furthermore, AGiLe bridges the planning-execution gap by leveraging affordance as structural guidance, grounding abstract subgoals into dense, pixel-level visual affordances that drive action. This enhances spatial robustness, enabling the system to effectively adapt to semantic and visual distractors. Extensive empirical evaluations across both simulation and real-world settings confirm that AGiLe significantly outperforms strong baselines, achieving an 8.5% improvement over prior state-of-the-art methods and demonstrating strong effectiveness and robustness in long-horizon manipulation tasks. Project website: https://agile-long.github.io.

Abstract:
Diffusion-native watermarking provides a more secure and reliable way to trace images from latent diffusion models (LDMs) by embedding information directly into the generative process. However, existing methods suffer from a fundamental limitation: their embedding capacity is extremely small. We introduce MaxMark, a high-capacity watermarking framework that supports embed rich watermark messages into generated images. MaxMark uses two components: a robust watermark embedding module that enhance the secret message and places them into reliable regions of the latent noise, and a distribution transformation module that maps the watermarked latent back to an approximate Gaussian, ensuring compatibility with the diffusion process and preserving image fidelity. The distribution transformation is implemented with an invertible neural network (INN), whose exactly reversible structure enables precise recovery and efficient training. Experiments show that MaxMark surpasses prior methods in capacity, robustness, and imperceptibility, achieving up to a 46% improvement in bit accuracy for large watermark payloads.

Abstract:
We address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? While neural scene representations achieve high visual realism, they fall short in modeling physical interactions, such as the vibration patterns produced when surfaces are explored with different actions. We propose a framework that learns the correspondence between scene, user action, and haptic response, enabling the generation of tactile signals from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry information and interaction dynamics to synthesize material-specific vibrotactile signals. Experiments show that the generated signals reliably convey both material properties and the user action. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback. Project page available here.

Abstract:
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their ability of geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16 xlarger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on GeoTikz-Instruct dataset which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM(LLM), enhancing reasoning performance in geometric problem-solving.

Abstract:
Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as unknown--all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.

Abstract:
Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors, and visible artifacts. Therefore, we propose Fresco, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10 times speedup on FLUX, and 5 times on HunyuanVideo, while remaining orthogonal to distillation, quantization and feature caching, reaching 22 times speedup when combined with distilled models.

Abstract:
To tell a story effectively, a 3D animation often necessitates carefully planned behaviors of both characters and the camera in the 3D scene, where the camera placement and movement determine how the characters are displayed on screen. Thus, creating storytelling animations can be challenging. While significant progress has been made in the fields of character motion synthesis and virtual cinematography, previous methods focus on either character motion generation or camera motion generation, falling short of handling the two tasks simultaneously. In this paper, we propose a novel diffusion-based generative model to jointly synthesize character and camera motions in 3D space for creating storytelling 3D animations. Our model treats individual characters and the camera in a 3D scene as independent, equally important entities, and explicitly models pairwise interactions among them in the generation process. By being trained on a mixture dataset of real and synthetic character-camera motion data, our model is capable of generating high-quality multi-character motions coupled with compelling camera motions. We show that our model outperforms existing specialized approaches on the human motion generation and camera motion tasks.

Abstract:
Fine-tuning large language models (LLMs) using reinforcement learning (RL) objectives has gained traction, especially in scenarios where labeled data is limited. Building on its success in the language domain, recent efforts have extended RL-based fine-tuning to multimodal tasks. Visual-RFT, for instance, applied Group Relative Policy Optimization (GRPO) to fine-tune multimodal LLMs (MLLMs) across various visual perception benchmarks, achieving notable improvements over standard supervised fine-tuning (SFT). However, its scope was limited by a narrow evaluation of RL adaptation strategies. In this work, we expand the landscape by introducing new RL-based baselines on the same benchmarks and conducting a deeper analysis of GRPO's training dynamics. We identify key limitations--such as reduced generation diversity, constrained policy exploration, and suboptimal reward formulation and aggregation. To address these, we propose DEVA: a framework that enhances Diversity via a flow-based training objective, encourages broader policy Exploration through global entropic regularization, and leverages alignment Volume as a non-verifiable reward combined with harmonic Aggregation. Applied to GRPO and other RL methods, DEVA delivers consistent gains in both quantitative (+5 to +13 points) and qualitative metrics. We further provide visualizations, ablations, and analyses to unpack the contributions of each component in our framework.

Abstract:
Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance.To address these issues, we propose a novel training framework--NoOVD--which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs).Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation--without requiring additional data--thus preventing forced alignment of novel objects with background.Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects.Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.

Abstract:
Most of the models used to generate embeddings for retrieval are not trained for the purpose which leads them to focus on coarse semantic alignment rather than particular object attributes or arrangements. This limits their performance, particularly on challenging problems such as cross-modal fine-grained retrieval. Furthermore, their training objectives lack the discriminative ability required to distinguish between descriptions that are semantically similar but factually different. To address these challenges, we propose POGA (Paraphrased and Oppositional Graph Alignment), a novel framework for fine-grained cross-modal alignment. POGA comprises two core innovations: (1) Multi-source Graph Augmentation (MSGA), which not only generates paraphrased positives and oppositional negatives, but also parses the image and all text variants into structured graphs to provide difference-rich supervisory signals; (2) Hybrid Multi-granularity Alignment (HMA), which defines a composite training objective that jointly optimizes the model at four distinct granularities: including robust dual global alignment, and precise matching at three fine-grained levels: node, relation, and focal disproving. Experiments on benchmarks such as DCI and DOCCI demonstrate that POGA performs favorably against several state-of-the-art methods in long-text understanding and complex relation discrimination.

Abstract:
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for human-centric joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation (FAM) module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance (MA-CFG), a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables the seamless unification of pivotal audio-visual tasks within a single model. Furthermore, we demonstrate that joint multi-task training can further boost the performance of joint generation. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

Abstract:
Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict.To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages.Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.

Abstract:
The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.

Abstract:
Accurately capturing and aggregating spatiotemporal information has become crucial for video object detection. Previous methods mainly perform feature aggregation in the spatiotemporal domain, treating all regions indiscriminately and overlooking both their relative importance and the frequency characteristics that capture periodic motion patterns. This limits the capability of these methods to capture dynamic interactions and adapt to complex scene variations. In this paper, we propose a novel Dual-Domain Feature Aggregation Network (D2FANet) for video object detection, which, to the best of our knowledge, is the first work to introduce the frequency-domain feature aggregation into the video object detection task. By collaboratively modeling spatiotemporal and frequency information, our D2FANet enhances motion awareness and temporal consistency, thereby improving detection accuracy. First, we develop a frequency-domain feature aggregation module that decomposes frame features into high- and low-frequency distributions and reinforces object query representations through aggregating multi-scale frequency features. Second, we design a spatiotemporal-domain feature aggregation module that leverages an importance guidance mechanism to dynamically emphasize regions of different importance and reinforces object query representations via guiding the aggregation of spatiotemporal features. Experiments on the ImageNet VID and EPIC-KITCHENS datasets demonstrate that D2FANet achieves state-of-the-art performance.

Abstract:
Visual models are deployed on many Internet-of-Things (IoT) devices to power a variety of visual applications at the network edge. These models often need to be fine-tuned on-device continually to adapt to changing operating environments timely. However, the computing and energy overheads incurred are often overwhelming for resource-constrained IoT devices. Existing methods score sample importance based on the entire model via multi-epoch training, incurring overhead that may even exceed the training overhead reduction. To reduce these fine-tuning overheads, this paper presents LoPrune, a novel data pruning method that identifies and removes samples with negligible contributions to model adaptation. The key idea is to evaluate sample importance via a Trainable Subspace Alignment (TSA) Score to align the importance estimation with accurate update directions of the learnable adapter, i.e., Low-Rank Adaptation (LoRA). Specifically, LoPrune projects the influence function onto the LoRA subspace, enforcing consistency between the importance score and the model's updatable directions while substantially reducing the problem's dimensionality. It then leverages Kronecker-Factored Approximate Curvature to approximate the change of learnable adapter induced by a sample as its TSA score, retaining higher-scoring samples. Experiments with four representative visual models fine-tuned on three datasets demonstrate that compared with the best state-of-the-art data pruning baselines, LoPrune can reduce fine-tuning overhead by up to 72.9%, achieving a 3.69 xtraining speedup while improving fine-tuning accuracy by 3.50%.

Abstract:
Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an agentic reasoning-over-hierarchical-memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.

Abstract:
Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Codes will be released for public research.

Abstract:
Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.

Abstract:
On-device AI systems increasingly adopt a single foundation model equipped with task-specific Low-Rank Adaptation (LoRA) modules, forming a multi-LoRA LLM that supports multiple tasks.We study how to adapt such a model to a new task on memory-constrainted devices.Although LoRA reduces trainable parameters, fine-tuning a full set of modules remains memory-intensive.To improve efficiency, we apply sparse updating, training a subset of LoRA modules within the memory budget.However, existing sparse updating methods assume all candidate parameters are instantiated and cannot estimate the importance of modules that do not yet exist, while prior memory models designed for sequential networks fail to capture the blockwise parallel structure of Transformers.We propose TaskIT, a framework for memory-efficient fine-tuning via cross-task importance transfer. TaskIT predicts pre-insertion module importance by transferring from previously tuned tasks and employs a block-based memory predictor that captures parallel and sequential dependencies of Transformer blocks. A dynamic programming scheduler then selects module locations, numbers, and ranks to maximize accuracy within the memory budget.Experiments on uni-modal and cross-modal benchmarks show that TaskIT achieves superior accuracy-memory tradeoffs compared with Zero-FT, non-LoRA, and LoRA-based fine-tuning methods.

Abstract:
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

Abstract:
Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^ 2 -SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^ 2 -Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^ 2 -Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^ 2 -SAM, achieving new state-of-the-art performance on Ego-Exo4D (Ego-Exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

Abstract:
Industrial Anomaly Detection (IAD) has attracted significant attention and witnessed rapid development. However, the advancement in this field is hindered by two key issues: the performance saturation of existing benchmarks, limiting discriminative evaluation of different IAD methods, and the absence of benchmarks tailored to assess recent multi-modal large language models (MLLMs) in anomaly detection. To this end, we present Omni-AD, a comprehensive IAD benchmark featuring: i) Large scale: The dataset consists of approximately 35K images (6xlarger than MVTec AD) with 150 product categories (10xlarger than MVTec AD) spanning 16 industrial sectors, delivering unprecedented category diversity and greater dataset scale compared with existing datasets. ii) Versatility: The benchmark supports both conventional unsupervised and emerging MLLM-based IAD evaluation protocols. The latter is achieved by defining three subtasks of progressive difficulty, with two structured as visual question answering (VQA) and one as visual grounding. iii) Challenge: Extensive experimental results of state-of-the-art methods reveal that the Omni-AD benchmark is more challenging than existing benchmarks, which can drive the future development of the IAD field.

Abstract:
Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.

Abstract:
Face-swapping deepfakes allow realistic identity transfer, which can serve creative purposes but increases the risk of identity abuse. A proactive defense aims to prevent deepfake creation by obstructing identity feature extraction from input images, essential for identity-driven face-swapping. Existing proactive defense approaches aim to protect faces by hindering accurate identity feature extraction, but tend to introduce visible artifacts and fail to degrade the visual quality of the face-swapping deepfakes. This work proposes a proactive face-swapping defense using identity blending and attribute distortion (DeepProtect) that integrates global identity fusion in the latent space and local prompt-driven adversarial watermarking to address these problems. This work dilutes distinct identity representations by channel-wise blending of multiple identities in the latent space and optimizing the generator for visual consistency. The proposed approach distorts facial components in the identity space, directly influencing how faces are reconstructed in deepfakes. This approach applies semantic directions derived from user-provided text prompts to embed imperceptible adversarial watermarks that selectively distort facial attributes, affecting the visual fidelity of deepfake results. The proposed method hinders face-swapping deepfakes while preserving the perceptual quality of the protected images, offering a robust and practical solution for facial privacy protection. The experimental results reveal that DeepProtect effectively defends against face-swapping deepfakes while preserving visual consistency.

Abstract:
Large-scale multi-view clustering aims to explore the complementary and consistent information among different views in efficient manner. Despite the impressive performance gained by the existing methods, they just perform anchor learning in a single space with the orthogonal or some other constraints from the multi-view data, leading to undesired anchors. The anchors can simultaneously occur in more spaces and the complementary information among these spaces is able to be adopted for learning anchors. Meanwhile, the space with basis being the anchored cluster center is neglected to learn anchors by most existing works. In this work, we propose learning anchor in Dual Orthogonal Space for Fast Multi-view Clustering (DOSFMVC). DOSFMVC conducts anchor learning in dual orthogonal space, aiming at utilizing the complementary information among two spaces in producing anchors with high quality. DOSMFVC introduces the consensus anchored cluster center as basis of the extra space and clustering indicator of anchors based on this basis in anchor learning. The anchor learning and partition are integrated into a unified model, where the final cluster assignment can be adopted for clustering results. Extensive experiments confirm the superiority of our method compared with some state-of-the-art methods on several benchmark datasets.

Abstract:
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations. Our project page: https://echoesovertime.github.io

Abstract:
Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

Abstract:
Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world's inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object's softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically-valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young's Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point-estimates and the continuous nature of physical reality.

Abstract:
Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks.However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability.To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning.We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms \pi_ 0 by 2.4% on LIBERO, 10% on LIBERO-LONG, and outperforms \pi_ 0 and \pi_ 0.5 by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks.

Abstract:
Large-scale pre-trained image-text models exhibit robust multimodal representation, yet applying contrastive language-image pretraining (CLIP) to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A])struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose sound-aware prompt learning (SouPLe), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench confirm that SouPLe significantly improves localization and segmentation performance.

Abstract:
Generalizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and they estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable to GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MUGeNeRF: a multi-view uncertainty-guided distractor-aware GeNeRF method, aim to effectively alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractions. We explicitly decompose distractor awareness into two complementary uncertainty modeling tasks: Source-view uncertainty, serving as a transferable prior during the feed-forward process, captures structural inconsistencies across source views caused by viewpoint changes or dynamic factors; Target-view uncertainty focuses on observation anomalies caused by transient changes to infer distractor spatial distribution. These two uncertainties are integrated into a heteroscedastic reconstruction loss that guides adaptive supervision weighting, boosting the model's capability to detect and suppress distractors, and enabling more robust geometric modeling. Extensive experiments demonstrate that our method not only outperforms existing GeNeRF approaches but also rivals the performance of scene-specific distractor-free NeRFs.

Abstract:
Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.

Abstract:
Generating realistic and expressive audio-driven talking avatars remains a central challenge in digital human synthesis. Existing methods often depend on intermediate representations such as pose estimations for natural body motion, which restricts flexibility and adds visual distortions. Moreover, most audio-driven approaches rely on discrete emotion classifiers or text labels to regulate facial expression, reducing complex affective dynamics to coarse categories such as happy, sad, or angry. Such categorical supervision fails to capture the continuous and fine-grained speech dynamics (rhythm, energy, intensity) resulting in limited synchronization and emotionally shallow motion. To overcome these limitations, we present SyncDreamer, a unified Diffusion Transformer framework that generates identity-preserving and emotionally expressive talking avatars from only a single image, speech audio, and text prompt.We propose a visual adapter with Attention Localization Loss to maintain identity fidelity, further incorporating an audio dynamics encoder for rhythm- and emotion-aware motion, and an RL-based Cross-Modal Prompt Enhancer grounding textual cues in visual context for fine-grained motion control. Extensive experiments on portrait and full-body benchmarks demonstrate state-of-the-art performance in realism, synchronization accuracy, and semantic controllability, establishing a scalable foundation for expressive digital avatars in interactive and creative applications.

Abstract:
Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label, covariate, and concept) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.

Abstract:
Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or, fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.

Abstract:
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.

Abstract:
Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.

Abstract:
Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for diffusion based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.

Abstract:
Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning, tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMusE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes: zero-shot, guided, and agentic, and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data and parameter-efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMusE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.

Abstract:
Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.

Abstract:
Bird's-Eye-View (BEV) detection has become a dominant paradigm for 3D object detection in autonomous driving, due to its strong perception capability. However, most existing methods mainly focus on constructing high-quality BEV feature representations, while neglecting the design of task-specific detection heads. In practice, they directly adopt the center-based head originally developed for 2D detection, without any specific optimization. This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression (NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. Spe-BEVHead introduces three BEV-specific adaptations: (1) a Rotated Box Kernel that generates geometry-aligned classification weights, (2) a Local Response Refinement Module (LRRM) that suppresses non-peak responses and improves end-to-end performance, and (3) a dual-branch architecture that provides richer supervisory signals to promote more robust learning while inherently preserving the performance for end-to-end inference. Extensive experiments show that Spe-BEVHead can be seamlessly integrated into existing BEV backbones, delivering direct performance gains while retaining competitive performance under the challenging end-to-end setting.

Abstract:
The integration of vision-language models (VLMs) is driving a new generation of embodied agents capable of operating in human-centered environments. However, as deployment expands, these systems face growing safety risks, particularly when executing hazardous instructions. Current safety evaluation benchmarks remain limited: they cover only narrow scopes of hazards and focus primarily on final outcomes, neglecting the agent's full perception-planning-execution process and thereby obscuring critical failure modes. Therefore, we present SAFE, a benchmark for systematically assessing the safety of embodied VLM agents on hazardous instructions. SAFE comprises three components: SAFE-THOR, an extensible adversarial simulation sandbox with a universal adapter that maps high-level VLM outputs to low-level embodied controls, supporting diverse agent workflow integration; SAFE-VERSE, a risk-aware task suite inspired by Asimov's Three Laws of Robotics, comprising 45 adversarial scenarios, 1,350 hazardous tasks, and 9,900 instructions that span risks to humans, environments, and agents; and SAFE-DIAGNOSE, a multi-level and fine-grained evaluation protocol measuring agent performance across perception, planning, and execution. Applying SAFE to nine state-of-the-art VLMs and two embodied agent workflows, we uncover systematic failures in translating hazard recognition into safe planning and execution. Our findings reveal fundamental limitations in current safety alignment and demonstrate the necessity of a comprehensive, multi-stage evaluation for developing safer embodied intelligence.

Abstract:
Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed". For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and proprietary models (\leq13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) that MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.

Abstract:
In recent years, while Scene Graph Generation has advanced significantly, mainstream methods remain constrained by predefined object and relationship categories, limiting generalization to open real-world scenarios. Inspired by open vocabulary object detection, recent efforts have expanded SGG to the open vocabulary domain. However, these models often rely on off-the-shelf VLMs, lacking discriminative attribute extraction and suffering from limited object-relationship semantic interaction, which leads to misclassification of novel categories. To address these issues, we propose the MoE Feature Decoupling (MoE-FD) framework for Open Vocabulary Scene Graph Generation. MoE-FD adaptively learns feature decoupling for objects and relationships via multiple experts, prioritizing critical features through gating network weights. Moreover, it models semantic interactions between objects and relationships using iterative cross-attention, enhancing relationship triple associations and visual-semantic alignment. The main contributions of MoE-FD are threefold: (1) A MoE-based feature decoupling framework that adaptively enhances discriminative feature representation for objects and relations. (2) Semantic interaction modeling between objects and relations to strengthen relationship triple associations and image-text alignment accuracy. (3) Extensive experiments demonstrate the effectiveness of MoE-FD on the Visual Genome dataset.

Abstract:
The heavy computational burden of Large Vision-Language Models (LVLMs) stems primarily from the lengthy visual token sequences generated by their vision encoders. To mitigate this, recent work has shifted towards pruning tokens within the vision encoder. However, we observe that these methods predominantly rely on a suboptimal decoupled heuristic method. This strategy is conceptually flawed: it is prone to sampling collapse, fails to fundamentally eliminate token redundancy, and tends to systematically discard secondary yet important semantic clusters. Addressing this limitation, this paper proposes to formalize visual token pruning as a unified Representativeness Optimization Problem. We introduce SCoRe (Salience-Coverage Reduction), a unified optimization method theoretically grounded in the Weighted k-Center Problem. SCoRe constructs the final token set greedily: at each iteration, it selects the token that maximizes the unified representativeness score, thereby optimizing global representativeness. Extensive experiments demonstrate that SCoRe achieves State-of-the-Art (SOTA) performance across multiple benchmarks. Notably, with negligible computational overhead, our method reduces tokens by 94.4% while retaining 95% of the full performance.

Abstract:
As recent AI models have successfully generated high-resolution photorealistic images, it has also been socially important to detect whether an image is generated by AI. Since training data for the detection task is often not available due to the diversity of generative models, training-free detection approaches have been practically considered. A common approach is to utilize the image-level reconstruction error from the latent diffusion model (LDM). However, we find this score suffers from instance-specific biases, particularly in images with simple backgrounds. To this end, we propose a novel image-level debiasing score function that cancels out background contribution by normalizing the reconstruction error on the augmented images with similar background information. To be specific, we show that rotation and low-pass filtering are effective augmentation strategies. To promote generalization to broader generative models, we newly explore latent-level reconstruction error as an additional training-free signal. However, we observe that the latent-level score also suffers to latent-specific bias. To mitigate this, we introduce a rotation-based latent-level debiasing score based on the normalization of the rotated latent. We unify the aforementioned scores into a single unified debiasing score, RDD, which achieves state-of-the-art training-free detection performance across diverse generative models. Furthermore, our framework can be robust to corruption of the examined images.

Abstract:
In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks, directional incoherence at detector neck and task conflict at detecting head. Ulitising fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through frequency spectrum and aligns the main direction to a certain orientation. Then we propose two plug and play modules : FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method can greatly improve previous work. Particularly, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 datasets with single scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code will be public upon acceptance.

Abstract:
Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails in texture-sparse regions and in the presence of moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos. Central to our method are scale-aware spatio-temporal anchors (S2A-Anchors), which enable a unified optimization via three key decoupling mechanisms: (i) separating ego-motion from object motion, (ii) isolating static geometry from temporal deformation, and (iii) adapting to scale variation between distant terrain and nearby objects. This design effectively stabilizes optimization under large motion and scale imbalance. Extensive experiments on UAV and driving benchmarks show that AeroGS achieves state-of-the-art rendering quality (PSNR/LPIPS), precise trajectory recovery (ATE/RPE), and faithful motion reconstruction, consistently surpassing recent pose-free baselines.

Abstract:
Stereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process and enhance the accuracy of probability estimation, we introduce a bi-directional multi-reference entropy model that utilizes a checkerboard partitioning strategy and the stereo VSSB to get rich inter-view priors. Experimental results demonstrate that our MambaSIC outperforms the state-of-the-art methods in both compression performance and efficiency.

Abstract:
Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We advocate a Data-to-Data Flow Matching framework that learns deterministic transformations between paired views, enhancing view-consistent synthesis through explicit data coupling. Building on this, we propose Probability Density Geodesic Flow Matching (PDG-FM), which aligns interpolation trajectories with density-based geodesics of a data manifold. To enable tractable geodesic estimation, we employ a teacher-student framework that distills density-based geodesic interpolants into an efficient ambient-space predictor. Empirically, our method surpasses diffusion-based baselines on Objaverse and GSO30 datasets, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.

Abstract:
Test-time prompt tuning (TPT) has emerged as a powerful technique for adapting pre-trained vision-language models (VLMs) to diverse downstream tasks, including image classification and visual reasoning. With the rise of text-driven object detectors, we extend TPT to object detection, unlocking new capabilities for cross-domain adaptation. However, a critical challenge in TPT is the inherent miscalibration caused by entropy minimization: domain shifts often lead to incorrect predictions, and enforcing high confidence exacerbates miscalibration, ultimately degrading performance. To tackle this, we introduce InsCal, a novel framework designed to enhance cross-domain object detection through three key innovations: (1) extending TPT to a multi-source paradigm, enabling knowledge aggregation across diverse domains; (2) reducing domain gaps via a novel text-driven style transfer strategy that aligns features to the source domain without requiring reference images; and (3) refining the entropy minimization objective with instance-specific calibration, ensuring robust and well-calibrated adaptation. Our approach not only mitigates miscalibration but also significantly improves cross-domain object detection performance, setting a new benchmark for test-time adaptation in VLMs.

Abstract:
Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of temporally dense ground-truth annotations limits the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion.To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization.Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.

Abstract:
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website.

Abstract:
Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.

Abstract:
Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

Abstract:
The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework's superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions. In cross-domain testing on the MDSM dataset, AMD achieves the best average performance, with ACC, mAP, and mIoU scores of 88.18, 60.25, and 61.02, respectively.

Abstract:
Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset \texttt Conicus3D , which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have led to significant progress in video understanding. Due to limited context windows and computational overhead, most MLLMs adopt uniform frame sampling. This approach is at high risk of missing critical visual information and constrains performance especially for long videos. To address this problem, we propose a lightweight frame selection method to identify keyframes and train it via a two-stage strategy. In the pre-training stage, the frame selector learns to model relevance between individual video frames and queries. In the reinforcement learning (RL) stage, we employ a hierarchical reward that evaluates selection quality at combination and frame levels. Through stochastic exploration of frame combinations, the selector learns to identify and retain frames that improve task performance rather than merely maximizing query relevance, which can be misleading. The selected frames serve as input to downstream MLLMs for video understanding and reasoning. Experimental results demonstrate the proposed selector improves performance of diverse downstream MLLMs across benchmarks spanning medium to long videos.

Abstract:
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

Abstract:
Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formula, table) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format sensitive document, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose Format Decoupled Reinforcement Learning (FD-RL) , which leverages high-entropy patterns for targeted optimization. Our approach employs entropy-based data filtration strategy to identify format-intensive instances, and adopt format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, demonstrating highly competitive performance among end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.

Abstract:
The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training. Code for IGen will be made publicly available.

Abstract:
While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.

Abstract:
Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.

Abstract:
Worldwide visual geo-localization aims to determine the geographic location of an image anywhere on Earth using only its visual content. Despite recent progress, learning expressive representations of geographic space remains challenging due to the inherently low-dimensional nature of geographic coordinates. We formulate global geo-localization as aligning the visual representation of a query image with a learned geographic representation. Our approach explicitly models the world as a hierarchy of learned geographic embeddings, enabling a distributed and multi-scale representation of geographic space. In addition, we introduce a semantic fusion module that efficiently integrates appearance features with semantic segmentation through latent cross-attention, producing a more robust visual representation for localization. Experiments on five widely used geo-localization benchmarks demonstrate that our method achieves new state-of-the-art results on 22 of 25 reported metrics. Ablation studies show that these improvements are primarily driven by the proposed geographic representation and semantic fusion mechanism.

Abstract:
Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance.We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling.Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.

Abstract:
Parameter-efficient fine-tuning (PEFT) methods have recently gained popularity for applying deep neural networks on small datasets as they reduce overfitting, simplify deployment, and enable fast training. We demonstrate that for dense image prediction tasks, a well-designed and lightweight dense readout on top of a frozen large backbone can surpass state-of-the-art PEFT methods in both efficiency and accuracy. Our parameter-efficient readout module combines interpolation and attention for fine-grained dense prediction. It integrates seamlessly with a wide range of pretrained vision backbones such as DINOv3 and can be combined with low-rank adaptation (LoRA). We achieve competitive or superior performance in semantic segmentation, object detection, pose estimation and semantic contour prediction, offering an efficient alternative to current PEFT techniques.

Abstract:
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data--especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. By decoupling learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.

Abstract:
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the U-RxnDiagram-15k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry.

Abstract:
Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone. Code: \href https://www.tue-mps.org/videomt/ https://www.tue-mps.org/videomt/ .

Abstract:
False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.

Abstract:
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling.

Abstract:
Modern computer vision models from different architecture families, e.g., CNNs, Vision Transformers, and MLP-Mixers, achieve remarkably similar aggregate performance on standard benchmarks, masking potential systematic differences in how they process visual information. We introduce a simple yet revealing framework to identify where pretrained model families diverge: by systematically mapping the high-disagreement tail versus the low-disagreement tail of the image distribution. Analyzing 12 pretrained models spanning three architecture families on ImageNet validation set, we discover that controversial images exhibit approximately 4.5xhigher disagreement than consensus images (Controversy Score: 4.46). Despite mean accuracy around 80%, models show structured disagreement patterns: within-family agreement exceeds cross-family agreement, with CNNs and ViTs forming distinct clusters while MLPs show lower overall alignment. Crucially, only the top 10% most controversial images drive the majority of architectural divergence, constituting a small but informationally dense subset that reveals fundamental differences masked by aggregate metrics. Our analysis demonstrates that architectural choice matters most on this concentrated controversy space, providing researchers with actionable guidance for model selection and ensemble construction.

Abstract:
Medical Visual Question Answering (Med-VQA) holds significant promise for clinical decision support, yet faces challenges due to limited annotated data and the high computational demands of existing large vision-language models. We propose MedFG-VQA, a lightweight framework that leverages a memory bank to augment DCT-based low-frequency features and employs graph-enhanced cross-attention for effective visual-textual alignment. Specifically, our approach features two key components: Frequency-Memory Fusion (FMF), which enhances low-frequency features by retrieving from a learnable memory bank built on DCT decomposition, and Graph-Aware Cross-Attention (GACA), which aligns visual-textual features via cross-attention and refines them through graph-convolutional aggregation. To address data scarcity, we construct SynMed-VQA, a large-scale synthetic dataset comprising over 2 million question-answer pairs across 9 imaging modalities and 10 major organs, generated with GPT-4o. Extensive experiments on SynMed-VQA and three other standard biomedical VQA benchmarks demonstrate that MedFG-VQA achieves competitive or superior performance compared to much larger models while maintaining significantly lower computational costs, highlighting its efficiency and potential for clinical deployment.

Abstract:
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility.In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos.Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency.

Abstract:
Visual localization, i.e., the problem of estimating the camera pose from which an image was taken, is an important part of applications such as augmented reality and autonomous robots. Many of these applications require a compact memory footprint. Thus, a considerable amount of work has been spent on designing memory-efficient scene representations for visual localization. In this paper, we focus on compressing the 3D structure of the scene by selecting a subset of points from a Structure-from-Motion (SfM) point cloud. In contrast to prior work, which aims to solve (complex) optimization problems, we propose a simple strategy that is almost trivial to implement. Our compression strategy is based on the idea of selecting triplets of points such that the camera pose of each database image (used to build the SfM point cloud) can be accurately estimated from these triplets. Despite its simplicity, our strategy performs similarly to or better than current state-of-the-art structure compression approaches. Combined with standard product quantization approaches to compress feature descriptors, our approach compares favorably with recent learning-based approaches for compact visual localization.

Abstract:
While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations--unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness.

Abstract:
Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenges lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model's self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model's failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments demonstrate its effectiveness and robustness.

Abstract:
Non-line-of-sight (NLOS) imaging seeks to recover hidden-scene information from indirect light transport beyond the direct line of sight. Existing NLOS methods can be broadly categorized into active and passive approaches. Active methods rely on controlled illumination and time-resolved sensors, but their dependence on specialized and expensive hardware limits practical deployment. Passive methods instead use steady-state or uncontrolled indirect reflections with conventional sensors, but often suffer from weak signals, unstable observations, and insufficient measurement constraints. In this paper, we present the Similarity-Likelihood Diffusion Network (SLD-Net) for corner-camera hidden-person imaging from multi-exposure wall observations. SLD-Net consists of two stages: DeLi-Inversion estimates an initial reconstruction and a pixel-wise precision map to construct a heteroscedastic pseudo-likelihood, and SiCo-Diffusion incorporates this likelihood into deterministic DDIM sampling and fuses it with the diffusion prior via annealed Bayesian precision updating, producing reproducible and measurement-consistent reconstructions. Experiments on the Reflect-Corridor and Reflect-Room datasets demonstrate consistent improvements over generic, physics-inspired, and NLOS-specific baselines. SLD-Net improves PSNR from 13.84 to 15.58 dB and reduces FID from 264.91 to 73.54 on Reflect-Corridor, and improves PSNR from 11.58 to 12.49 dB and reduces FID from 177.05 to 26.89 on Reflect-Room.

Abstract:
Diffusion models achieve remarkable generation quality, yet face a fundamental challenge known as memorization, where generated samples can replicate training samples exactly. We develop a theoretical framework to explain this phenomenon by showing that the empirical score function (the score function corresponding to the empirical distribution) is a weighted sum of the score functions of Gaussian distributions, in which the weights are sharp softmax functions. This structure causes individual training samples to dominate the score function, resulting in sampling collapse. In practice, approximating the empirical score function with a neural network can partially alleviate this issue and improve generalization. Our theoretical framework explains why: In training, the neural network learns a smoother approximation of the weighted sum, allowing the sampling process to be influenced by local manifolds rather than single points. Leveraging this insight, we propose two novel methods to further enhance generalization: (1) Noise Unconditioning enables each training sample to adaptively determine its score function weight to increase the effect of more training samples, thereby preventing single-point dominance and mitigating collapse. (2) Temperature Smoothing introduces an explicit parameter to control the smoothness. By increasing the temperature in the softmax weights, we naturally reduce the dominance of any single training sample and mitigate memorization. Experiments across multiple datasets validate our theoretical analysis and demonstrate the effectiveness of the proposed methods in improving generalization while maintaining high generation quality.

Abstract:
Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data-limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines

Abstract:
Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors.Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective.We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36).Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation.We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures.Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.

Abstract:
Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments.Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data.However, existing PTQ methods for diffusion models often rely on manual, architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines.To address these limitations, we propose SegQuant, a deployment-aware quantization framework that adaptively combines complementary techniques to enhance cross-model versatility.SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations using a hardware-native dual-path computation, avoiding performance penalties from custom implementations, which is crucial for maintaining visual fidelity in generated outputs.SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.

Abstract:
Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT consists of diverse domains ranging from natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

Abstract:
We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision-language systems.

Abstract:
Saliency-based explainability methods are widely used to interpret deep learning models in medical imaging, yet many existing approaches rely on white box access of models, which is not always possible due to privacy concerns. In this work, we introduce MedLIME, a novel, model-agnostic explanation framework designed to enhance the robustness and fidelity of saliency maps for medical imaging abnormality localization. Building upon the Local Interpretable Model-agnostic Explanations (LIME) paradigm, MedLIME integrates three key components: (1) Generative Masking (GM), (2) Supervised Test-Time Adaptation (STT) and (3) a Evidence-based Regularization (EBR) to improve the saliency map generation accuracy of LIME. Extensive experiments on multiple medical datasets, across three model architectures demonstrate that MedLIME consistently outperforms gradient-based and perturbation-based baselines in abnormality localization as measured by AUPRC. Our results highlight that incorporating generative reconstruction, adaptive perturbation and data-driven regularization improves the reliability and interpretability of medical imaging models.

Abstract:
Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures.To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image \to noise" and "noise \to image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones.Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability. Our code and models will be released upon acceptance.

Abstract:
State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency.

Abstract:
Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

Abstract:
Multimodal person Re-IDentification (ReID) aims to reliably associate specific individuals by utilizing complementary information from heterogeneous modalities. In contrast, low-frequency radio frequency (RF) signals, with their superior penetration capability and illumination invariance, provide ideal complementary information. However, a key challenge in fusing these heterogeneous modalities lies in the conventional approach that relies heavily on cross-modal distribution matching, which often over-regularizes and weakens the discriminative capacity within each modality. Rather than enforcing direct distribution alignment, canonical correlation analysis (CCA) constructs a shared subspace that maximizes cross-modal correlation, inherently balancing modality specificity and shared semantics. Inspired by this, we reformulate cross-modal alignment as a correlation maximization problem, avoiding direct constraints on feature distributions and guiding the model to harmonize intra-modal discriminative learning with cross-modal alignment. Specifically, VRCLIP first refines CLIP's visual encoder with illumination-disentangling objectives, then aligns RGB and RF embeddings in a canonical correlation subspace, and finally employs an RF-anchored reliability gate for adaptive fusion. To advance the area, we will release VRR, the first large-scale vision-radio ReID dataset with over 650K paired image-radar samples and position annotations for 31 participants. Extensive experiments show state-of-the-art 93.9% mAP and robust generalization across diverse lighting and occlusion conditions.

Abstract:
Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, only modestly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.

Abstract:
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. The benchmark spans four levels: Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. To assess both effectiveness and efficiency, we further propose the Efficiency-Quality-Aware (EQA) metric, which measures task success alongside action redundancy. Extensive evaluations reveal that precise visual grounding is the critical determinant of performance, underscoring the advantages of modular designs with specialized grounding modules. Moreover, all agents suffer from substantial inefficiencies, frequently completing tasks with excessive steps despite eventual success. Performance also degrades on complex or cross-application tasks, exposing weaknesses in memory, planning, and adaptive reasoning. By providing broad coverage, standardized protocols, and novel metrics, MMBench-GUI establishes the first comprehensive foundation for advancing GUI agent research.

Abstract:
Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The implementation of our approach will be made publicly available upon the paper's publication.

Abstract:
Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs.Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field.Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by reprojecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency and resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.

Abstract:
Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moire interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moire operators; (2) a text-anchored invariant feature learner that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a zero-watermarking head that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.

Abstract:
Although industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While visual prompting offers a scalable alternative for industrial inspection, existing methods often suffer from prompt embedding collapse due to high intra-class variance and subtle inter-class differences. To resolve this, we propose UniSpector, which shifts the focus from naive prompt-to-region matching to the principled design of a semantically structured and transferable prompt topology. UniSpector employs the Spatial-Spectral Prompt Encoder to extract orientation-invariant, fine-grained representations; these serve as a solid basis for the Contrastive Prompt Encoder to explicitly regularize the prompt space into a semantically organized angular manifold. Additionally, Prompt-guided Query Selection generates adaptive object queries aligned with the prompt. We introduce Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, where UniSpector significantly outperforms baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enable a scalable, retraining-free inspection paradigm for continuously evolving industrial environments, while offering critical insights into the design of generic visual prompting.

Abstract:
Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, location, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and propose TIGeR, a unified framework for Time, Images and Geo-location Retrieval. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By preserving the underlying location identity despite large appearance changes, TIGeR enables retrieval based on where and when a scene was captured, rather than purely on visual similarity. To support this task, we design a multistage data curation pipeline and propose a new diverse dataset of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year, 8% time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

Authors: Junxuan Li, Rawal Khirodkar, Egor Zakharov, Jihyun Lee, Zhaoen Su, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Zhongshi Jiang, Lingchen Yang, Ariyan Zarei, Marco Pesavento, Yichen Xu, Chengan He, He Wen, Giljoo Nam, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Rinat Abdrashitov, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

Abstract:
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

Abstract:
Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model's training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the member region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature--a characteristic gradient-norm collapse trajectory--that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

Abstract:
Despite significant progress in text-to-video generation, current models still suffer from unrealistic dynamics, temporal inconsistency, and unstable semantic alignment. Existing preference alignment approaches rely on costly and often ambiguous human or VLM-based video preference annotation, which has become a major bottleneck for scaling data. To address this challenge, we propose an annotation-free preference alignment method that constructs accurate preference pairs through video continuation.We extend a pretrained video generation model into a continuation model and apply continuation with different amounts of reference frames while keeping the total video length fixed. As generated segments are inferior to ground-truth frames and and fixed-length continuations conditioned on more reference frames contain less generated content, they exhibit higher fidelity than those with fewer references, naturally inducing a preference order.We further introduce Asymmetrical DPO, which computes preference loss on all continuation regions except the shared prefix conditioning frames and normalizes it by their length, preventing spurious preference signals from leaking into the conditioned portion.Experiments across multiple benchmarks show that our method delivers significant improvements in dynamics realism, temporal coherence, and semantic alignment over existing DPO-based approaches, while fully eliminating the need for human preference labeling or auxiliary reward models.

Abstract:
Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1. Modifying the geometry of visible Gaussians to respect semantic boundaries, and, 2. Modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian parameters. Our second loss also propagates gradients to the Gaussian parameters, but does so without passing through the rasterization. This allows it to modify the geometry of the scene, even if not much transmittance arrives to a Gaussian (partial or non-visible). Exhaustive comparisons to 12 state of the art methods over 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

Abstract:
3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. The pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT generalizes plausible poses across diverse character morphologies, surpassing prior approaches restricted to narrow-category transfer (e.g., humanoid-to-humanoid).

Abstract:
Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only one single noisy image and one single dark frame per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to concurrent approaches, our method neither relies on simplified parametric models nor on large sets of clean-noisy image pairs. It is accurate and practical. Moreover, our synthesis method leads to state-of-the-art performances on multiple low-light denoising benchmarks.

Abstract:
Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts. Code and data will be made publicly available.

Abstract:
In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views.Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy.To enable cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features.In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions.The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness.Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation.Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise.Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping.Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry.Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at GitHub.

Abstract:
Recent advances in vision-language models (VLMs) have revealed both the promise and the rigidity of large-scale pretraining. Despite their impressive zero-shot generalization, existing adaptation paradigms--whether prompt-tuning, adapter injection, or fine-tuning--remain class-specific, modality-biased, and structure-agnostic. However, these design choices limit reasoning-level transfer across tasks. To this end, we rethink adaptation as a shared conceptual structure rather than a per-class specialization. We propose CASPA (Concept-Anchored Semantic Prompt Adapter), a dual-anchor semantic adapter that jointly learns shared text and image anchors as a bidirectional conceptual interface between modalities. Each class learns a soft association distribution over these anchors, producing compositional representations which enable parameter sharing and semantic reuse. To further align visual and textual reasoning spaces, CASPA employs Semantic Cross-Consistency Regularization (S-XCR), enforcing geometric and semantic agreement between text- and image-conditioned anchor mixtures. CASPA, therefore, provides a structurally constrained alternative to class-conditional prompt parameterization while keeping the CLIP backbone frozen. Evaluated across four regimes, Base-to-Novel setup, cross-data transfer, few-shot, and backbone-agnostic evaluations, on eleven diverse visual recognition datasets, CASPA matches or outperforms state-of-the-art methods.

Abstract:
The proliferation of AI-generated content (AIGC) has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property (IP). In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness.To address the gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves a robust watermark embedding in no fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution (ID) and out-of-distribution (OOD) data.

Abstract:
The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly "hard noise" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional "small loss hypothesis". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a "diagonal negative combination" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.

Abstract:
Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals--without sacrificing interpretability and traceability--remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding, please refer to Supplementary Material.

Abstract:
The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code will be available on GitHub.

Abstract:
Recent advances in text-to-image (T2I) generation can produce highly resembling images of copyrighted characters, often indistinguishable from official depictions, raising serious concerns about intellectual property infringement. Consequently, robust detection of copyright character infringement is urgently needed. Yet, existing methods exhibit limited alignment with human judgments regarding the likelihood of infringement. To bridge this gap, we propose CopyLens, a novel prompt optimization framework that automatically refines textual prompts for vision-language model-based detectors to better match human infringement judgments. Our approach establishes a closed-loop refinement process between a large vision-language model (LVLM) and a large language model (LLM): the LVLM assesses generated images for copyright detection, while the LLM iteratively optimizes detection prompts via meta-prompting, guided by feedback signals derived from human annotation consistency. To facilitate the assessment of prompt-human alignment, we introduce CopyChars, a new large-scale dataset of over 7,000 AI-generated images spanning more than 100 popular copyrighted characters, along with detailed human annotations on potential infringement. Extensive experiments on CopyChars show that the proposed CopyLens can improve detection performance by 5% to 10% compared to recent state-of-the-art methods. This work offers a scalable and automated solution for visual copyright protection and highlights the critical role of prompt engineering.

Abstract:
Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable.

Abstract:
User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and UI composition modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

Abstract:
Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce x^2-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that x^2-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

Abstract:
Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching shows a superior visual quality, but often struggles with both fidelity issues due to its generative nature and efficiency because of its iterative sampling process.In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency.To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency.Compared to the latest retouch methods, like Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity.

Abstract:
Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer.Despite demonstrating strong overall image generation ability, they still underperform on text rendering with blur strokes and disrupt letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail.Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model.Can we improve text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter(\method) that upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch to learn the tiny differences (residual) between the reconstructed image and the ground-truth images in the pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. \method substantially improves text rendering significantly by a large margin. For instance, we boost finetuned Janus-Pro OCR accuracy rises from 24.52% to 58.26% (TextVisionBlend), from 12.75% to 36.81% (StyledTextSynth) on competitive TextAtlas benchmark. Codes will be released.

Abstract:
The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Direction, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.

Abstract:
Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on 5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

Abstract:
Panoramic depth estimation enables a complete 360^\circ understanding of 3D environments but faces significant challenges in generalizing to real-world scenes. While recent zero-shot depth models like Depth Anything achieve remarkable generalization on perspective images, their performance sharply degrades on panoramas due to projection distortions and the lack of spherical geometric awareness. Moreover, collecting large-scale panoramic RGB-D data is costly, hindering the large-scale training of panoramic foundation models. To address these issues, we propose an SO(3)-Equivariant ViT-Adapter, which transfers the powerful zero-shot capability of the perspective pre-trained ViT to panoramic depth estimation by explicitly incorporating a rotation-equivariant inductive bias. Our adapter introduces an SO(3) deformable cross-attention mechanism to effectively align SO(3)-equivariant features with perspective features, enhancing rotational consistency without modifying the ViT backbone. Trained solely on synthetic panoramas, our framework achieves robust zero-shot sim-to-real performance on real indoor benchmarks, including Matterport3D and Stanford2D3D, demonstrating both data efficiency and strong generalization for panoramic depth estimation.

Abstract:
Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic-Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to ~3xinference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.

Abstract:
We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.

Abstract:
Partial Multi-Label Learning (PML) extends the multi-label learning paradigm to scenarios where each sample is associated with a candidate label set containing both ground-truth labels and noisy labels. Existing PML methods commonly rely on two assumptions: sparsity of the noise label matrix and low-rankness of the ground-truth label matrix. However, these assumptions are inherently conflicting and impractical for real-world scenarios, where the true label matrix is typically full-rank or close to full-rank. To address these limitations, we demonstrate that the sparsity constraint contributes to the high-rank property of the predicted label matrix. Based on this, we propose a novel method Schirn, which introduces a sparsity constraint on the noise label matrix while enforcing a high-rank property on the predicted label matrix. Extensive experiments demonstrate the superior performance of Schirn compared to state-of-the-art methods, validating its effectiveness in tackling real-world PML challenges.

Abstract:
Spiking neural networks (SNNs) are promising for neuromorphic computing, but high-performing models still rely on dense multilayer architectures with substantial communication and state-storage costs. Inspired by autapses, we propose TDA-SNN, a framework that reconstructs SNN architectures using a single leaky integrate-and-fire neuron with time-delayed autapses and a prototype-learning-based training strategy. By reorganizing internal temporal states, TDA-SNN can realize reservoir, multilayer perceptron, and convolution-like spiking architectures within a unified framework. Experiments on sequential, event-based, and image benchmarks show competitive performance in reservoir and MLP settings, while convolutional results reveal a clear space--time trade-off. Compared with standard SNNs, TDA-SNN greatly reduces neuron count and state memory while increasing per-neuron information capacity, at the cost of additional temporal latency in extreme single-neuron settings. These findings highlight the potential of temporally multiplexed single-neuron models as compact computational units for brain-inspired computing.

Abstract:
3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes--especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and then a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.

Abstract:
A fundamental yet overlooked limitation of current deepfake detection benchmark is the lack of evaluation frameworks that align technical accuracy with real-world impact. We argue that technical metrics may fail to capture models' actual capacity to mitigate real-world harm, as they treat all errors as equally significant. To bridge this gap, we introduce DeepfakeImpact, a two-stage benchmark that moves beyond pure technical evaluation toward societally-aware assessment. In Stage I, we establish standardized technical baselines by evaluating 33 SOTA detection baslines across 12 widely used datasets. In Stage II, we propose a novel metric (Social Misjudgment Impact, SMI) that quantifies the potential social harm of misclassified videos, and construct a SMI-critical dataset containing high-risk samples. By integrating SMI-aware performance metrics, we shift the evaluation focus from "how accurate" to "how socially beneficial" a detector is. DeepfakeImpact thus provides a more realistic and ethically-grounded foundation for assessing deepfake detectors, urging the community to rethink what truly constitutes progress in this field. All resources will be publicly released at: https://anonymous.4open.science/r/DeepfakeImpact-Stage1-F5EC.

Abstract:
Sign language production aims to generate sign sequences from spoken language, where the generation of sign pose sequences from text is often treated as a significant task. However, due to the differences in grammatical rules and modalities between sign language pose sequences and spoken language text, it is rather challenging to convert text into sign poses (i.e., Text2Pose), while maintaining semantic consistency, motion accuracy and temporal coherence.In this paper, we focus on the Text2Pose task, and propose SignPR, a progressive diffusion framework that jointly models the structural and temporal properties of signing. Structurally, we perform progressive structural refinement: a structural VQVAE encodes each frame into semantic-aware and region-based discrete representations; the diffusion process first produces semantically consistent poses and then progressively refines motion details under text and semantic conditioning. Temporally, we introduce block-wise causal diffusion, which progressively enforces temporal coherence and enables iterative refinement to earlier generated segments, yielding smoother transitions and reduced jitter. Extensive experiments on widely used datasets demonstrate that SignPR achieves superior results compared with prior T2P methods across multiple metrics, producing pose sequences that are semantically faithful, motion-accurate, and temporally coherent.

Abstract:
Multimodal Sentiment Analysis (MSA) aims to comprehensively and robustly interpret human emotions by integrating information from verbal, visual and acoustic modalities. However, the performance of existing models is often hampered by two key challenges: insufficient multilayer semantic extraction inherent to modalities and static feature fusion, leading to low performance. Therefore, this paper proposes a Multi-factor Factor-Decoupling and Semantics-enhanced Fusion Framework for accurate multimodal sentiment analysis. First, each modality is decomposed into three orthogonal subspaces based on a multidimensional information separation mechanism, which is regulated by a contrast constraint for subspace separation, an information gain constraint for maximizing the capture of task-relevant features, and a pairwise constraint for ensuring complementary subspaces. Subsequently, a variational purification strategy is introduced to further ensure the semantic integrity of each sentiment representation. Finally, the fusion module computes the adaptive fusion weights in parallel using multiple orthogonal factors such as sample-level modality saliency, global subspace type importance and feature-level internal attention. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.

Abstract:
Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control.We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.

Abstract:
Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of 3DGS through 3D U-Net to compensate the 3D geometric prior to 2D-oriented language embeddings. Our code will be publicly available in project website.

Abstract:
Adversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (e.g. attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize adversarial object appearance using a BEV spatial feature-guided optimization strategy that attacks the detector's internal representations. Extensive experiments demonstrate that our learned universal adversarial objects can consistently degrade multiple BEV detectors from various viewpoints and distances. More importantly, the new environment-manipulation attack paradigm exposes models' over-reliance on contextual cues and provides a practical pipeline for robustness evaluation in AD systems.

Abstract:
Hardware constraints make it challenging to simultaneously acquire hyperspectral images (HSIs) with both high spatial and high spectral resolutions. A promising solution is to fuse low-resolution HSI (LR-HSI) with high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Recently, diffusion models have introduced possibilities for HSI super-resolution, but suffer from low-efficiency sampling, detail-limited generation, and insufficient denoising. To address these issues, we propose an Edge-aware Multimodal Residual Diffusion Model (EMR-Diff). Specifically, multimodal residual mechanism is introduced to facilitate efficient information transfer among HR-MSI, LR-HSI, and HR-HSI, significantly improving the fusion efficiency. Edge-aware noise strategy is designed by exploiting the edge information of HR-MSI, which guides the model to prioritize high-frequency detail reconstruction by applying stronger noise perturbations to edge regions. In addition, we propose a Bilateral Attention Fusion UNet and design a multi-scale supervision mechanism to enable progressive reconstruction and collaborative optimization of spectral and spatial features. Extensive experiments demonstrate that our method achieves superior performance over existing approaches in both quantitative metrics and visual quality.

Abstract:
Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks--Charades-OV and ActivityNet-OV--that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.

Abstract:
Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation.

Abstract:
We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated in four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiment provides end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, but also investigate underexplored modalities such as interaction forces. The Project Website can be found at https://hoi-dataset.ethz.ch.

Abstract:
As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through an Large Vison-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance in handling complex edits with identity and fine-grained details consistency.

Abstract:
Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.

Abstract:
One-stream Transformer-based trackers achieve advanced performance in visual object tracking suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, a critical limitation persists: no existing work performs pruning jointly across all three critical components--the search region, dynamic template, and static template. This isolation overlooks interdependencies, yielding suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multi-modal and language-guided tasks within a single model. Comprehensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking.

Abstract:
Recent text-to-image (T2I) diffusion models have achieved impressive progress in generating high-fidelity images, yet they often fail to faithfully follow complex user prompts, especially in attribute binding, negation, and compositional reasoning. To address this limitation, we propose PromptEnhancer, a universal prompt rewriting framework that improves prompt interpretability for any pre-trained T2I model. PromptEnhancer is trained through a multi-stage pipeline. We first perform supervised fine-tuning on chain-of-thought-style rewriting data to endow the rewriter with structured prompt analysis and rewriting capabilities. We then introduce AlignEvaluator, a task-specific reward model that provides explicit and fine-grained feedback based on a taxonomy of common T2I failure modes, and further optimize the rewriter with reinforcement learning. This design enables the rewriter to produce prompts that are more precise, semantically complete, and easier for T2I models to follow. To support evaluation, we also construct a comprehensive human-aligned benchmark covering diverse semantic and compositional challenges. Extensive experiments show that PromptEnhancer consistently improves image-text alignment across multiple T2I models and significantly enhances prompt fidelity on challenging cases.

Abstract:
We present Narrative Weaver, a novel framework that addresses a fundamental challenge in generative AI: achieving controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences--a critical limitation for real-world applications such as filmmaking and e-commerce advertising.Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift.To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD)--the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations.Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.

Abstract:
Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.

Abstract:
Tracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video often recorded from static or slowly-moving cameras and cannot easily leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process: First, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained Transformer model fits 3D human motion directly to this spatio-temporal ray cloud in world coordinates. This "lift-then-fit" approach allows to learn and leverage a natural prior over world-space human motion, as well as providing an elegant framework to flexibly incorporate information from multiple, temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric multi-camera setting.

Abstract:
Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry--especially occluded surfaces--from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability.To address this, we propose Visibility Learning, a training paradigm that injects visibility structure and positional inductive bias into the image-to-3D pipeline. Our method comprises two synergistic components: (1) Visibility Grouping (VG), which explicitly partitions VecSet tokens into visible and invisible subsets by exploiting the spatial locality of VecSet VAE decoders; and (2) Visibility-Aware Positional Encoding (VAPE), which assigns shared positional embeddings to image tokens and visible shape tokens to amplify their correspondence, while using distinct encodings for invisible tokens to guide hallucination. By explicitly disentangling visible reconstruction from invisible hallucination, our approach shrinks the effective hypothesis space and provides clear structural guidance for diffusion models. Extensive experiments demonstrate that Visibility Learning accelerates training convergence by up to 4.4xwhile achieving superior generation quality compared to strong VecSet-based baselines.

Abstract:
We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.

Abstract:
Due to the heterogeneity of faces and edges in B-rep, conventional graph-based representations is incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose a B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models.Firstly, we novelly represent both geometry faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. We then construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, with a single latent vector. Afterwards, the same decoder generates node features for all faces and edges through learnable query vectors and cross-attention mechanisms. Finally, a two-stage training strategy ensures effective coupling of geometry and topology throughout. Comprehensive experiments demonstrate that BrepVGAE significantly outperforms existing methods in reconstruction accuracy, topological validity, and generative diversity. This validates the feasibility and efficacy of decoding complete CAD geometric-topological distributions from a unified latent representation, while also offering novel insights for CAD part retrieval and feature recognition domains.

Abstract:
Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6xacceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6xspeed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.

Abstract:
Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic VFX dataset to date, containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io.

Abstract:
Medical visual question answering models often face potential train-test distribution shifts that hinder generalization across unseen imaging and linguistic patterns. To address this challenge, we propose a dual-level confidence based framework (DuCoR) that achieves implicit self-refinement through iterative pseudo-supervised optimization. Instead of relying on fixed pseudo answers, the model progressively refines its predictions by estimating their reliability from two complementary perspectives. A loss-level confidence captures the reliability of supervision by modeling clean and noisy loss distributions, while a feature-level confidence measures the semantic coherence between sample representations and their pseudo-answer conditioned prototypes. Since these two confidences originate from distinct information sources, including the supervision signal and the input semantics, they provide mutually corrective cues. They are adaptively fused to derive per-sample reliability weights that guide pseudo-supervised optimization toward better alignment with the target distribution. Extensive experiments on multiple medical VQA benchmarks show that our method achieves superior performance and exhibits improved cross-domain generalization over fully supervised baseline.

Abstract:
Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.

Abstract:
Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens--[SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.

Abstract:
Image-level Weakly Supervised Semantic Segmentation (WSSS) typically leverages Class Activation Maps (CAMs) for pixel-wise localization. However, existing CLIP-based methods often yield under-activated CAMs, primarily due to the inaccurate semantic relationships in the affinity-based refinement. In this work, we propose a novel framework, CD-CLIP (Class Distribution based CLIP), which addresses this issue by introducing a Class Distribution Aware (CDA) module. The CDA module captures richer semantic relationships by modeling patch-wise distributions across all classes using Jensen-Shannon divergence, thereby enhancing the completeness of CAMs. While this significantly improves the coverage of the foreground class, the over-activation at class boundaries might also exist due to the comprehensive integration of relationships between inter target classes. To mitigate this adverse effect on segmentation supervision, we introduce a Super-class Boundary Exploration (SBE) module, which leverages structural knowledge of DINO to generate boundary-aware super-class prototype CAMs. By employing the boundary-enhanced loss, our SBE module effectively provides accurate boundary supervision for the final segmentation. Our proposed CD-CLIP framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.

Abstract:
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias reinforces short-range continuity but limits the model's ability to capture long-range dependencies, thereby weakening its capacity to enforce global structural properties such as symmetry, geometric consistency, and large-scale spatial regularities. Inspired by the level-of-detail (LoD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Strictly following the baseline experimental setups, empirical results on ShapeNet benchmark demonstrate that PointNSP achieves state-of-the-art (SoTA) generation quality for the first time within the autoregressive paradigm. Moreover, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, under dense generation with 8,192 points, PointNSP's advantages become even more pronounced, highlighting its strong scalability potential.

Abstract:
Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. In this paper, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and context-aware online removal of target concepts. GrOCE constructs dynamic semantic graphs to identify clusters of target concepts and selectively suppress their influence within text prompts. It consists of three synergistic components: (1) dynamic semantic graph construction (Construct) incrementally builds a weighted graph over vocabulary concepts to capture semantic affinities; (2) adaptive cluster identification (Identify) extracts a target concept cluster through multi-hop traversal and diffusion-based scoring to quantify semantic influence; and (3) selective severing (Sever) removes semantic components associated with the target cluster from the text prompt while retaining non-target semantics and the global sentence structure. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on the Concept Similarity (CS) and Frechet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure.

Abstract:
Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation at https://huggingface.co/datasets/ZHNie/MBE2.0. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.

Abstract:
Reading Chinese historical documents across diverse carriers is central to understanding the evolution of Chinese civilization, yet remains labor-intensive and dependent on scarce expert knowledge. Although recent large-scale models show promise on isolated historical collections, they do not systematically probe the fundamental ability to read historical documents across heterogeneous carriers. Therefore, we present MCHDoc, a comprehensive benchmark for reading multi-carrier Chinese historical documents. MCHDoc spans over 3,000 years of history and contains 15,724 high-resolution documents from six major carriers, capturing rich variations in material, layout, etc. Mimicking expert workflows, the benchmark supports page-level and character-level recognition, as well as LLM-based post-correction with and without external knowledge. We systematically evaluate a wide range of large-scale models on MCHDoc. The results show that even top-tier models struggle with multi-carrier historical documents. Furthermore, our analysis highlights several key factors for effectively adapting large models to Chinese historical texts. MCHDoc thus offers a standardized, challenging, and historically grounded benchmark for reading Chinese historical documents and provides a foundation for future research in document analysis and digital humanities.

Abstract:
This paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondence, enabling robust 3D matching. To instantiate GM-R^2, we introduce Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It effectively extends the single-view generation of naive ControlNet to the cross-view via coupled depth-map input design and further remove the latent noise dependency to support geometry-only inference (expected by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point cloud-to-depth mapping that adaptively zooms into the angular region occupies by the narrow-FOV input for dense range-map acquisition. Extensive experiments on 3DMatch and ScanNet datasets verify the superior precision of our GM-R^2, even surpassing supervised methods.

Abstract:
Unified Multimodal Models (UMMs) are redefining the landscape of artificial intelligence by coupling perception and generation across language, vision, and structured reasoning. Yet, despite their growing sophistication, a critical gap persists in evaluation: existing benchmarks largely measure discriminative understanding or unconstrained generation in isolation, overlooking the integrated generative reasoning required for genuine multimodal intelligence. To address this, we introduce GGBench, the benchmark explicitly designed to evaluate geometric generative reasoning--the ability of a model to understand, reason about, and construct a solution within a unified framework. Each instance in GGBench contains precisely aligned natural-language instructions, executable GeoGebra code, and rendered diagrams, enabling deterministic and interpretable verification of a model's reasoning and constructive fidelity. The benchmark comprises 1,411 rigorously curated problems covering eight categories and multiple difficulty levels, resulting in over 7,000 aligned visualizations. We propose a comprehensive tri-modal evaluation protocol that jointly assesses textual planning quality, code executability, and geometric accuracy of generated diagrams through both automated and human-in-the-loop judging. Extensive experiments on both state-of-the-art UMMs and general Large Language Models (LLMs) reveal a large performance gap between end-to-end generation and reasoning-grounded construction. GGBench establishes a new standard for testing multimodal systems that must not only understand but also build, marking a crucial step toward grounded, verifiable generative intelligence.

Abstract:
Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the acitons of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under a offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation task in football. The dataset and benchmark are available at TacSIm.

Abstract:
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query aware scoring to enhance cross-modality correspondence. To our best knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that our BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.

Abstract:
3D Referring Expression Segmentation (3D-RES) aims to segment objects in point clouds according to language descriptions. Unlike common practices in 2D that utilize learnable query embeddings, recent 3D-RES methods typically generate queries directly from 3D points. However, this direct coupling of queries to raw point clouds introduces new challenges: an impractically large number of queries derived from massive point cloud data and a reliance on non-deterministic sampling algorithms. In this paper, we propose a Semantic-based Adaptive Query Network (SAQN), which introduces a novel query strategy for 3D-RES. Instead of generating queries from points, SAQN employs a learnable query vector for each semantic class. This approach drastically reduces the number of queries while maintaining the advantage of avoiding Hungarian matching through implicit class alignment. Additionally, to address potential cross-object ambiguity within semantic classes, we introduce supplementary queries that are adaptively fused with each class query to disambiguate and enrich representations. Comprehensive experiments show that SAQN achieves state-of-the-art performance while reducing the number of queries.

Abstract:
Language is a natural way to command robots, but converting a single instruction into a long-horizon, contact-rich hand-object interaction remains challenging: synthesized references are noisy, human-to-robot retargeting introduces embodiment bias, and fixed-reference tracking lets small errors snowball. We address this with AdaDexTrack, a modulator-in-the-loop framework for language-conditioned manipulation tracking. A distilled generalist tracker serves as the skill carrier, while a tightly aligned modulator performs three feedback corrections: reference modulation (continual adjustment of what to track), object-latent modulation (online adaptation of the object representation to recruit suitable skills), and positional-target modulation (small state-dependent refinements for execution). The tracker is learned via large-scale specialist to generalist distillation on a corpus of language-conditioned hand-object trajectories; the modulator is trained with RL under the same task objective, ensuring tight coupling. Across large-scale evaluations, AdaDexTrack consistently outperforms prior SOTA on unseen-trajectory and unseen-object sets in both average tracking error and success rate, demonstrating robustness and generalization. We further show zero-shot sim-to-real transfer on real hardware, where adding the modulator yields substantial gains over a tracker-only variant. AdaDexTrack reframes language-conditioned dexterous manipulation as modulated tracking, replacing the open-loop, fixed-reference tracking with in-loop modulation that adjusts the reference, object latent, and positional target, yielding drift-resistant execution from noisy text references.

Abstract:
Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive freestyle performers capable of reacting to audio.

Abstract:
Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently "spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a "distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with "look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmark demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of current-frame inference.

Abstract:
Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating--a domain that demands external professional knowledge integration and rigorous step-wise reasoning--existing approaches often struggle. To bridge this gap, we propose SciEducator, an iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.

Abstract:
We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.

Abstract:
Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

Abstract:
Chart Question Answering (CQA) benchmarks are critical for evaluating Multimodal Large Language Models (MLLMs) on visual data reasoning. Existing benchmarks focus mainly on final-answer correctness, ignoring intermediate reasoning steps and the propagation of errors in multi-step processes. To address this, we introduce ChartR, a benchmark designed to assess both the accuracy and robustness of reasoning in chart-understanding tasks. Each question is decomposed into 4-10 sub-questions covering key reasoning types, and each chart includes four visually perturbed variants (blurred, noise-added, watermark-added, annotation-removed) to systematically evaluate robustness. ChartR contains 200 base charts, 800 variants, 1,652 questions, and 8,260 image-question pairs. We further propose a comprehensive evaluation framework with eight metrics that evaluate reasoning-chain accuracy, robustness under visual perturbations, and enable analysis of potential error propagation patterns. Experiments on twelve MLLMs, including general-purpose and chart-specialized models, reveal low reasoning reliability, early-step errors that may propagate, value extraction as the primary bottleneck, and sharp performance drops under perturbations, highlighting reliance on textual cues over true visual understanding.

Abstract:
Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. Recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.

Abstract:
We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module.To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. Then we proposed Human Mixture-of-Experts (MoE), the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position.Based on the results above, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space.Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.

Abstract:
Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet

Abstract:
Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users.Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning.While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data--especially videos or multi-view observations of the same subject--making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail.Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds.We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex--used as a proxy for identity--substantially improves performance on both seen and unseen species.

Abstract:
Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding.Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML.To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, and show through extensive evaluation the task is challenging.Therefore, we then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment.CodeMMR achieves strong generalization across modalities and languages, outperforming competitive baselines (e.g., UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10.Moreover, integrating CodeMMR into RAG enhances code generation fidelity and visual grounding on unseen code generation tasks, underscoring the potential of multimodal retrieval as a core enabler for next-generation intelligent programming systems.

Abstract:
Natural language provides valuable auxiliary information for enhancing visual object tracking. While existing vision-language tracking methods explicitly leverage linguistic descriptions to aid tracking, they suffer from two critical limitations: the inability to dynamically adapt descriptions to the moving target and changing context; and the strong dependency on language input may causes failure when text is unavailable. To address the issues, we design a simple yet effective plug-and-play module that leverages linguistic assistance implicitly, without requiring explicit language input. The proposed textual inversion module converts visual features from template and search regions into text tokens in the CLIP text embedding space. It effectively inverts visual representations into linguistic forms, integrating contextual information from the both template and search region. The linguistic cues are then injected into the visual feature space via a multi-layer semantic injection mechanism. The design enhances the completeness of cross-modal feature representations and the accuracy of inter-modal semantic alignment, thus enabling dynamically updated linguistic information guidance for general object tracking. Extensive experiments demonstrate the effectiveness of our proposed method. We integrate the proposed module into several advanced trackers and evaluate on both visual and vision-language tracking datasets, including MCITrack, DUTrack, and SeqTrack. By training only the newly introduced module and the corresponding decoder, the proposed approach achieves significant performance gains with minimal computational overhead. Code will be made publicly available.

Abstract:
Embodied Multi-Agent Systems have proven highly effective in addressing complex tasks through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, exhibiting frequent changes, uncertainty, and variability. Despite these characteristics, most existing frameworks employ static architectures with fixed agent capabilities and rigid task allocation strategies, which substantially constrain their adaptability to evolving conditions. This inflexibility presents significant challenges to maintaining robust and efficient multi-agent cooperation in dynamic and unpredictable settings.To address these limitations, we propose DRAMA, short for Dynamic Orchestration for Resilient Multi-Agent Ecosystems, tailored for rapidly changing environments. DRAMA adopts a multilayer architecture that incorporates three principal mechanisms: adaptive scheduling through an affinity-driven mechanism, fault-tolerant continuity via hierarchical trust-chain task takeover, and collective spatial intelligence that consolidates distributed observations for predictive reasoning. Together, these components enable event-triggered rescheduling and decentralized fault recovery, ensuring uninterrupted task execution amid agent arrivals, dropouts, or recoveries. Extensive experiments in the embodied VirtualHome-Social environment demonstrate that DRAMA achieves a 7% improvement in runtime efficiency and a 10% increase in throughput compared with state-of-the-art baselines, while maintaining superior stability and robustness under dynamic agent populations.

Abstract:
Video-based spatial reasoning -- such as estimating distances, judging directions, or understanding layouts from multiple views -- requires selecting informative frames and, when needed, actively seeking additional viewpoints during inference. Existing multimodal large language models (MLLMs) consume a fixed set of uniformly sampled frames and cannot request new views once reasoning begins, often missing the geometric cues necessary for reliable spatial judgments. We present EagleVision, a dual-stage framework that combines geometry-aware frame selection with active, Bird's-Eye-View (BEV)-grounded reasoning. In the first stage (macro perception), a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of keyframes that jointly maximize semantic relevance and viewpoint diversity under a fixed token budget. In the second stage (micro verification), the model performs iterative spatial Chain-of-Thought: at each step it can either reason in text or predict a pose on the BEV plane to retrieve the nearest real frame, forming a closed-loop hypothesize-look-verify cycle. The querying policy is trained purely via reinforcement learning with a spatial grounding reward, requiring no human-annotated reasoning traces. On VSI-Bench and SQA3D, EagleVision achieves state-of-the-art performance among open-source vision-language models.

Abstract:
Partial shape matching is a crucial yet underexplored problem in 3D vision, with significant relevance to real-world scenarios where shapes are often only partially observed. Existing feature descriptors face difficulties in this setting, as traditional representations either struggle with the boundaries of partial shapes or heavily depend on the shape's spatial position. While existing approaches have employed DINO features for partial shape matching, these features are not inherently suited for handling partial observations. In this work, we propose a method to refine DINO features using LoRA-based self-supervised learning, enabling the generation of feature descriptors that are robust to partiality. Our features substantially improve performance on partial shape matching compared to traditional or vision foundation features. Additionally, when integrated into existing partial shape matching pipelines, we achieve state-of-the-art results on partial shape matching and left-right prediction benchmarks.

Abstract:
While existing incomplete multi-view multi-label learning methods have achieved promising performance, few studies have focused on the issue of multi-view imbalance. Existing methods using gradient modulation or alternating optimization strategies alleviate this problem but often oversimplify the interaction between views, resulting in persistently poor performance. In response to the challenge, we propose the Cross-view Distillation and Adaptive Masking (CDAM) framework, a novel approach designed to achieve balanced multi-view optimization for the challenging double incomplete multi-view multi-label learning tasks. First, to overcome the performance bottleneck of views, we design a cross-view distillation module. This module aligns low-quality student representations with high-quality teacher representations, thereby effectively mitigating the multi-view imbalance problem. Second, recognizing that distillation may not rectify all low-quality views, we introduce a subsequent adaptive masking module to perform an explicit quality assessment. This module dynamically identifies and masks out any remaining unreliable representations before multi-view fusion, thus preventing low-quality information from corrupting the fused representation. Extensive comparisons with nine state-of-the-art methods on six datasets validate the effectiveness and stability of our method.

Abstract:
Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications.We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels.Extensive experiments show that FlashMesh achieves up to a 2xspeedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.

Abstract:
Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay (S2D), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that S2D significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with S2D achieve up to 7% improved PTQ accuracy on ImageNet under W4A4 quantization and 4% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.

Abstract:
Latent generative modeling has emerged as the dominant paradigm for Diffusion Transformers (DiT), where a pretrained autoencoder compresses image pixels into a latent space to facilitate the diffusion process. Recently, the use of semantic encoders within autoencoders (AEs) has gained attention, yet their influence on image reconstruction and diffusion model training remains insufficiently explored. In this study, we perform an in-depth examination of how semantic encoders shape latent representation learning for the autoencoders. Our findings reveal a fundamental trade-off: while semantic encoders generate latent spaces enriched with visual semantics, their high level of abstraction makes it challenging to capture fine-grained geometric relationships, thereby requiring larger models and longer training for convergence. To address this issue, we build upon recent advances in representation learning that enable the joint modeling of both semantic abstraction and geometric detail. This leads to a Semantic Auto-Encoder (S-AE) that achieves state-of-the-art performance, combining superior reconstruction quality and discriminative capability. Specifically, with S-AE, we are able to provide a unified latent space that achieves 0.06 FID for image reconstruction and 81.9% classification accuracy on ImageNet, set a state-of-the-art benchmark. Codes and model weights will be made publicably available.

Abstract:
Denoising in the sRGB image space is challenging due to large noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

Abstract:
While prior research on Multimodal Large Language Model (MLLM) hallucinations has primarily examined cross-modal inconsistencies in natural images, hallucination over complex graph structures remains underexplored.Concurrently, there is a lack of robust evaluation for fine-grained reasoning integrating structural, visual, and semantic information.To address these gaps, we present DiGraphHal-Bench, the first large-scale Visual Question Answering (VQA) benchmark for evaluating both hallucination phenomena and fine-grained reasoning of MLLMs on real-world directed graphs. DiGraphHal-Bench comprises high-quality procedural graphs from over six distinct domains and is organized around a taxonomy of four high-level capabilities and twelve fine-grained tasks. To ensure benchmark fidelity, we propose a novel two-stage automatic data curation pipeline that reconciles the trade-off between data scale and quality, thereby guaranteeing reliable evaluation.Experiments reveal that state-of-the-art MLLMs hallucinate notably in fine-grained graph reasoning. Although SFT substantially mitigates these hallucinations and strengthens complex reasoning, performance remains far from optimal. Ablation studies highlight the importance of fundamental capabilities for integrative reasoning, and our benchmark provides a foundation for advancing robust multi-modal graph understanding.

Abstract:
Estimating dense three dimensional motion in dynamic high speed scenes remains challenging due to motion blur, illumination variation, and the limited temporal resolution of conventional cameras. We introduce ARES, a unified framework for Asymmetric RGB-Event Stereo that addresses these issues through a hybrid setup where an event camera captures fine grained temporal dynamics and an RGB camera provides rich spatial structure. To integrate these heterogeneous modalities, we propose Multimodal Contextual Attention, a transformer based fusion mechanism that attends to spatial and temporal contexts under cross view constraints and forms a unified correspondence space for disparity and optical flow estimation. Building on this shared representation, we introduce Temporal Disparity Posterior Fusion, a probabilistic framework that models the evolution of disparity posteriors to infer disparity change and recover metrically coherent scene flow. Trained with sparse supervision and dense self consistency cues, our ARES achieves geometrically consistent and temporally stable three dimensional motion estimation across diverse driving scenarios. Experiments show that ARES attains state of the art performance in scene flow estimation among RGB-event stereo methods, establishing a principled path toward unified asymmetric multimodal stereo sensing. Code available at the project website.

Abstract:
Human action detection in videos requires both semantic recognition and accurate modeling of motion. While recent video foundation models have advanced visual semantics, they still struggle to capture complex and compositional actions due to the limited representation ability of motion. Human skeleton sequences, which explicitly describe the body structure and movement, provide valuable physical and geometric motions that complement RGB videos. However, combining video and skeleton modalities faces two key challenges: (i) label-driven skeleton features are too coarse to describe fine-grained motion, and (ii) skeleton motion and RGB video lie in heterogeneous feature spaces, so current fusion strategies often cause feature interference. To address these, we propose MoVie, a unified Motion-Video processing framework that uses structured human motion as a bridge between the two signals. We first propose a Structural Motion Projection module that decomposes motion into primitive components using a learnable motion dictionary, to produce fine-grained descriptors. Then, we design a Motion-guided Feature Regularization mechanism that aligns visual features with motion through an orthogonality-based transformation, so that fine-grained motion cues can guide visual representations without collapsing semantic diversity. Extensive evaluations on Toyota Smarthome Untrimmed, Charades, Multi-THUMOS and PKU-MMD datasets demonstrate that MoVie significantly improves state-of-the-art action detection performance.

Abstract:
Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal variants of VGGT address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.

Abstract:
The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. However, existing softmax scores-based filtering methods tend to be affected by the overconfidence issue in neural networks, leading to the inclusion of incorrect pseudo-labels that negatively impact the training process. In this paper, we propose a novel evidential learning framework to explicitly model the prediction uncertainty for reliable pseudo-label selection. By modeling the distribution of class probabilities using Dirichlet distributions, we obtain principled and improved uncertainty estimates from a distributional perspective. Furthermore, we propose HESS (Hyper-ESS), decoupling the modeling of exclusive and collective evidence for comprehensive evidence perception, to yield more accurate uncertainty estimates. Extensive experiments on three challenging benchmarks demonstrate that integrating HESS into existing semi-supervised semantic segmentation frameworks consistently improves performance, benefiting from more reliable pseudo-label selection. Our work sheds light on the potential of evidential learning in semi-supervised semantic segmentation and opens up new avenues for future research.

Abstract:
In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss.This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby _unbinding_ the number of predicted primitives from input image resolution and number of views. Our resulting method, __TokenGS__, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.

Abstract:
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution (>1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves state-of-the-art image similarity performance while enabling faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Abstract:
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) AD, where an autonomous vehicle must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

Abstract:
360deg panoramic images are increasingly used in VR, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, attribute comparison) grounded in accurate 3D annotations--depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward combining five geometry-aware strategies (e.g., distance tolerance, spatial consistency). A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (T/F, MCQ), and Stage 2 fine-tunes on mixed OE data for generalization. Our 7B model sets a new SoTA performance, improving total accuracy to 52.93% (+3.59%) and OE accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.

Abstract:
Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also failed to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies.To address this issue, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual feature with patch-level prompts while the pre-trained DiT provides global textual feature with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches.We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures.Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results.

Abstract:
With the advancement of Vision-Language Models (VLMs), employing VLM-as-a-Judge for visual evaluation has become a widely adopted metric in vision research. However, existing VLM-as-a-Judge approaches suffer from biased scoring outcomes with low discrimination and lack the capacity for unified multi-attribute compositional assessment. To address these limitations, we propose a novel training paradigm, termed JoPPO (Joint Probabilistic Policy Optimization) that enables the VLMs to learn ranking under compositional assessment constraints. We evaluate the JoPPO on image aesthetics as a testbed, a task requiring nuanced understanding of multiple attributes including composition, lighting, color and geometry. Training follows two stages: (1) Supervised Fine-Tuning (SFT) on synthetic composition dataset provided by automated data generation pipeline to instill compositional priors; and (2) Contrastive Joint Conditional Probabilistic Reinforcement Learning: building upon the GRPO algorithm, we introduce JoPPO, which compute reward based on the expected win rate of total scores derived from the conditional distribution of fine-grained attribute scores within batches, effectively enhancing the model's discriminative ability in composite evaluation. Across standard aesthetic benchmarks, our method achieves consistent improvements in ranking consistency, demonstrating strong zero-shot generalization.

Abstract:
Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning and region-set captioning. We also introduce a new trace captioning task that further demonstrates the effectiveness of patch-wise semantic representations for flexible caption generation. Data and code are available at https://paciosoft.com/Patch-ioner/ .

Abstract:
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks--text-to-layers, image-to-layers, and layers-to-layers--within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10~100xfaster inference and saving a 50%~90% activation GPU memory consumption during image-to-layer inference.

Abstract:
In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting.In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower-dimensional space, yielding a provably optimal forgetting operator.We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and feature-level unlearning variants under a distillation scheme (POUR-D).Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics. Code will be released upon acceptance of the paper.

Abstract:
Depth map super-resolution with color guidance is a fundamental task in computer vision that aims to reconstruct high-resolution depth maps by leveraging structural correlations from corresponding guidance images. Recently, with the development of deep learning techniques, the performance of guided depth super-resolution (GDSR) models has been significantly improved. However, most existing approaches rely on black-box architectures that lack theoretical interpretability. Although graph optimization has been explored to integrate model-driven and data-driven frameworks, it remains computationally expensive and struggles to preserve the intrinsic structures of the depth maps. To overcome these limitations, we propose a novel GDSR framework based on a dual graph Laplacian prior, termed LapNet, which efficiently unfolds graph optimization into a deep neural network. Specifically, we first formulate a dual graph Laplacian prior that separately models structural dependencies along the row and column dimensions of the depth maps. This formulation explicitly enforces piecewise smoothness while reducing computational complexity from O(H^3W^3) to O(H^3 + W^3) by avoiding the construction of global affinity graph. Furthermore, we develop a deep implicit prior to extract high-frequency structural cues from the guidance image, serving as a complementary component to the manually designed prior. Finally, we integrate these complementary priors into a unified variational optimization framework, which is efficiently solved through alternating minimization and subsequently unfolded into an interpretable multi-stage deep network. Extensive experiments on both synthetic and real-world datasets demonstrate that LapNet achieves state-of-the-art performance while maintaining low computational complexity.

Abstract:
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions--characterized by globally sparse coverage, blurred background, and distorted high-frequency areas.To address this, we propose HeroGS--Hierarchical Guidance for Robust 3D Gaussian Splatting--a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions.The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality.Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.

Abstract:
In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.

Abstract:
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully build an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks.

Abstract:
Vision-Language Pre-training (VLP) models, while achieving state-of-the-art performance on various multimodal tasks, exhibit significant vulnerability to multimodal adversarial examples. In black-box attack scenarios of VLP models, a key challenge lies in the limited transferability of these adversarial examples. Existing methods to enhance transferability often suffer from an excessive dependence on the source model and a reliance on limited and fixed transformation techniques. To overcome these limitations, we propose a novel Transform to Transfer Attack (TTA) method. Our approach introduces a learnable transformation mechanism that adaptively selects optimal combinations of transformations to maximize input diversity, and incorporates integrated gradients to mitigate over-reliance on the source model, thereby refining the attack optimization process. Extensive experiments demonstrate that TTA achieves outstanding attack performance in downstream tasks, outperforming current state-of-the-art attack methods across different VLP architectures.

Abstract:
Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.

Abstract:
Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks. The code is available.

Abstract:
The rapid development of large vision-language model (VLM) has greatly promoted the research of GUI agent. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors.

Abstract:
Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision-language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.

Abstract:
Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints of hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of our ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.

Abstract:
Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task--which lacks suitable benchmarks--we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization. The MC-Sat dataset and Wrivinder codebase will be publicly released.

Abstract:
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 surpasses Qwen2.5-VL 7B by +2.8, using just 46.7% GPU memory and 8.7% inference time.

Abstract:
End-to-End (E2E) Reinforcement Learning (RL) for autonomous driving still struggles with safety and generalization under distribution shift, as perception-heavy encoders, sparse rewards, and ad hoc uncertainty handling yield brittle closed-loop behavior. This work introduces a unified Deep RL (DRL) framework built around a control-layer reliability interface where the same uncertainty signal informs relational attention, gates policy entropy, and regularizes transfer alignment. An ego-centric relational graph encodes agent influence via uncertainty-weighted attention over kinematics, lane geometry, and semantics, producing a compact control state. A multi-objective differentiable reward shapes safety, progress, and comfort with an uncertainty term. Aleatoric and epistemic uncertainty, captured through per-edge heteroscedastic variance and a critic ensemble, modulate policy entropy for risk-aware exploration. A causal-semantic transfer objective aligns actions, attention, and uncertainty statistics across domains with meta-learned initialization for few-shot adaptation. In closed-loop urban driving across varied towns, traffic, and weather, the framework improves success rate, reduces infractions, and achieves higher time-to-conflict combined with lower lateral deviation over strong baselines.

Abstract:
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and quality: high-resolution image generation either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (LacTok) that bridges discrete visual tokens with the compact latent space of pretrained latent diffusion models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens--a 16xcompression over VQGAN. LacTok integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies; thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. To endow LacTok with text-to-image generation capabilities, we seamlessly integrate it with an autoregressive transformer, forming LacTokGen. This transformer is trained by predicting compact token sequences conditioned on text instructions. Experiments demonstrate LacTok's superiority in high-fidelity reconstruction, with 10.8 reconstruction Frechet Inception Distance on MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. LacTokGen achieves 0.73 score on GenEval benchmark and 0.304 HPSv2 on MSCOCO-2017 dataset.

Abstract:
Prototype-based methods enhance interpretability in image recognition by establishing intermediate part-level prototypes to build interpretable classifiers, enabling transparent decision-making through localized evidence and reference to prototypical examples. However, most existing methods typically depend on unimodal visual supervision and constrain prototypes within the visual embedding space, which inherently restricts their alignment with language-conditioned semantic representations. In this paper, we present PRISM (Prototype-based Reasoning with Inter-modal Semantic Mining), a framework for interpretable image recognition that leverages natural language as an auxiliary modality to guide the learning of class-specific part prototypes. PRISM introduces an information-theoretic attribution mechanism to extract semantically relevant image regions conditioned on textual descriptions. By aligning attribution maps with prototype activation patterns, PRISM implicitly anchors visual part prototypes to semantically meaningful image regions without requiring explicit concept annotations. To promote better localization and reduce redundancy among prototypes, we introduce a spatial compactness constraint that encourages each prototype to attend to non-overlapping and spatially concentrated regions. Experiments on multiple fine-grained benchmarks demonstrate that PRISM not only improves classification performance but also provides faithful and semantically grounded visual explanations.

Abstract:
Despite recent progress in 3D visual grounding, existing methods still struggle with three core challenges: 1) cross-modal misalignment that prevents textual cues from being reliably delivered to visual representations, 2) intra-class confusion arising from insufficient understanding of fine-grained expression cues, and 3) geometric reasoning errors caused by inaccurate aggregation of spatially relevant visual features. We propose EG-3DVG, a unified framework that addresses these issues through an expression and geometry aware grounding decoder. The decoder integrates two complementary attention modules--position-guided expression cross-attention (PECA) for reliable text-vision alignment and geometry-aware masked attention (GMA) for selective aggregation of geometry-consistent visual cues. To further distinguish semantically similar instances, we introduce expression-aware contrastive learning (ECL), which strengthens the alignment between the target object token and expression-relevant words. Extensive experiments on ScanRefer and SR3D/NR3D demonstrate that EG-3DVG achieves state-of-the-art performance in both 3D bounding box localization and mask prediction, validating the effectiveness of our geometry- and expression-aware design.

Abstract:
Large Multimodal Models (LMMs) have shown remarkable success in visual understanding tasks. LMMs encode visual and textual inputs into tokens, which are then processed by Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited. Importance-based approaches suffer from poor generalization, are incompatible with kernel-level inference optimizations, and only consider information from a single modality. Diversity-based strategies typically focus on pairwise token redundancy and treat all tokens as equally important. Recent attempts to sequentially combine importance and diversity criteria still fail to address the intrinsic drawbacks of their underlying metrics. To address these limitations, we reformulate visual token reduction as an optimal subset selection problem jointly guided by two complementary objectives: informativeness and coverage. Informativeness is quantified through per-token intrinsic saliency and visual-textual alignment, while coverage is enforced via a volume-based subset selection criterion that ensures global representativeness in the visual feature space. This joint formulation effectively integrates visual saliency, cross-modal alignment, and global coverage in an end-to-end token selection process, yielding a computationally efficient, model-agnostic framework compatible with modern inference accelerators. Extensive experiments demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance.

Abstract:
Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties--directional alignment and covariance-aware magnitude--yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.

Abstract:
Automatic vision inspection holds significant importance in industry inspection. While multimodal large language models (MLLMs) exhibit strong language understanding capabilities and hold promise for this task, their performance remains significantly inferior to that of human experts. In this context, we identify two key challenges: (i) insufficient integration of anomaly detection (AD) knowledge during pre-training and (ii) the lack of technically precise and context-aware language generation for anomaly reasoning. To address these issues, we propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. ADSeeker first leverages a curated visual document knowledge base, SEEK-MVTec&VisA (SEEK-M&V), which we construct to address the limitations of existing resources that rely solely on unstructured text. SEEK-M&V includes semantic-rich descriptions and image-document pairs, enabling more comprehensive anomaly understanding. To effectively retrieve and utilize this knowledge, we introduce the Query Image-Knowledge Retrieval-Augmented Generation (Q2K RAG) framework. To further enhance the performance in zero-shot anomaly detection (ZSAD), ADSeeker leverages the Hierarchical Sparse Prompt mechanism and type-level features to efficiently extract anomaly patterns. Furthermore, to tackle the challenge of limited industry anomaly detection (IAD) data, we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA), encompassing 72 multi-scale defect types across 26 categories. Extensive experiments show that our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.

Abstract:
We present MeshFlow, a new method for compressing and generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh connectivity, which, however, scales poorly due to the inference cost being quadratic in mesh size. AR methods also require discretizing the vertex coordinates, which introduces quantization errors and can cause vertex collapse. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space.This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified-Flow transformer, which generates all mesh vertices and edges in parallel. This model samples meshes 26xfaster than the fastest AR generator while also achieving state-of-the-art accuracy across standard mesh-generation metrics.

Abstract:
Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.

Abstract:
Hallucination limits the reliability of multimodal large language models (MLLMs), and it is particularly damaging in video where errors manifest as distorted narratives rather than single-frame mistakes. We introduce a frame-first study of Chimera Hallucination: model stitches visual segments that exist in space and time but do not belong to the same event chain, producing a spurious continuous story. We introduce CH-Risk, a single-forward, reference-free risk estimate tailored to this failure mode. CH-Risk combines two complementary signals: SegCoverage@\alpha (\mathrm SCR @\alpha) measures how many event segments are needed to cover most text-to-frame support, exposing long-range stitching; Alignment with Early Temporal Pathway (AETP) measures rank consistency between support and the temporal pathway formed in early-middle layers, exposing stage mismatch. To turn risk into correction, we further propose CH-M(itigation), a train-free two-stage intervention. Segment-aligned Stage-Aligned Frame Routing (sSAFR) re-weights frames before the mid-layer softmax to route attention toward a small set of pathway-aligned segments. Residual Token Calibration (RTC) then stabilizes token usage within selected segments. Extensive experiments across 9 benchmarks and 6 VideoLLMs show that CH-Risk can predict Chimera and that CH-M consistently reduces hallucination and improves task accuracy with negligible overhead (sub-5% latency, sub-2.5% memory, \~1% FLOPs).

Abstract:
We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10x lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction. Our project page is: functionalscenes.github.io.

Abstract:
Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

Abstract:
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can perform quantitative physical reasoning tasks. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from videos. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs when reasoning about objects' kinematic properties. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.

Abstract:
Omnimodal large language models (OmniLLMs) have attracted increasing research attention of late towards unified audio-video understanding. However, the high computational cost of processing longer joint audio-video token sequences has become a key bottleneck. Existing token compression methods have not addressed the emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates model inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive results demonstrate the merits of OmniZip: it achieves a 3.42x inference speedup and a 1.4x memory reduction over other top-performing counterparts, while maintaining the performance of OmniLLMs without training.

Abstract:
Video diffusion models (DMs) have enabled high-quality video synthesis, but their computation costs scale quadratically with sequence length due to the nature of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention demands costly pretraining. This is largely because linear attention lacks sufficient expressiveness and struggles with the complex spatiotemporal dynamics inherent to video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose a selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and even inefficiency of existing objectives in optimizing this challenge transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is highly efficient and recovers model performance. Extensive experiments show that LinVideo achieves a 1.43\text -1.71x speedup while preserving generation quality, and the 4-step distilled models further reduce latency by 15.9\text -20.9x with only a minor drop in visual quality.

Abstract:
Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone.We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance.We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin.Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher

Abstract:
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.

Abstract:
Humans perceive the 3D world from limited 2D observations. While recent feed-forward generalizable 3D reconstruction models can recover structures from sparse images, they typically represent only observed regions, leaving unseen geometry unmodeled. This raises a fundamental question: Can we infer complete 3D structure from partial observations? We present RnG (Reconstruction and Generation), a feed-forward Transformer that unifies reconstruction and generation by predicting an implicit, complete 3D representation. RnG introduces a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level while treating the KV-cache as an implicit 3D representation. Arbitrary poses can efficiently query this cache to render high-fidelity RGBD outputs. RnG accurately reconstructs visible geometry while generating plausible unseen structures and appearance. It achieves state-of-the-art performance in generalizable 3D reconstruction and novel view generation, while being efficient enough for real-time interactive applications.

Abstract:
Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated scene flow module further predicts temporal offsets between adjacent frames, enforcing temporal occupancy coherence during dynamic scene reconstruction. Moreover, we propose a radar-specific power rendering formulation grounded in radar sensing physics, improving both synthesis accuracy and interpretability. Extensive experiments on public radar datasets demonstrate that RF4D substantially outperforms existing methods in radar measurement synthesis and occupancy estimation accuracy, with particularly strong gains in dynamic outdoor environments.

Abstract:
Vision-Language Models (VLMs) like CLIP exhibit extraordinary out-of-distribution (OOD) generalization, while the theoretical foundations underlying this robustness remain largely unexplored. This work establishes a connection between CLIP and Invariant Risk Minimization (IRM), the principled paradigm to overcome OOD problems, through token-level causal representation learning. Our key insight is that CLIP's contrastive objective, when optimally trained, recovers modality-invariant causal factors at the word-and-phrase granularity. By decomposing text prompts into class-specific tokens (causal factors) and class-agnostic context tokens (environmental factors), we prove that a vocabulary-constrained InfoNCE objective becomes formally equivalent to IRM's invariance criterion. Grounded in this equivalence, we propose a mid-training paradigm aiming to inject invariant learning signals into pre-trained CLIP without architectural modification, yielding CLIP-IRM with superior OOD performance. We further extend this causal alignment to multimodal reasoning via using CLIP-IRM's invariant alignment scores as process-level rewards in reinforcement learning, effectively transplanting IRM's guarantees to robust sequential decision-making in Multimodal Large Language Models. Extensive experiments validate our theoretical framework and present substantial improvements in both multimodal OOD understanding and reasoning tasks.

Abstract:
Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

Abstract:
We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.

Abstract:
Metric depth estimation aims to recover depth maps with absolute scale, high resolution, and cross-scene consistency from visual observations. Existing approaches either rely on large-scale models or costly sensors to preserve metric accuracy and generalization, both ill-suited to resource-constrained deployment. In this paper, we propose LiteSense, a lightweight RGB-ToF fusion framework that leverages compact normalized histogram (CNH) signals together with RGB cues to achieve efficient and reliable metric depth estimation. Specifically, LiteSense leverages a U-Net-style encoder-decoder that forms an RGB-D input by concatenating RGB with upsampled ToF depth, providing explicit metric priors. To address resolution disparity and recover fine details, we introduce the Patch-wise CNH Spatial Injection (PCSI) module, which leverages zone-wise histogram measurements via cross-attention to guide high-level feature fusion. Extensively evaluated on NYUv2 and SUN RGB-D, LiteSense consistently outperforms monocular baselines and DELTAR with substantially lower computational cost, and demonstrates promising zero-shot generalization. We further introduce THDR3K, the first indoor RGB-ToF-CNH dataset, where LiteSense achieves real-world accuracy comparable to--and in challenging cases surpassing--Intel RealSense. All the relevant source codes and the collected dataset are available at GitHub.

Abstract:
Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.

Abstract:
Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce a Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-pages documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as an ideal foundation for their implementation.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks.Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations.This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers.To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM.Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities.Code will be publicly available upon publication.

Abstract:
We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. Especially, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and ImgEdit benchmarks that surpass the state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.

Abstract:
The pursuit of autonomous agents with predictive cognitive world models is hindered by a fundamental flaw in current vision-language models (VLMs): they lack cognitive inertia. Operating on isolated snapshots, these models cannot form a temporally coherent world view, leading to erratic decision jitter and a failure to execute complex, multi-step maneuvers. To remedy this, we introduce CogDriver, a framework designed to build a coherent world model by instilling this crucial cognitive property. Our work makes two key contributions: (1) We present CogDriver-Data, a large-scale vision-language-action dataset whose narrative annotations provide the supervisory signal for learning the temporal dynamics of a world model. (2) We develop the CogDriver-Agent, an architecture featuring a sparse temporalmemory to maintain a stable internal state, the foundation of a world model. This is enabled by a spatiotemporal knowledge distillation approach that explicitly teaches decision coherence. Comprehensive experiments validate our paradigm: CogDriver-Agent achieves a 22% increase in the closed-loop Driving Score on Bench2Drive and a 21% reduction in mean L2 error on nuScenes, establishing a new state-of-the-art. These significant gains in both long-term decision-making and imitation accuracy provide strong evidence that our agent is developing a more stable internal world model.

Abstract:
The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

Abstract:
Autonomous object detection in remote sensing requires systems that can discover new categories and assign them usable labels during deployment. Existing Open-World Object Detectors identify unknown objects but leave them unnamed until manual annotation. In contrast, Open-Vocabulary Detectors recognize unseen categories only with provided prompts at test time, lacking autonomous discovery or naming. This work presents HSGDet, a detector that achieves both discovery and semantic assignment at test time without external prompts. This method introduces DHGA that navigates a hierarchical semantic graph to perform scene-conditioned coarse-to-fine classification of detected objects. It leverages spatial co-occurrence patterns from surrounding scene context to produce classification confidence scores. High-scoring regions are identified as known objects, while low-scoring regions are flagged as unknown detections. Unknown regions pass to CR2T, which synthesizes text embeddings by fusing visual features, hierarchical parents, and scene context, enabling prompt-free labeling and vocabulary expansion. This approach enables prompt-free semantic labeling and supports autonomous vocabulary expansion without requiring external models. Results demonstrate that HSGDet outperforms state-of-the-art methods by a large margin of 6.6 points in Known mAP and 9.9 points in Unknown Recall. It also reduces Wilderness Impact by 36%, enabling scalable and autonomous aerial monitoring.

Abstract:
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual Chain of Thought with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent improves existing OVOD baselines and outperforms prior methods on novel categories, demonstrating strong generalization and scalability.

Abstract:
The Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design.

Abstract:
Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections---a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections---where SH fail---we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.

Abstract:
Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed.To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion.The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.

Abstract:
Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompt by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification.

Abstract:
This paper proposes the Continual Dual-Critic with Cross-Attention (CD-CCA) framework for visual reinforcement learning to address the plasticity-stability conflict. Our method introduces continual learning techniques into the visual RL architecture, constructing two complementary critics using Continual Backpropagation (CBP) and Elastic Weight Consolidation (EWC) -- one for maintaining representational plasticity for rapid environmental adaptation, and the other for preserving knowledge stability to prevent catastrophic forgetting. Furthermore, we design a cross-attention based fusion mechanism that balances the value estimates from the dual critics according to observation characteristics. Experimental results on DeepMind Control and CARLA benchmarks show that CD-CCA effective mitigates issues of representation drift and policy degradation. Compared to existing visual RL methods, our approach exhibits enhanced robustness and adaptability in non-stationary environments and long-horizon decision-making tasks, providing a new architectural paradigm for the advancement of continual reinforcement learning.

Abstract:
Continual Learning (CL) enables Vision-Language Models (VLMs) to acquire new capabilities while retaining prior knowledge, for example, by employing task-specific adapters. Existing CL approaches typically optimize these adapters to convergence, often with (near-)orthogonality constraints to reduce interference; however, isolating adapters in orthogonal subspaces can suppress cross-task transfer and sharing. To address this problem, we provide a new perspective based on PAC-Bayesian analysis: once the per-task optimization has converged, adapters should be further shaped to satisfy \underline P hase-like tr\underline A nsition \underline C ons\underline T raints (PACT) -- a two-part formulation that (i) specifies a phase-like transition relation among adapters and (ii) imposes explicit constraints that enforce this relation. Under PACT, adapter dynamics resemble the phase transition of water: the system gravitates toward either a "frozen" (history-preserving, tightly constrained) or a "melted" (task-adaptive, free) regime, while moving between them smoothly rather than via hard thresholds. We operationalize PACT by coupling stability and plasticity regularizers within a two-branch Vision Transformer (ViT), seeding adapters with a Stable Adapter Initialization (SAI), and introducing a Prior Anchoring (PA) mechanism, thereby inducing phase-like adapter dynamics. Across diverse CL settings, PACT surpasses state-of-the-art methods while reducing the number of trainable parameters by 36.96% relative to standard adapter-based baselines. Our code will be released publicly.

Abstract:
Multi-modal agents are making rapid progress on general computer-use tasks. However, existing benchmarks remain largely confined to web browsers and rudimentary applications, failing to capture the professional software workflows that dominate real-world scientific and industrial practices. To bridge this gap, we introduce ProSoftArena, a comprehensive benchmark and platform specifically designed for evaluating multi-modal agents in professional software environments. We establish the first five-level capability hierarchy for professional software manipulation, and curate a benchmark of 456 realistic tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable assessment, we build an executable real-computer environment with an execution-based evaluation framework, and uniquely incorporate a human-in-the-loop evaluation paradigm to quantify agents' collaborative efficiency. Extensive experiments show that even the best-performing agent achieves only a 20.6% success rate on software-level tasks (L2) and completely fails on multi-software workflows (L3). Our in-depth analysis further provides valuable insights in current agent limitations and suggests effective design principles for building more capable agents in professional software settings. This project is available at: https://prosoftarena.github.io.

Abstract:
Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.

Abstract:
Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity.We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability.

Abstract:
While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.

Abstract:
In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

Abstract:
Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack sur-face. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first backdoor attack that leverages the LoRA mechanism to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of "trigger word-target image" pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; other-wise, it behaves indistinguishably from the clean model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.

Abstract:
Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions--multiview, multi-illumination, polarization, reflectance separation, and material attributes--yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes.

Abstract:
Growing privacy and security demands in the real world have spurred interest in adversarially robust Federated Learning (FL). While Adversarial Training (AT) is a well-established defense in centralized learning, its extension to the federated setting, known as Federated Adversarial Training (FAT), faces significant challenges due to data heterogeneity across clients. Existing FAT methods have made significant contributions, but they typically assume a balanced global data distribution, an assumption that rarely holds true in practice due to the prevalence of long-tailed distributions. This work first identifies and diagnoses the severe performance degradation of FAT under long-tailed data, attributing it to skewed feature representations and impaired classifier discriminability. To address this, we propose FedCART, a novel FAT framework that decouples the model into a shared feature extractor and a dual-classifier structure. On the client side, a representation-alignment loss enhances adversarial robustness, while gradient-based class prototypes are extracted for classifier calibration. On the server side, models and prototype sets are aggregated to synthesize balanced virtual features, enabling the re-training of an auxiliary classifier to mitigate long-tailed bias. Extensive experiments demonstrate that FedCART significantly improves both accuracy and robustness, outperforming state-of-the-art FAT methods. To the best of our knowledge, this is the first work to systematically investigate and address FAT under long-tailed distributions, representing a significant step toward practical adversarial robustness in FL.

Abstract:
Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

Abstract:
Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging VFI benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion.

Abstract:
Existing methods achieve high-quality facial albedo capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial albedo capture from a smartphone video recorded in the wild. To disentangle high-quality albedo from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. We first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for the albedo map and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Other reflectance maps are then predicted from the albedo. Our method achieves significantly better results than prior arts in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.

Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure both fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with post-hoc filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six geometric benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).

Abstract:
Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2 % and 77.1 %.

Abstract:
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.

Abstract:
Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions--primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

Abstract:
The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.

Abstract:
Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the view-dependent "shiny" appearance of reflective surfaces in real time, surpassing the limits of prior 3DGS-based inverse rendering methods.

Abstract:
Heterogeneous reconstruction in cryo-electron microscopy (Cryo-EM) is fundamental for understanding macromolecular structural diversity, yet remains challenging due to extreme noise, continuous conformational changes, and ambiguous image-to-structure mappings. Existing neural approaches often rely on encoder--decoder pipelines or fixed codebooks, which can be computationally demanding or struggle with complex heterogeneity. We propose CryoKRAQEN, a decoder-only framework that integrates triplane implicit representations with kernel-guided latent assignment and quantized embeddings to improve stability and structural discrimination. The method avoids encoder dependencies and mitigates collapse during training, enabling accurate modeling of both conformational and compositional variations. Across diverse Cryo-EM benchmarks, CryoKRAQEN delivers competitive performance, robust reconstructions, and interpretable latent organization compared to state-of-the-art neural and classical methods.

Abstract:
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporal consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching, and adaptively aggregates alignment results across phases for final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrated the superiority of the proposed method.

Abstract:
Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for 3D mesh generation post-training efficiency improvement, which is 3.75xfaster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.

Abstract:
With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

Abstract:
Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (\text R ^2VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train \text R ^2VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that \text R ^2VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available at https://huggingface.co/collections/zhangyuelin/r2vlm.

Abstract:
Vision Transformers often rely on spurious background correlations rather than foreground object features. While prior model pruning approaches focus solely on improving accuracy, they lack interpretability and fail to verify whether predictions are actually made by focusing on the main foreground object, providing no causal validation of which components drive spurious behavior. We introduce Causal Information Gain Mechanistic Attribution (CIGMA), a general framework for explaining the internal computation of Vision Transformers. CIGMA provides a mechanistic, information theoretic explanation by quantifying the importance of each attention head and determining whether it supports the main object or routes spurious background cues. It ranks attention heads by measuring object versus context reliance with Jensen Shannon based information gain computed from the model's full predictive distributions after two complementary edits, removing the object region and removing the surrounding context, which reveals a spurious subnet that carries background signals and a complementary set of evidence aligned heads. Evaluated on CIFAR-10, CIFAR-100, and Tiny-ImageNet across three VLM architectures (InternVL2-26B, LLaVA-1.6, LLaVA-1.5-13B), CIGMA improves accuracy by 7.6 to 24.8 percentage points over unmodified models while reducing background reliance by 79.5% to 88.1%, substantially outperforming all baselines, demonstrating that causal head-level interventions enable more effective spurious correlation mitigation than token pruning or retraining approaches.

Abstract:
Multimodal tiny object detection plays a critical role in real-world applications, yet remains highly challenging due to weak target representations and complex cross-modal interference. Existing frequency-domain methods for tiny object detection are still largely limited to the visible modality and overlook complementary cross-modal frequency cues in multimodal scenes. In this paper, we investigate cross-modal frequency learning for RGBT tiny object detection. Through frequency characteristic analysis, we find that tiny objects in both RGB and infrared modalities contain richer mid- and high-frequency components as object size decreases. Motivated by this observation, we propose a Dynamic Frequency-Decoupled Cross-Modal Learning Transformer (DyFCLT). Specifically, DyFCLT introduces a Dynamic Frequency-Band Decoupled Cross-Modal Attention (DFCA) mechanism to perform fine-grained cross-modal interaction across dynamic frequency sub-bands, and a Selective Smoothing Enhancement (SSE) module to suppress background noise and enhance foreground responses during multi-scale fusion. Extensive experiments on two RGBT tiny object detection benchmarks and one general-scale benchmark demonstrate that DyFCLT achieves state-of-the-art performance with strong generalization across different scales and scenes.

Abstract:
Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minimal visual cues, making it difficult for existing methods to reconcile global semantic modifications with fine-grained structural control while preserving overall coherence. We present SketchAssist, an interactive sketch assistant that unifies instruction-guided editing with line-guided region redrawing, enabling efficient and controllable sketch manipulation while preserving overall composition. To support this task, we introduce a controllable data generation pipeline that constructs structured edit sequences with precise attribute variations and maintains structural alignment across multi-step modifications, while expanding stylistic diversity via style-preserving transformations. Building on this data, SketchAssist adopts a unified framework based on DiT, using a multi-channel input representation to encode sketches, masks, and guidance signals within a single interface. To further handle different editing modes, we integrate a Task-guided Mixture-of-Experts (T-MoE) into LoRA layers, enabling adaptive control over semantic and structural guidance. Extensive experiments demonstrate state-of-the-art performance on both tasks, achieving strong instruction adherence and improved structural and style consistency compared to recent methods. Together, our method provide a practical and controllable solution for sketch editing.

Abstract:
This paper studies the problem of universal test-time prompt learning for vision-language models (VLMs) which aims to enhance prompt learning for a pre-trained VLM via unlabeled target data containing out-of-distribution (OOD) samples. However, existing test-time adaptation approaches often overlook class-specific diversity in the target domain and rely on unreliable pseudo-labels due to inadequate uncertainty estimation, which may result in additional adaptation bias during test time. Towards this end, we propose a novel framework named Separability-aware Conjugate Optimization with Prototypical Retrieval (STAR) for universal test-time prompt learning of VLMs. The core of our STAR is to incorporate a separability-aware gating mechanism into conjugate optimization for reliable pseudo-learning with OOD samples. In particular, we first compute the Fisher score to quantify the separability between in-distribution (ID) and OOD samples, which guides our soft gating mechanism for divided training. Then, we employ conjugate optimization to derive reliable pseudo-labels of unlabeled data for test-time adaptation. To further mitigate biases in OOD detection, we maintain a dynamic memory bank which stores high-confidence samples to build class-wise prototypes, which would serve as queries for prototypical retrieval to calibrate OOD detection. Extensive experiments on multiple benchmarks demonstrate that STAR consistently outperforms competing baseline methods.

Abstract:
Indoor environments evolve as objects move, appear, or leave the scene. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. Our method enables temporal information sharing--using spatiotemporal contrastive loss, masking, and serialization--to adaptively leverage geometric and semantic priors across observations. This shared context enables consistent instance tracking and improves standard 3DSIS performance. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.

Abstract:
Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video.To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortions evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using a efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion.We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.

Abstract:
The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated IAA dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attributes analysis and a holistic score. Both the model and dataset will be made public.

Abstract:
Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies traine with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning.

Abstract:
As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark--an egocentric, real-world video dataset for ToM with three multiple-choice QA settings--demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine-human collaboration toward greater alignment.

Abstract:
Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

Abstract:
Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.

Abstract:
We humans rely on a wide range of commonsense knowledge to interact with an extensive number and categories of objects in the physical world. Likewise, such commonsense knowledge is also crucial for robots to successfully develop generalized object manipulation skills. While recent advancements in Multi-modal Large Language Models (MLLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by MLLMs to the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by MLLMs and the physical world where real robots operate, we can identify the knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized and accurate articulated object manipulation. Extensive experiments in both real world and simulation demonstrate the superiority of our approach.

Abstract:
Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by either designing better architectures, loss functions, or data augmentation schemes; and collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous--composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models.In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating its robustness, simplicity, and broad applicability.

Abstract:
All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address the core challenges of diverse degradation modeling and detail preservation, we propose UniLDiff, a unified framework enhanced with degradation- and detail-aware mechanisms, unlocking the power of diffusion priors for robust image restoration. Specifically, we introduce a Degradation-Aware Feature Fusion (DAFF) to dynamically inject low-quality features into each denoising step via decoupled fusion and adaptive modulation, enabling implicit modeling of diverse and compound degradations. Furthermore, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery through expert routing. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration.

Abstract:
Vision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing a less adaptability in real-time embodied scenario. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth map from two corresponding frames, enabling the real-time infer sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Considering the substantial discrepancy between panoptic and instanced depth map, we further introduce a novel Instanced Label Distribution Smoothing (ILDS) loss, followed by Gram Anchoring, to mitigate the inherent conflict between dense and discrete representation. Trained on synthetic data only, our model achieves state-of-the-art results in both depth estimation and interactive segmentation on public benchmarks. Extensive experiments demonstrate superior visual efficiency in embodied tasks compared to current fundamental models. We believe that our efficient and flexible geometric 3D model offers a new foundation for vision tasks in embodied intelligence.

Abstract:
Poisoning input views of 3D reconstruction systems has been recently studied. However, existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the pipeline. In this paper, we argue that structure-from-motion (SfM), as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM and eventually resulting in low-quality rendered views. We also provide a theoretical analysis connecting cross-view inconsistency to correspondence collapse. Experimental results demonstrate the effectiveness of PoInit-of-View on diverse 3D reconstruction systems and datasets, surpassing the single-view-based method by 25.1% percent in PSNR and 16.5% percent in SSIM in black-box transfer settings, such as 3DGS to NeRF.

Abstract:
We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.

Abstract:
Personalized text-to-image (T2I) generation has emerged as a key application for creating user-specific concepts from a few reference images. The core challenge is concept disentanglement: separating the target concept from irrelevant residual information. Lacking such disentanglement, capturing high-fidelity features often incorporates undesired attributes that conflict with user prompts, compromising the trade-off between concept fidelity and text alignment. While existing methods rely on manual guidance, they often fail to represent intricate visual details and lack scalability. We introduce ConceptPrism, a framework that extracts shared features exclusively through cross-image comparison without external information. We jointly optimize a target token and image-wise residual tokens via reconstruction and exclusion losses. By suppressing shared information in residual tokens, the exclusion loss creates an information vacuum that forces the target token to capture the common concept. Extensive evaluations demonstrate that ConceptPrism achieves accurate concept disentanglement and significantly improves overall performance across diverse and complex visual concepts.

Abstract:
Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under novel object arrangements. Furthermore, by augmenting backgrounds, textures, lighting, and camera views, Video2Robo further enhances the diversity of generated data. Extensive evaluations in both simulation and real-world environments demonstrate that policies trained on Video2Robo data achieve superior generalization and transfer performance. Project webpage: Video2Robo.

Abstract:
The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

Abstract:
3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.

Abstract:
Multimodal intent recognition (MIR) is hindered by substantial redundancy and noise originating from text, speech, and visual inputs, which weakens feature distinctiveness and ultimately harms recognition performance. Although recent approaches based on the information bottleneck (IB) principle mitigate this issue via feature compression and reconstruction to obtain compact and noise-reduced representations, they still encounter two major drawbacks. First, conventional IB employs a fixed bottleneck dimension, making it unable to accommodate sample-dependent variations in redundancy and noise. Second, simultaneously handling redundancy and noise within a single compression process leads to incomplete feature purification. In this paper, we propose a novel framework named SeD-UD, which incorporates influence-driven input-adaptive bottleneck (IDAB) modules following a hierarchically-decoupled IB structure. Given a redundancy/noise influence factor, IDAB dynamically adjusts dimensions and selects the optimal parameters for compression and reconstruction, thereby achieving the best trade-off between information preservation and interference suppression. The IB structure performs hierarchically-decoupled processing of redundancy and noise via separated de-redundancy and unified denoising based on IDAB modules. Extensive experiments on benchmark datasets show SeD-UD outperforms current state-of-the-art models.

Abstract:
Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.

Abstract:
Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +21.3% over the baseline and performs within 0.6% of the SOTA Fleming-VL. For text-based QA, it attains +7.9% over the baseline and +15.6% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, pathology, and emergency care confirm MedMO's broad cross-modality generalization and reliable spatial reasoning.

Abstract:
Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables us to identify precisely what information is missing when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that the proposed method achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions.

Abstract:
Beyond the commonly recognized optical aberrations, the imaging performance of simplified optical systems--including single-lens and metalens designs--is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments.This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatial-varying and depth-independent nature.Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models.To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors.VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process.We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model.Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods.These suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released.

Abstract:
Text-Image Person Re-Identification (TI-ReID) is widely deployed in intelligent surveillance. Built on deep neural networks and vision-language models, TI-ReID models inherit vulnerabilities to adversarial attacks, posing security risks. Yet its security remains less explored than retrieval accuracy, and its robustness to adversarial attacks is largely unexplored. To fill this gap, we propose Reconstruction-residual based Targeted and Untargeted Attack (R^2TUA), which generates perturbations from an image and adversarial prompt that mislead TI-ReID into matching the perturbed image with the adversarial identity. To precisely inject identity attributes into perturbations and achieve fine-grained targeted attack, R^2TUA proposes Transformer-based Gradual Multimodal Fusion (TGMF) that fuses image and adversarial prompt progressively across layers with tunable cross-modal weight. In addition, we propose a fully differentiable Soft Clamp Function (SCF), ensuring perturbations remain inconspicuous while avoiding local gradient vanishing effects that would trap training into suboptimal local minima. To further align perturbed images with the adversarial text descriptions while leading them to mismatch their original descriptions, R^2TUA employs (PPLs) and matching losses during training. Extensive evaluations across multiple datasets and models demonstrate the superior untargeted attack and targeted attack performance of R^2TUA. It also exhibits strong adaptability and transferability against black-box models, outperforming all related attacks across multiple tasks.

Abstract:
Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method even achieves performance improvements in general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.

Abstract:
Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection.

Abstract:
Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly visually ground with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate intricate references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative: distinguishing highly similar objects, (2) Spatial: understanding complex relational descriptions, (3) Limited: handling occlusions or tiny objects, and (4) Rejection: recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve overall performance by up to 4.5%, and (2) data-mixture training boosts rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding. Project page: https://groundingme.github.io

Abstract:
Efficiently understanding long-form videos remains a fundamental challenge for Multimodal Large Language Models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question to a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, A lightweight sampler predicts key-frame sampling weights from this matrix, selecting informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8k videos with 7k question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves 8.0% accuracy gain upon the base MLLM, and 1.1% higher accuracy than strongest baseline method.

Abstract:
There is growing interest in biomedical vision--language models trained on scientific literature. However, most pipelines compress rich multi-panel figures and long captions into coarse figure-level pairs, discarding the fine-grained correspondences clinicians rely on when zooming into local structures. We introduce Panel2Patch, a data pipeline that mines hierarchical structure from multi-panel, marker-heavy biomedical figures and their surrounding text, and converts them into multi-granular supervision. Given figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs aligned image--text pairs at the figure, panel, and region levels, preserving local semantics instead of treating each figure as a single sample. Built on this corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases in a shared embedding space. Applying Panel2Patch to a small subset of literature figures yields substantially better performance than prior pipelines, demonstrating that exploiting hierarchical figure structure can provide more effective supervision with less pretraining data.

Abstract:
In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model's ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width tends to focus more on the model's capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To achieve it, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider.

Abstract:
Recent generative models allow robots to generate future visual outcomes for action guidance, yet most still address imagination and control independently, resulting in visually coherent rollouts but physically inconsistent behaviors. While structural priors enhance spatial grounding, these methods remain visually correlation-driven rather than causally informed, overlooking the bidirectional coupling between robot actions and the evolving environment. We formalize the coupling as interaction dynamics, which specify where environmental changes occur and how actions cause them. Based on this formulation, we introduce DynBridge, an end-to-end framework that unifies imagination and control through the shared dynamics representation. Specifically, DynBridge realizes this via three components: (1) an Interaction Dynamics Generator that forecasts interaction dynamics via joint trajectory generation and action prediction; (2) an Action-Conditioned Dynamics Aggregator that integrates dynamics under control signals; and (3) a Dynamics-Guided Action Predictor that leverages the aggregated dynamics to produce executable, context-aware actions. Results demonstrate that DynBridge consistently outperforms prior methods on simulated and real-world benchmarks without external pretraining.

Abstract:
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional embedding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability.

Abstract:
This paper introduces a novel framework for wide-angle correction by distilling the geometric principles of quasi-conformal (QC) mapping into a generalizable and efficient deep neural network. Our methodology can be divided into two primary stages. In the first stage, we develop an annotation-free teacher pipeline that treats the distortion correction problem as a QC mapping task. Specifically, we minimize the Beltrami smoothness energy under constraints of both line structures and human body regions using a Linear Beltrami Solver and Proximal Gradient Descent (LBS-PGD) algorithm, thereby automatically generating high-quality QC correction flow labels. In the second stage, we propose the Quasi-conformal-mapping Distilled Wide-angle Correction Network (QDWC-Net) to learn the geometric transformation from these labels via distillation. Utilizing a Mamba-based backbone, a soft-argmin head, and a low-rank prior reconstruction module, QDWC-Net predicts the correction flow directly from a distorted input. Extensive quantitative and qualitative experiments verify the effectiveness of our approach. Notably, our distilled student network exhibits enhanced robustness compared to the teacher and achieves a massive 32x speedup (from 26.33s to 0.81s). Overall, our method provides a state-of-the-art solution that excels across multiple real-world datasets, especially in mitigating human body distortion.

Abstract:
Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assign independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.

Abstract:
Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.

Abstract:
Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics in Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and cannot precisely manipulate the semantics of specific concepts in an image. We attribute this limitation to semantic entanglement in the patch-token representations that adversarial attacks typically operate on: global context aggregated by self-attention in the vision encoder dominates patch features, making them unreliable for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features, computed within the transformer attention block, provide much more precise handles for manipulation. We show that these value features suppress global-context channels, allowing them to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method for precise local semantic attacks. V-Attack targets value features and introduces two core components: (1) a Self-Value Enhancement module to refine the intrinsic semantic richness of value features, and (2) a Text-Guided Value Manipulation module that uses text prompts to locate the source concept and optimize it toward a target concept. By bypassing entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepSeek-VL, and GPT-4o, show that V-Attack improves attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern vision-language understanding.

Abstract:
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.

Abstract:
3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences.In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead.Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images.

Abstract:
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

Abstract:
Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and 97% GPU utilization compared to prior implementations. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only 9.7 - 13.8% of the total parameters. On a 12.8M dataset from DataComp, GS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs 3-23.5x relative to pixels. When integrated into a full end-to-end LLaVA-style visual-language model (VLM), GS encoders consistently outperform the RGB VLM baseline across six VQA benchmarks, often with reduced token counts, demonstrating that a transmission-efficient, structured Gaussian representation can match or surpass pixel-based models in downstream multimodal reasoning. Our codebase can be found at https://github.com/ Tambe-Lab/GaussianVision.

Abstract:
The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT), however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality (<0.06% loss on VBench).

Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

Abstract:
Privacy-preserving Person Re-Identification (PP-ReID) addresses the core privacy-utility trade-off in Re-ID by retrieving a person across multiple non-overlapping cameras while applying anonymization techniques to protect sensitive information. However, prior PP-ReID studies are confined to single-modality visible scenarios, as 24-hour surveillance systems require robust cross-modal visible-infrared (VI) capabilities. Extending PP-ReID to the cross-modal VI setting is therefore crucial. Accordingly, we introduce a new task: Privacy-Preserving Visible-Infrared Person Re-Identification (PP-VI-ReID). This task presents two severe challenges: 1) Crude anonymization strategies destroy identity-critical information and disrupt cross-modal alignment 2) The anonymization process creates inconsistent distortions across modalities. It disrupts color-based textures in visible images while obscuring thermal contours in infrared images. This inconsistency with modality gap forms a Mixed Gap. To overcome these challenges, we propose a framework, the Precise Privacy-preserving and Alignment Network (PPA) with two components: 1) A Keypoint-Preserving Regularization (KPR) module leverages human pose as a prior to guide structure-aware anonymization, preserving essential body features. 2) A Differential Consistency-guided Modality Alignment (DCMA) treats anonymization perturbations not as varying noise, but as a stable, learnable offset, facilitating robust alignment between raw and anonymized features across modalities. Experiments on SYSU-MM01 and RegDB validate our framework, establishing a strong baseline.

Abstract:
User Interface (UI) display defect detection poses challenges far beyond UI understanding, requiring fine-grained element boundary understanding, missing-content detection, and reasoning about sequential interface semantic consistency. However, the capabilities of multimodal large language models (MLLMs) and vision-language models (VLMs) for detecting UI defects in realistic, complex interfaces have not been systematically validated. To fill this gap, we present UI-Lens, the first multi-dimensional UI display detection benchmark for Chinese-language UI scenarios. The dataset comprises 4,759 pages meticulously annotated by design experts, covering six core display defect categories. We conduct a systematic evaluation of 10 mainstream models (8 closed-source, 2 open-source). Results show clear shortcomings in current models: for tasks requiring fine-grained element boundary understanding, performance is near random, with task-average F1 scores of 20.36% and 31.21% on Text Overflow and Container Overlap, respectively; for sequential interface semantic consistency (e.g., Text Inconsistency), the task-average F1 score is only 10.61%, indicating severe underperformance. We release UI-Lens to catalyze research toward more robust UI display defect detection with fine-grained boundary awareness in realistic, complex interfaces.

Abstract:
Object pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental results on benchmark datasets show that SEGPose outperforms all current category-level pose estimation methods based on point clouds. Additionally, we apply the SEGPose to the robotic grasping tasks in real-world scenarios, and the results indicate that SEGPose exhibits excellent generalization capabilities.

Abstract:
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches.To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under simple visual representations.To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.

Abstract:
The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

Abstract:
Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.

Abstract:
Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy samples recognition relying on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG(Federated under Representation Gemometry), which follows "the principle of "representation geometry priority" to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature space, which robustly identifies noisy samples and updates the semantic-label mappings with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG outperforms state-of-the-art methods for FL with data heterogeneity under diverse noisy clients scenarios.

Abstract:
Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses --which can hurt quality and diversity -- BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.

Abstract:
The effective integration and utilization of multimodal data acquired from image cameras and LiDAR is of paramount importance for perception systems. This paper proposes Image-to-Point Cloud Feature Back-Projection (IPFP), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. Consequently, image features and point cloud features reside within the same three-dimensional space, enabling the natural enrichment of image information into the point cloud during the network forward pass. This process can be selectively enabled when desired -- for instance, at training time -- and turned off in the absence of multimodal data -- for example, at testing time if only LiDAR sensors are available. Experimental results demonstrate that IPFP can consistently improve state-of-the-art 3D semantic segmentation models, while retaining the ability to process LiDAR-only data at inference.

Abstract:
Although deep learning-based image super-resolution (SR) models have achieved remarkable progress in reconstruction quality, their high computational and memory demands make them unsuitable for lightweight platforms. To address this issue, various quantization techniques have been introduced. Among them, mixed-precision quantization (MPQ) introduces a layer-wise bit-width allocation to balance computational efficiency with reconstruction quality. However, existing MPQ methods based on post-training quantization (PTQ) for SR models face two critical limitations. First, quantization sensitivity estimation using static statistics fails to capture the accurate quantization error induced by each layer, resulting in suboptimal bit allocation. Second, removing batch normalization (BN) to preserve high-frequency details leads to scale inconsistencies across activations, making fixed quantization ranges insufficient to accurately represent their distribution. Therefore, we propose a novel PTQ-based MPQ framework tailored for SR models. Our method estimates the quantization sensitivity of weights and activations by leveraging gradients of the objective function with respect to bit-widths, enabling adaptive layer-wise bit allocation and fast convergence. Additionally, we introduce a dynamic activation range normalization that alleviates the distributional imbalance caused by the absence of BN, ensuring stable quantization under fixed range constraints. Our method outperforms existing PTQ-based methods by 1.26 dB in peak signal-to-noise ratio (PSNR) on the Urban100 dataset and reduces quantization time by x1.9 for 3-bit quantization of EDSR x4.

Abstract:
A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 11.6% on LIBERO and 31% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence to learning latent action on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.

Abstract:
Vehicle trajectory prediction is critical for safe and efficient autonomous driving. However, its generalization and scalability are hindered by heavy reliance on real-time, online priors. To break this bottleneck, we introduce RAG-TP, a framework reframing the problem from relying on uncertain online perception to retrieving from a large-scale, structured knowledge base. RAG-TP enhances inference-time predictions by dynamically querying a heterogeneous knowledge base rich with scene topologies and motion patterns, using retrieved historical experiences as priors. We further design a dynamic fusion module based on a novel Retrieval-Driven Mixture-of-Experts (MoE). Unlike conventional parametric designs, this mechanism dynamically treats retrieved knowledge units as experts, weighting and integrating them via cross-attention to generate a dense context for final multi-modal trajectory decoding. By decoupling online inference from offline knowledge, this approach grounds predictions in a vast structured database, mitigating model hallucination, compensating for unreliable priors, and significantly enhancing robustness and domain adaptation. Extensive experiments show RAG-TP achieves excellent performance in map-based and map-free settings, demonstrating highly competitive results against specialized map-free methods while performing on par with state-of-the-art (SOTA) map-based models. It demonstrates significant advantages, particularly in cross-domain and zero-shot generalization. Our work provides a promising technical pathway toward building scalable and robust prediction systems for autonomous driving.

Abstract:
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.

Abstract:
Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions--where over 93% of voxels are empty and foreground classes are rare--poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

Abstract:
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples-those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on a need basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.

Abstract:
Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.

Abstract:
The prohibitive cost of 3D attention hinders high-quality video generation with diffusion models. Existing sparse attention methods either lack content adaptivity (static) or incur excessive overhead from per-step recalculation (dynamic). Our work challenges the necessity of this trade-off, based on a twofold empirical discovery: (1) attention patterns in video diffusion exhibit strong temporal stability, and (2) the requisite computational density progressively decays. This insight motivates RAPID, a framework that performs a one-shot attention block importance estimation early in the generation process. The resulting scores and high-fidelity sparse mask are then cached for efficient reuse, eliminating recalculation overhead. The cached scores also enable an optional, multi-stage adaptive pruning (Turbo mode) for maximum acceleration. On leading models like Wan2.1-14B and HunyuanVideo, our high-fidelity configuration surpasses all baselines across key quality metrics (PSNR, SSIM, LPIPS) under a controlled compute budget. Concurrently, its Turbo mode achieves speedups of up to 1.79x on Wan2.1-14B and 2.01x on HunyuanVideo while maintaining strong visual quality.

Abstract:
LiDAR-based place recognition is highly sensitive to rain, snow, and fog, where scattering and attenuation distort geometric structure and intensity. We tackle this problem with Conditional Latent Velocity Field (C-LaV) denoising, which restores weather-robust representations before retrieval. Single-sweep point clouds are projected into three-channel bird's-eye-view (BEV) images and encoded with a frozen DINOv2-based BEV transformer to obtain a semantically anchored latent space shared across weather conditions. On this manifold, a conditional Flow Matching model learns a velocity field whose probability-flow ordinary differential equation (ODE) deterministically transports noisy latents toward their clear-weather counterparts. From the denoised manifold, a Sinkhorn Aggregation of Local Descriptors (SALAD) head produces compact global descriptors optimized with a truncated Smooth-AP loss. We also establish a unified adverse-weather benchmark with 3 m frame spacing and shared evaluation thresholds across KITTI, NCLT, and Boreas datasets. Under this protocol, C-LaV improves Recall@1 by 17.5% on NCLT snow and 21.5% on Boreas, achieving state-of-the-art weather robustness. Our dataset and code will be publicly available.

Abstract:
In many real-world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e. varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which comes with theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that the alpha expansion algorithm can be customised to achieve significant speed-ups, while maintaining its approximation guarantees. We evaluate our method on various shape matching datasets including settings in which shapes have topological artefacts.

Abstract:
Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.

Abstract:
3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose by observing the largest un-grown regions in point clouds and inpainting them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds.

Abstract:
Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plugand-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, using the prior's geometric cues to leverage the backbone's pretrained 3D knowledge. Furthermore, our framework supports image-conditioned 3D editing. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.

Abstract:
We address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited to short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage.To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU memory requirements.To further compress the representation, we apply a series of techniques, Factorized Covariance Quantization, Layered Compression, and Residual Codebook Quantization, achieving a compression ratio of up to 22.3x while preserving high visual fidelity.We implement a highly optimized C++/CUDA framework for efficient training, compression, and real-time rendering, achieving over 500 FPS on an RTX 3090 GPU. Extensive experiments demonstrate the superior storage efficiency, visual quality, and rendering speed of L4DRotorGS, consistently outperforming prior methods in both quantitative metrics and perceptual quality on real-world long dynamic scenes.

Abstract:
We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting.Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage.Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings.To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as time-dependent Poly-Fourier curve for parameter-efficient motion encoding.We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods,while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.

Abstract:
The task of visual grounding (i.e., VG) aims to locate or segment objects in images based on referring expressions. Existing research on VG primarily focuses on large objects. However, these images often contain objects at various scales. Although large objects are usually the visual focus, small objects sometimes carry crucial information. To bridge the gap, we propose a novel benchmark for small object visual grounding, i.e., SoVG. Specifically, we introduce an automatic pipeline using MLLMs to build a benchmark dataset. Our pipeline is built on the popular dataset COCO. Thus, we obtain our RefCOCOs dataset. The visual objects in our RefCOCOs have an average area of 1/50 area of an entire image, whereas that of classic VG datasets is 1/5. Furthermore, we propose SoVG-Net with a hierarchical textual infusion module for the novel SoVG task. Finally, we conduct extensive experiments using classic datasets with our RefCOCOs. The results showcase that our built dataset is useful for advancing VG research, and our proposed SoVG-Net is a strong baseline. Our dataset and code will be made publicly available after review.

Abstract:
Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are primarily designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon, i.e., the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints on edge devices, models are often limited to low-resolution inputs (e.g., <224x224 for CLIP-ViT/B-16) and generate fewer tokens (with the position embedding resized), which poses a unique challenge in performance robustness. To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture, i.e., two student models fed with inputs at different resolutions synergistically learn from each other. Building upon this, we contextualize the students' prompt with resolution-invariant attributes queried from the LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets are bridged between the different encoders of two students, wherein we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of more cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring significant practical utilities.

Abstract:
In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct pose regression heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.

Abstract:
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. New state-of-the-art results across both tasks demonstrate the power of our unified paradigm. DriveLaW not only significantly advances video prediction, surpassing the previous best-performing work by 33.3% in FID and 1.8% in FVD, but also sets a new record on the NAVSIM planning benchmark.

Abstract:
We propose a novel unsupervised framework for online video stabilization. Unlike deep learning-based stabilizers that require paired stable/unstable datasets, our method models the classical three-stage stabilization pipeline and integrates a multithreaded buffering mechanism, effectively addressing three key challenges of end-to-end learning: limited data, poor controllability, and inefficiency on resource-constrained hardware. Existing benchmarks mainly focus on handheld, forward-view, visible-light videos, restricting the application of stabilization in domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our approach consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to that of offline methods.

Abstract:
Large Vision-Language Models (LVLMs) often hallucinate when visual evidence conflicts with world knowledge, i.e., in counterfactual scenarios. We propose Envision-Attend-Respond (EnAR), a training-free framework that leverages visual priors to steer the model's attention toward counterfactual elements in the image. The Envision stage constructs a visual impression by invoking a diffusion prior to perform latent perturbations, yielding a prior-consistent counterpart of the input image. The Attend stage processes the original image and its visual impression through the LVLM's vision encoder to localize counterfactual elements, forming a corresponding padded input. The Respond stage performs contrastive decoding between the original and padded inputs to suppress bias and enhance visual understanding. Empirically, EnAR consistently mitigates hallucinations and improves response fidelity, achieving a 10.82% gain on VLMBias and an average 6.9% improvement on POPE, demonstrating robustness across both counterfactual and general hallucination settings. Moreover, the framework remains effective across heterogeneous LVLM architectures, offering a new perspective for hallucination governance in multimodal reasoning.

Abstract:
Self-supervised contrastive learning has emerged as a powerful paradigm for skeleton-based action recognition by enforcing consistency in the embedding space. However, existing methods rely on binary contrastive objectives that overlook the intrinsic continuity of human motion, resulting in fragmented feature clusters and rigid class boundaries. To address these limitations, we propose TranCLR, a Transitional anchor-based Contrastive Learning framework that captures the continuous geometry of the action space. Specifically, the proposed Action Transitional Anchor Construction (ATAC) explicitly models the geometric structure of transitional states to enhance the model's perception of motion continuity. Building upon these anchors, a Multi-Level Geometric Manifold Calibration (MGMC) mechanism is introduced to adaptively calibrate the action manifold across multiple levels of continuity, yielding a smoother and more discriminative representation space. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets demonstrate that TranCLR achieves superior accuracy and calibration performance, effectively learning continuous and uncertainty-aware skeleton representations. Code will be made publicly available.

Abstract:
Event cameras have attracted considerable attention for object tracking due to their microsecond-level temporal resolution and wide dynamic range, yet effectively harnessing spiking neural networks (SNNs) in this domain remains challenging. In this paper, we introduce SpikeTrack, a purely spike-driven framework for single-object tracking that addresses the shortcomings of RGB-based approaches in fast-motion or target appearance change. Central to SpikeTrack is the Multi-Search-sequence-and-Single-Template (MSST) training paradigm, which captures rich temporal dependencies, alongside a Dynamic Integer Leaky Integrate-and-Fire (DI-LIF) neuron that adaptively predicts integer-valued activations based on the input features during training and converts them into spikes during inference. Our design preserves the intrinsic sparsity and fine-grained spatiotemporal acuity of event data, resulting in efficient energy consumption without sacrificing performance. Extensive evaluations on FE108, FELT, and VisEvent demonstrate that SpikeTrack exceeds the performance of state-of-the-art trackers in both accuracy and efficiency. Furthermore, ablation studies validate each module's contribution, highlighting the practical potential of spike-driven architectures for future vision applications.

Abstract:
The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.

Abstract:
We propose MagicQuill V2, a novel framework that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of modern diffusion models and the granular control of traditional graphics software. While state-of-the-art diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and style. To overcome this limitation, our method deconstructs creative intent into a stack of independently controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive, and powerful control over the generative process.

Abstract:
Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without explicit model to identify critical points for future predictions. Our code will be made publicly available.

Abstract:
We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities.Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.

Abstract:
On-the-fly Category Discovery (OCD) aims to dynamically identify both known and emerging unknown categories from streaming data, using supervision from only a limited set of labeled classes. Despite recent progress, our empirical analysis reveals fundamental limitations: existing methods suffer from cascading feature-to-hash degradation and severe space monopolization by known classes, fundamentally hindering novel category discovery. To address these coupled challenges, we introduce a principled two-stage framework.We first construct a Hyper-Semantic Space with dual geometric subspaces: a Derived Subspace employing parent-derived prototype augmentation to capture intra-class diversity and enhance inter-class discrimination, and a Calibrated Subspace synthesized through cross-prototype interpolation to impose distributional constraints and prevent representational collapse.Within this geometrically-constrained space, we perform Assignment-Driven Hash Learning, where Flexible Prototype Assignment (FPA) models intra-class variations and enhances inter-class separation, alongside Binary Hash Regularization (BHR) to enforce compact and discriminative hash representations. Our framework serves as a plug-and-play module, consistently improving state-of-the-art OCD methods across fine-grained benchmarks. Code will be released upon acceptance.

Abstract:
Referring 3D segmentation seeks to localize and segment target objects in a 3D scene given a natural-language query, requiring joint reasoning over geometric structures and linguistic cues. Although recent progress using 3D Gaussian Splatting (3DGS) has improved rendering quality, existing methods still struggle to spatially ground textual references due to two fundamental limitations: (1) language encoders provide no explicit positional priors, weakening geometric relation modeling; and (2) cross-modal attention is self-reinforcing, causing spatial errors to propagate through the Gaussian field once misalignment occurs. To address this, we propose GeoCGA, a geometry-aware cross-modal graph alignment framework that bridges linguistic semantics with the 3DGS representation. GeoCGA introduces position-aware prompt expansion to build a semantic-spatial graph capturing relational structure in text, and constructs a Gaussian-based geometric graph encoding 3D topology. A cross-modal alignment module enforces geometric consistency between the two graphs, enabling stable and spatially grounded correspondence across views. GeoCGA consistently outperforms prior state-of-the-art methods, yielding relative mIoU improvements of 20.8% on Ref-LERF, 5.7% on LERF-OVS, and 1.0% on 3D-OVS.

Abstract:
Spatial transcriptomics (ST) links gene expression to tissue architecture and enables predicting spatial expression from H&E-stained whole-slide images (WSIs). However, existing spot- or slide-level predictors focus on single-spot features or pairwise relations, failing to capture high-order, many-to-many cross-cell interactions. As a result, they miss synergistic and antagonistic effects among multiple neighboring cells. Here, we introduce MCToGene, a scalable and accurate framework that explicitly models multi-cell interactions via many-body attention with hierarchical coupling to predict spatial gene expression. MCToGene employs a many-body attention module to encode high-order, many-to-many cross-cell dependencies, enabling context-aware microenvironment modeling. To mitigate the combinatorial burden of many-body modeling, we design a hierarchical interaction module that couples pairwise and many-body representations for feature aggregation and prediction, preserving many-body expressiveness while controlling computation and memory. On HEST-1k and STImage-1K4M, MCToGene surpasses state-of-the-art baselines with 7.85% relative improvement. Ablations confirm that explicit high-order, many-to-many modeling drives these gains, and visualizations demonstrate that multi-cell interactions are essential for biologically coherent spatial predictions.

Abstract:
With the rapid deployment of multimodal large language models (MLLMs), disputes regarding model ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model. It is specifically trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.

Abstract:
Large Multimodal Models (LMMs) have attained impressive achievements in multimodal processing tasks, yet their massive memory demands pose major obstacles to deployment on resource-limited devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LMMs, delivering substantial reductions in memory overhead. However, existing SVD-based methods often struggle to effectively alleviate the errors caused by SVD truncation, resulting in a noticeable performance gap when compared to the original models. Moreover, adopting a uniform compression ratio across all transformer layers fails to consider the varying importance of different layers. To tackle these challenges, we propose AdaSVD, an adaptive SVD-based LMM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios according to the relative importance of each layer. Comprehensive experiments across multiple LMM families show the effectiveness of AdaSVD, achieving better performance while significantly reducing memory requirements. We will make all the code and models of AdaSVD publicly available.

Abstract:
Enhancing temporal understanding of MLLMs is essential for long-form video analysis, supporting tasks such as temporal localization and time-sensitive question answering. While reinforcement learning (RL) has been explored for temporal reasoning, existing approaches are often limited to specific tasks or datasets, hindering generalization across diverse temporal scenarios. To address this, we present TempR1, an RL-based method that strengthens MLLMs' temporal comprehension through multi-task training. We construct a multi-task corpus covering diverse temporal structures and build upon the Group Relative Policy Optimization (GRPO) algorithm for stable cross-task optimization. Temporal tasks are categorized into three interval-instance correspondence types, with tailored localization rewards designed for each. This design enables TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Experiments show that TempR1 achieves state-of-the-art performance on multiple benchmarks. Joint optimization across complementary tasks further produces strong synergy, improving both generalization and single-task performance, thereby providing a scalable paradigm for temporal reasoning in MLLMs.

Abstract:
Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

Abstract:
Generating realistic conversational gestures are essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

Abstract:
We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges -- including rapid egomotion, and frequent hand-object interactions -- that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction.To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset specifically designed for egocentric editing scenarios, featuring rich hand-object interactions, while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion.Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks--where existing methods struggle--while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community.

Abstract:
Exploiting bi-directional context prediction has long been recognized as a key direction for improving compression efficiency in neural video coding. However, existing neural B-frame codecs still exhibit limited performance gains, particularly in high-resolution videos with large motion, where optical flow estimation becomes unreliable and balanced prediction fusion introduces distortions. To address these challenges, we present the first High-Resolution bi-directional neural video coding method, termed as HR-NVC, which non-uniformly integrates confidence-guided predictive cues from both temporal directions to achieve more reliable and efficient compression. Specifically, we propose Spatio-Temporal Anchored Motion Estimation, which introduces virtual anchor frames and low-resolution priors to significantly improve estimation robustness under large displacements. We further design a Hierarchical Motion Representation that converges multi-scale motion with temporal references, enabling compact and adaptive modeling of motion reliability across resolutions. Finally, a Bi-Contextual Asymmetric Harmonization module performs confidence-guided fusion of bidirectional references, effectively suppressing unreliable contexts and restoring structural consistency near occlusion and scene transition regions. Notably, our model is the first end-to-end-optimized video codec evaluated on 4K-resolution videos, establishing a new benchmark for higher-resolution NVC and achieving state-of-the-art performance among neural B-frame codecs.

Abstract:
Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary representations. Extensive experiments on the challenging RCooper dataset demonstrate that GSV2X sets a new state-of-the-art in multi-view roadside perception and exhibits remarkable robustness in complex, real-world scenarios.

Abstract:
Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization. We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how a set of learned atomic motion primitives contribute to the observed motion. A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads. This representation introduces structure, compositionality, and improved generalization across distinct supervisions. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experimental results show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks in real-world datasets.

Abstract:
Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable "machine" for creating nuanced evaluation suites. Our extensive evaluation of 30 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues, with even top proprietary models falling short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.

Abstract:
Medical image segmentation is challenging due to limited annotated data, high labeling costs, and substantial image heterogeneity. Although large-scale vision foundation models (e.g., SAM) have shown great potential in this field, existing SAM-based methods typically rely on expert-defined geometric prompts or complex clinical text prompts, which limits their generalizability across diverse medical image segmentation tasks. To overcome these challenges, we propose Simple-ViLMedSAM, a CLIP-SAM integration framework that enables high-accuracy segmentation in zero-shot and few-shot settings using only simple text queries, that is, using only basic anatomical or disease-related text labels. At its core is an Implicit Pos-Prompter (IPP), which generates attribution maps containing implicit positional cues to replace traditional geometric prompts. IPP incorporates a multi-modal information bottleneck and an affinity-based refinement strategy to ensure high-quality guidance from CLIP-SAM interactions. To further enhance segmentation, we introduce a Bidirectional Interaction Decoder (BID) that employs bidirectional cross-attention to align IPP's positional maps with SAM's pixel-level features. By jointly modeling global semantics and local details, BID significantly improves segmentation accuracy. Extensive experiments on four public datasets demonstrate that Simple-ViLMedSAM consistently outperforms existing methods in both zero-shot and few-shot medical image segmentation tasks, using only simple text queries. The code will be publicly available upon acceptance.

Abstract:
Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.

Abstract:
Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, existing DM-based VSR methods over-prioritize perceptual synthesis while neglecting fidelity gains from accurate alignment and sufficient compensation. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).

Abstract:
Humans commonly reason about object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the reference image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark reveal the superiority and robustness of our method in seen and unseen scenarios compared to existing approaches. The code and models will be made publicly available.

Abstract:
Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. Meanwhile, to alleviate user intent ambiguity, RoadGIE introduces a topo-semantic instantiation during training to enhance interaction stability and consistency. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7 million parameters and real-time processing capabilities.

Abstract:
Streaming dense video captioning requires real-time processing of continuous visual input while determining precisely when and what to caption. Current approaches primarily focus on designing complex external memory mechanisms, failing to leverage Large Multimodal Models' (LMMs) inherent long-context capabilities. Moreover, existing methods employing threshold-based caption triggering face a severe Threshold-Gated Discrepancy (TGD) problem, a training-inference mismatch arising from data imbalance, where models predominantly predict silence tokens, requiring thresholds that vary drastically across videos with extremely narrow effective ranges. We introduce Takusen, an asynchronous temporal modeling two-agent framework comprising a Small Multimodal Model (SMM) as an Oracle agent and an LMM as a Listener agent. The Oracle agent processes sparse video inputs at an accelerated rate to detect event boundaries, while the Listener agent processes dense inputs to generate accurate captions when prompted by the Oracle's signals. This architecture eliminates threshold dependencies by fundamentally changing how silence/generation decisions are made, resolving the TGD problem. To enhance robustness against boundary prediction instabilities, we integrate uniformly distributed fixed decoding points with Oracle-predicted boundaries. Experiments on ActivityNet Captions and YouCook2 datasets demonstrate that Takusen achieves state-of-the-art performance with a simpler and more efficient design that balances temporal sensitivity with descriptive accuracy.

Abstract:
Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.

Abstract:
Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the robustness and accuracy of pose estimation, demonstrating that global geometric reasoning can effectively complement learning-based inference in structure-from-motion.

Abstract:
Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x.

Abstract:
Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.

Abstract:
Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC->D, and a 2ximprovement over real-world baselines across three long-horizon generalization settings.

Abstract:
Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary largely in format, dimension, and statistics. Multi-modal large language models offer a promising resolution but remain too complex for practical use. Thus, we propose OmniZip, a unified and lightweight lossless compressor for multi-modal data (like image, text, speech, tactile, database, and gene sequence). Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model's nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. It outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42%, 57%, 62% and 42%, 53% higher compression efficiency than gzip on CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching up to 1MB/s on MacBook CPUs and iPhone NPUs. Our code will be released upon acceptance.

Abstract:
Large vision-language models (LVLMs) are highly vulnerable to visual corruptions, substantially compromising their reliability and limiting real-world deployment. Prior work has attributed this degradation primarily to insufficient visual grounding and overreliance on language priors. However, these explanations often overlook the heterogeneous nature of corruptions, which perturb model perception in fundamentally different ways. We revisit this problem from a corruption-centric perspective and show that diverse corruptions can be organized along two complementary perceptual dimensions--shape and texture--which induce distinct failure modes. To address them, we propose Shape-Texture Dual-Path Contrastive Decoding (ST-CD), a training-free inference framework that constructs complementary contrastive views to diagnose and correct shape- and texture-induced biases through adaptive fusion. Experiments across multiple LVLMs and robustness benchmarks demonstrate that ST-CD consistently improves robustness under heterogeneous corruptions, suggesting that leveraging the complementarity between shape and texture provides a general and effective principle for building robust multimodal models.

Abstract:
Understanding genotype-phenotype relationships is pivotal for advancing biomedical research, drug discovery, and precision medicine. With the rise of high-throughput cellular imaging, it is essential to tightly integrate high-content cellular morphology with structured biological knowledge to extract cellular-scale evidence for genotype-to-phenotype mapping.However, integrating high-dimensional, heterogeneous, and noisy phenotypes with structured knowledge remains challenging. Prior approaches typically treat phenotypes as node features, overlooking that phenotypes primarily convey cellular-scale relational signals about how perturbations reshape interactions. We present KERNEL, a knowledge-guided multimodal graph learning framework that integrates cellular imaging phenotypes into a unified knowledge graph to predict genotype-phenotype interactions, including GRN inference, drug-target interaction prediction, and subtype-specific subnetwork discovery. KERNEL dynamically augments task-relevant edges from noisy phenotypic signals, explicitly learns per-edge confidence and marginal utility, and uses knowledge gating to align graph topology with mechanistic pathways. Across large-scale imaging and single-cell datasets, KERNEL consistently outperforms state-of-the-art baselines, e.g., up to 38.1% AUPR improvement for GRN inference, while delivering more accurate and interpretable DTI and subtype subnetwork discovery, demonstrating robust mechanism learning from richer, harder-to-denoise phenotypes.

Abstract:
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.

Abstract:
Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo--shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art IID models, including both CNN-based and diffusion-based architectures originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.

Abstract:
Zero-shot anomaly detection (ZSAD) aims to identify unseen anomalies without abnormal supervision, which is essential for open-world scenarios. Recent vision-language models such as CLIP enable anomaly reasoning through shared visual-textual embeddings, but existing methods often rely on coarse prompt fusion, leading to unstable alignment and imprecise localization under domain shifts. To address this issue, we propose the Semantic Graviton Network (SGNet), a physics-inspired framework that models multimodal alignment as an adaptive potential field. We introduce semantic gravitons, learnable dynamic mediators that bridge visual and textual modalities by establishing localized semantic equilibria through attraction and equilibrium forces. A graviton interaction network alternates text-to-graviton and vision-to-graviton coupling to progressively refine multimodal correspondence, while an energy-based potential regularization further stabilizes the interaction process. Extensive experiments on ten industrial and medical benchmarks show that SGNet achieves state-of-the-art performance for zero-shot anomaly detection.

Abstract:
This paper proposes the first task for high-speed 3D point tracking using multi-view Event-RGB hybrid cameras. We design a cuboid observation device comprising 4 RGB cameras (30fps) and 2 Event cameras to synchronously capture high-speed motions, and propose MER-Tracker, a high-frame-rate 3D point-tracking network that fuses the complementary strengths of dual modalities. We first respectively extract 2D motion-change features from the RGB and Event modalities, then apply linear interpolation and anchor sampling to fuse the discrete RGB 3D features and continuous Event 3D features after 3D lifting, and finally employ a LoRA-tuned Transformer based on temporal correlationship to predict the high-frame-rate 3D point trajectories over fast motions, accomplishing high-speed 3D point tracking. To verify the effectiveness of our method, we construct both real-world and simulated high-speed motion datasets. Experiments on these datasets show that our method achieves accurate high-speed 3D point tracking at high-frame-rate (150fps), outperforming state-of-the-art methods.

Abstract:
We present a physics-based method for simulating full-body agents that recover balance by stepping or applying contact forces after being perturbed in dense crowds. While traditional 2D crowd simulations focus on navigation and social interactions in moderately dense settings, interactions in highly dense environments are predominantly physical, leading to push propagation, falls, and potential hazards. Existing models cannot capture how forces are transmitted through the body at the limb level. To address this, we use physics-based anthropomorphic simulations combined with a two-stage deep reinforcement learning framework. In the first stage, a policy is pre-trained using reference motion data and general balance rewards, enabling agents to handle a wide range of perturbations. In the second stage, an adaptive phase refines the policy to allow socially aware interactions, using hand contacts for stabilization guided by an online heuristic targeting neighbors' shoulders based on mechanical efficiency and collision risk. Ablation studies validate the training framework and reward components, and simulations reproduce trends observed in empirical studies of push propagation. Our method scales to large populations, offering new opportunities to study safety and collective behavior in dense crowds.

Abstract:
Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design.

Abstract:
Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.

Abstract:
Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision's foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.

Abstract:
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.

Abstract:
Achieving strong policy generalization to unseen environments remains a core challenge in visual reinforcement learning, and segmenting task-relevant regions to mitigate the influence of irrelevant visual cues has emerged as a promising direction. However, existing methods rely solely on the current observation, lack temporal information, and fail to exploit preceding observations, leaving learned policies susceptible to task-irrelevant background variations and ultimately degrading policy performance. In this paper, we propose temporal segmentation for task-relevant mask in visual reinforcement learning, named TSTM, which extracts task-relevant regions from sequential observations by exploiting temporal information, thereby producing more reliable masks and improving policy generalization. TSTM introduces a temporal segmentation network with an encoder-temporal-decoder architecture, where a convolutional LSTM module captures temporal dependencies across observations. To reduce inference overhead, we further develop a lightweight student network as an efficient substitute for the teacher network. The resulting task-relevant masks are encoded by a CNN-based encoder, and invariant representation learning is employed to improve robustness by enforcing consistency between representations of the original and augmented observation sequences. With these task-relevant representations, we train an actor-critic agent to learn a policy with strong generalization capability. Experimental results demonstrate that TSTM achieves superior generalization performance over existing state-of-the-art methods on most visual RL tasks.

Abstract:
Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization--while enabling scalability, affordability, and camera-like data structures--introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical "ghosts in the point clouds." This paper introduces a physically grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator--the Transient Glare Spread Function (TGSF)--acting on the transient measurements. Building on this model, we develop a training-free approach that operates on low-level LiDAR detections (or echoes) prior to point-cloud formation, leveraging knowledge of the glare spread function to reason about the likelihood of each detection arising from glare. The resulting approach is compatible with existing LiDAR signal-processing pipelines, and deployable on unmodified commercial sensors. Using experiments with real single-photon LiDAR hardware, we demonstrate substantial suppression of severe glare artifacts while preserving true scene structure.

Abstract:
Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods have primarily focused on interpreting image classifiers, leaving the domain of video models relatively underexplored. For the video CFEs to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.

Abstract:
With the rapid growth of stereoscopic 3D devices, real-time stereoscopic conversion has become increasingly essential. However, most existing approach rely on depth estimation, forward warping, and heavy inpainting network, resulting in high computational cost and artifacts near occlusion boundaries. Diffusion-based models have also been explored, but they suffer from iterative sampling and geometric inconsistency, making them unsuitable for real-time deployment. To address these issues, we propose Bi-Warp, a simple yet effective approach that synthesizes the right view without inpainting network by leveraging warping operations. Our approach estimates backward flow, approximates the corresponding forward flow, and generates two candidate right views via bidirectional warping. A learnable mask adaptively fuses the candidates, preserving left-right geometric consistency. Building on Bi-Warp, we introduce XPaintNet, a lightweight network that achieves comparable visual quality to state-of-the-art methods while maintaining real-time performance over 100 FPS at 2K resolution.

Abstract:
Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods--originally designed for unimodal large language models (LLMs)--to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16x and the decoding time by 1.26x.

Abstract:
Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS , a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

Abstract:
Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surface due to limited overlap and overfitting.We introduce FSFSplatter for fast geometrically accurate reconstruction from free sparse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization.Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view feature supervision, along with differentiable camera parameters within 2 minutes.FSFSplatter outperforms current state-of-the-art methods on widely used DTU, Replica, and BlendedMVS datasets.

Abstract:
In safety-critical industrial applications, accurately identifying causal relationships and quantifying uncertainty is essential for tasks such as root cause analysis, feature selection, and process optimization. Traditional causal discovery methods inadequately handle nonlinearities and complex uncertainties prevalent in industrial sensor data. To address this, we introduce a novel causal discovery framework that integrates Polynomial Chaos Expansion (PCE) representations of stochastic noise into structural equations. This method effectively captures complex nonlinear couplings and arbitrary noise distributions characteristic of industrial data. We rigorously prove the identifiability of causal structures under mild sparsity conditions on the chaos coefficients, extending linear non-Gaussian acyclic model (LiNGAM) identifiability results. Extensive experiments demonstrate superior accuracy, robustness under non-Gaussian noise conditions, and practical uncertainty quantification. This framework presents a principled, interpretable, and computationally feasible approach to causal analysis in nonlinear uncertain industrial environments.

Abstract:
Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowd-sourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by 13x. Second, we address label noise in crowd-sourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by 3x. Together, these techniques reduce the evaluation cost by 40x, making large-scale evaluation feasible. Finally, we use our methods to conduct a large scale crowd-sourced study comparing recent automated interpretability methods for vision networks.

Abstract:
We investigate state-of-the-art deepfake detectors that leverage ViT-based vision foundation models and discover that the [CLS] token suffers from the Pre-trained Information Bias (PIB), i.e., it tends to mainly focus on global semantics due to the knowledge dominated by pre-trained model parameters, while struggling to emphasize subtle local forgery cues. To overcome this limitation, one potential way is incorporating the token-level features to reform a detection-specific token. To this end, we propose Query-Driven Token-Level Forgery Purification (QTFP) framework to better capture local forgery traces without losing useful pre-trained prior. Specifically, we introduce randomly initialized, learnable query tokens independent of the backbone and prior knowledge, which effectively aggregate multi-patch evidence into a global token for detection. To make query tokens focus on meaningful regions, we propose a theoretical fake-likelihood contrastive learning loss, which employs a weighting strategy to highlight significant fake regions while diminishing real-like patch impact. Using SNR theory, we verify that the designed weight is both reliable and informative. To further maintain useful authentic information, a real-attention alignment constraint is applied to query tokens. These designs go beyond relying solely on the [CLS] token by jointly reorganizing real and fake information across all tokens, which successfully enhance detector robustness. Extensive experiments on diverse datasets demonstrate the effectiveness of our method.

Abstract:
Prevailing incomplete multi-view clustering (IMVC) approaches typically fail to account for the interference of view-exclusive artifacts when learning view-consensus representations, which could compromise the fidelity of the resulting similarity measure. Moreover, inconsistencies in anchor order across views may distort the graph structure, impairing the clustering performance. The reliance on carefully-tuned regularization hyper-parameters also usually undermines the model's practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. It explicitly models the exclusive-to-consensus interplay, derives a view-private graph from each view, and adaptively integrates them into a global consensus affinity according to the respective view's intrinsic characteristics. Furthermore, a permutation transformation with unary encoding constraints is applied to anchor matrix, realigning anchor topology while preserving the values. This process synchronizes anchor order prior to similarity integration and maintains original anchor properties. Notably, all components are coupled seamlessly and optimized in a joint manner. Also, the provable overall linear complexity further enlarges its scalability and practicality. Experimental results confirm that PJFTH receives competitive performance compared to several leading methods.

Abstract:
Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. The dataset will be publicly released upon acceptance. show3d-dataset.github.io

Abstract:
Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for GCD are usually built on multi-modality representation learning, which pays heavily attention upon inter-modality alignment rather than intra-modality alignment. In this paper, we propose a novel and effective multi-modal representation learning approach for GCD via Semi-Supervised Rate Reduction, called SSR^2-GCD, to learn cross-modality representations with desired underlying structure properties via properly harnessing intra-modality alignment. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets, demonstrating superior performance of the proposed approach and verifying the importance of harnessing an proper intra-modality alignment.

Abstract:
Infrared small target detection is one highly special category of object detection, faced with tiny target imaging size and cluttered backgrounds. Currently, almost all existing methods are target-centered, directly learning the target features from backgrounds. However, due to weak target signals, they are often difficult in effectively capturing stable features. Sometimes, they cannot even distinguish real targets from background confounders. To overcome these problems, from an opposite perspective, we propose the first Causal-guided Hierarchical Anomaly-aware Learning (CHAL) framework. Breaking through target-centered paradigm, it focuses on background learning, while the targets are handled as the anomalies in backgrounds. In detail, to fulfill the goal, a spatio-temporal neural field is designed to model the background evolution patterns from generative perspective. Meanwhile, a hierarchical anomaly-aware learning is proposed to decompose anomaly discovery. Furthermore, to block the spurious correlations often caused by background confounders, and enhance true target causality, a causal-guiding mechanism is designed. The experiments on three infrared datasets verify the effectiveness and superiority of our CHAL. Even in visible-light scenarios, it still possesses obvious adaptivity. Source code will be open.

Abstract:
Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding.

Abstract:
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into the tokenizers training via distillation, we empirically find this approach inevitably weakens the robustness of learnt representation from original VFM. In this paper, we bypass the distillation by proposing a more direct approach by leveraging the frozen VFM for the LDMs tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the potential to leverage frozen VFM for the LDMs tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of VFM. With the proposed VFM-VAE, we conduct a systematic study on how the representation from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our effort in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10xspeedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers to accelerate the LDM training progress.

Abstract:
With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

Abstract:
Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user's 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2x better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and identity tracing.

Abstract:
Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning RL framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.

Abstract:
Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by 30% and reduces hallucination by 20%, outperforming all visual encoders under identical settings. Code is available at the Supplementary.

Abstract:
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60k high-quality video pairs from captured and synthetic sources, covers five effects types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. Then, an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

Abstract:
Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.

Abstract:
With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder accessibility for resource-constrained researchers and small teams. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: 1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; 2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and 3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

Abstract:
Dynamic driving scene modeling is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods offer fast inference through data-driven priors, they struggle with long-range driving sequences due to quadratic complexity in sequence length and restrictive assumptions for dynamic objects. We propose UFO, a novel recurrent paradigm that combines the strengths of optimization-based and feed-forward methods for efficient long-range 4D modeling. Our approach maintains a persistent 4D scene representation composed of scene tokens that are progressively refined as new frames arrive, enabling future observations to correct earlier uncertain predictions. A visibility-based filtering mechanism exploits the locality of 3D-to-pixel correspondence, reducing complexity from quadratic to near-linear in sequence length. For dynamic objects, we introduce object pose-guided modeling with learned lifespans, enabling complex long-range motion without kinematic assumptions. Experiments on the Waymo Open Dataset demonstrate that UFO significantly outperforms both per-scene optimization and feed-forward baselines across 2s, 8s, and zero-shot 16s sequences.

Abstract:
X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM's poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.

Abstract:
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding thousands of frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while reducing inference cost. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

Abstract:
Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.

Abstract:
Millimeter-wave radar's all-weather capability makes it increasingly vital for autonomous perception. However, the high cost of radar data collection drives the need for data generation to augment radar datasets. Existing works mainly target partial radar representations, e.g., 2D or 3D slices, leading to information loss and limited downstream performance. To overcome these issues, we introduce the novel task of LiDAR-to-4DRadar translation, which generates complete 4D radar tensors, with three spatial and one Doppler axes, guided by LiDAR data that preserve spatial and semantic consistency. We propose a novel diffusion bridge model in an aligned LiDAR-4DRadar latent space, namely L2RLDB, to tackle this task. Specifically, first, a key-voxel-aware VAE compresses high-dimensional, noisy radar tensors into a compact latent space, while enabling precise numerical reconstruction and key-voxel identification. Second, to bridge the cross-modal gap between sparse 3D LiDAR and dense 4D radar, we develop a patch-wise contrastive learning module to align LiDAR latents with radar semantically and spatially. Finally, we formulate the translation as a diffusion bridge process between LiDAR and radar latents, enabling the synthesis of full radar tensors from Doppler-lacking LiDAR inputs. Experiments verify that L2RLDB achieves high-fidelity 4D radar generation and significantly improves downstream detection through data augmentation.

Abstract:
We introduce the problem of material-aware part grouping in untextured meshes.Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations.When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming.To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties -- when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context.We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials;therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part.To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries.We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.

Abstract:
Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination--text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking--dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both preservative and generative editing through seamless integration of Adobe Lightroom and Qwen-Image-Edit tools. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity, while maintaining competitive performance in generative editing tasks.

Abstract:
Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that randomly masks 4D inputs to the VLM and trains the VLA to reconstruct the masked features. This self-reconstruction objective helps learn effective 4D representations, allowing the 4D branch to be dropped at inference with minimal performance loss. Extensive experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7xlarger. On edge devices, SwiftVLA achieves comparable performance while being 18xfaster than the \pi_0 and reducing the memory footprint by 12x.

Abstract:
Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. We will release our code and dataset soon.

Abstract:
Deep learning models remain highly vulnerable to evolving adversarial attacks. While existing continual adversarial training approaches often assume abundant adversarial data at each stage, real-world scenarios frequently involve limited data availability. This paper addresses the setting of Few-shot Continual Adversarial Training, where only a small number of adversarial examples are available per stage, presenting major challenges in achieving robust generalization and mitigating catastrophic forgetting. To tackle these challenges, we propose a novel continual adversarial training framework that incorporates three key components: (i) an Adversarial Margin loss that explicitly pushes clean samples away from decision boundaries to enhance feature discrimination; (ii) a Gaussian mixture model Prototype Replay strategy that synthesizes representative pseudo-features to preserve knowledge of past adversarial domains; and (iii) a Multi-Domain Balanced loss that guides updates to stabilize learning across diverse attack distributions. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that our approach consistently outperforms state-of-the-art methods in both clean and robust accuracy across a variety of adversarial settings. The code will be released.

Abstract:
Panoramic semantic segmentation is pivotal for comprehensive 360deg scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.

Abstract:
Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM,a modality- and task-agnostic FL framework that unifies training across classification,segmentation, super-resolution, visual question answering,and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates(i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.

Abstract:
Multi-view clustering for remote sensing data has received increasing attention by leveraging diverse data representations to enhance Earth observation. Existing methods are primarily developed under the assumption that each pixel is fully observed across all views. No prior work has investigated the more practical yet challenging scenario where some views suffer from partially missing data. To bridge this gap, this paper presents the first study on clustering incomplete remote sensing data, termed orthogonal spatial-aware multi-view anchor graph clustering (OSMAGC). Specifically, spatial-aware anchors and multi-scale anchor graphs are initially constructed by exploiting the superpixel-based texture characteristics of each view. Based on these, multi-scale anchor graph learning is performed through view weighting and matrix factorization on incomplete data. Structure-aligned consensus feature learning is achieved by aligning the multi-scale graph structures within a shared latent space. To ensure spatial continuity and smoothness, orthogonal spatial-aware regularization is imposed in both horizontal and vertical directions. These three modules are jointly optimized through a well-designed optimization algorithm in a mutually reinforcing manner. Extensive experiments on four benchmark datasets validate the effectiveness and efficiency of our proposed method over the state-of-the-art competitors.

Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid that limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off-The-Grid" distribution. Inspired by keypoint detection, our decoder learns to locally distribute primitives across image patches. We also provide an Adaptive Density mechanism by assigning varying number of primitives per patch based on Shannon entropy. We combine the proposed decoder with a pre-trained 3D reconstruction backbone and train them end-to-end using photometric supervision without any 3D annotation. The resulting pose-free model generates photorealistic 3DGS scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts.

Abstract:
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. We have also conducted preliminary experiments with reinforcement finetuning (RFT) over synthetic data, and find a significant improvement on both in-domain synthetic subset and real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource and our code releases can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.

Abstract:
Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to "persistent failures" in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause--whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public Navsim benchmark for overall PDMS, EPDMS score and high-level planning accuracy.

Abstract:
Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel's deviation from the global model, with the penalty's strength scaled by the client's pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.

Abstract:
Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective task decomposition of long-horizon goals with close-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an implausible visual intermediate that renders points, boxes, arrows, and typed relations in the robot's current views to externalize spatial intent, connect language to scene geometry, and provide a human-verifiable bridge between high-level reasoning and low-level control. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See -> Think -> Sketch -> Act workflow coordinated by adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate diverse corpus with interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, in simulation and on real-world tasks, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans.

Abstract:
Radar semantic segmentation (RSS) is critical for robust perception in adverse conditions, but poses unique challenges: radar frequency maps are highly anisotropic, multi-scale, sparse and noisy. Conventional CNN or Transformer architectures, designed for camera images, fail to account for these characteristics, leading degraded performance. We propose MARSS (Modular Attention-enhanced Radar Semantic Segmentation), a novel framework that integrates three specialized modules to address radar-specific issues. In the encoder, the RADE module employs lightweight channel self-attention and depthwise convolutions to robustly encode noisy, anisotropic features. In intermediate layers, the RFAF module performs multi-scale feature fusion and region-level attention to isolate salient radar features. The decoder's RADM module combines state space models with axial self-attention to reconstruct segmentation masks with anisotropy and temporality-aware context. These components collectively suppress noise, disentangle range-Doppler features, and enforce spatial-temporal consistency. On the CARRADA dataset, MARSS achieves substantially higher performance than prior RSS methods, especially for small fast-moving targets.

Abstract:
High-precision dichotomous image segmentation (DIS) is a task of extracting fine-grained objects from high-resolution images.Existing methods trade efficiency for accuracy: non-diffusion methods are fast but suffer from weak semantics and unstable spatial priors, causing false detections; diffusion-based methods offer high accuracy via strong generative priors but are computationally expensive.In depth maps, a complete object appears as a low variance region with a smooth interior and sharp boundaries, whereas the background exhibits a chaotic, high variance pattern due to disconnected surfaces at varying depths. We refer to this as the depth integrity-prior.Inspired by this, and noting that DIS currently lacks depth maps, we leverage pseudo-depth information from monocular depth estimation models to obtain essential semantic understanding, thereby rapidly revealing spatial differences across target objects and the background.To exploit this prior, we propose the Prior-guided Depth Fusion Network (PDFNet), which fuses RGB and pseudo-depth features for depth-aware structure perception. We further introduce a novel depth integrity-prior loss to enforce depth consistency in segmentation and a fine-grained enhancement module with adaptive patch selection to sharpen boundaries.Notably, PDFNet with DAM-v2 achieves SOTA (F^ max _b 0.915 on DIS-VD and 0.915 on DIS-TE) using less than half the params of diffusion-based methods.

Abstract:
Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLMs reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conduct reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.

Abstract:
Zero-shot anomaly detection (ZSAD) aims to utilize auxiliary data to train models for generalized learning of unseen categories, which has important application value in fields such as industrial quality inspection and medical diagnosis. Although methods based on CLIP show potential, their pre-training objective of focusing on overall semantic alignment between images and text makes the model insensitive to local details, which is inherently contradictory to the need for fine-grained local features in anomaly detection. Existing improvement methods rely on predefined text prompt frameworks to perceive local information, but struggle to effectively address the issue of insufficient local perception. To address this, this paper proposes a dynamic local visual prompting method based on CLIP (DLVP-CLIP). DLVP dynamically identifies and extracts local visual features from key regions in images as prompt tokens using the Semantic-Aware Local Feature Selector (SLFS) module, and utilizes the multi-modal local prompt (MLoP) module to jointly optimize representations in both visual and textual spaces, achieving more precise cross-modal alignment. Additionally, the high-low frequency decomposition module (HFD) is introduced to separate and process global structural and local textural information via wavelet transformation, thereby enhancing detail perception. Extensive experiments on 13 anomaly detection datasets demonstrate that DLVP-CLIP achieves outstanding ZSAD performance on datasets from the industrial and medical domains.

Abstract:
Physically plausible reconstruction of human-object dynamics from a single video remains under-explored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with residual neural constitutive laws layered on expert elastic-plastic models and conditioned on per particle to capture heterogeneity and anisotropy. We stabilize monocular learning with structure-preserving 3D flow supervision and a progressive and loss-balanced training schedule. Our PhysHO reconstructs observed dynamics with high fidelity, and predicts future motion and simulates outcomes under novel human actions. Experimental results demonstrate robust human-driven interactions beyond gravity-only scenes.

Abstract:
Voxel-wise dose prediction is a critical yet challenging task in radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across institutions, scanners, and planning protocols. Meanwhile, large generative backbones pretrained on billion-scale visual data have learned strong geometric and structural priors that may transfer beyond their source domain. We propose DiffKT3D, an Any2Any 3D diffusion framework that transfers prior knowledge from pretrained video diffusion models to clinically meaningful dose prediction. To support flexible conditioning over heterogeneous clinical inputs, including CT, target and organ masks, body mask, and beam descriptors, we introduce a modality- and role-aware Any2Any formulation that avoids cross-attention while retaining a unified conditioning interface. We further apply reinforcement-learning (RL) post-training with a clinically informed Scorecard reward to align generations with institutional treatment preferences. On the GDP--HMM challenge, DiffKT3D reduces voxel-wise MAE from 2.07 to 1.93 relative to the top challenge solution, while also improving image quality and clinical preference alignment. These results show that pretrained diffusion priors, combined with flexible multimodal conditioning and scorecard-aligned post-training, provide an effective approach to RT dose prediction.

Abstract:
Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) for enhanced cross-view detection for VLM. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.

Abstract:
Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates anomaly-connected component mechanism and intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (i.e., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.

Abstract:
Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. This degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L respectively.

Abstract:
Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs).Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization.First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features.This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM.Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss.These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment.Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.

Abstract:
Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter-state changes while filtering static pixel-level noise. From this perspective, HiF-VLA equips a motion-centric world model for the VLA, enabling agents to reason about temporal dynamics for future evolution during action generation. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on LIBERO-Long and CALVIN ABC-D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.

Abstract:
Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.

Abstract:
Sparse activation layers, primarily Mixture-of-Experts (MoE) and memory-based modules, have become a central approach for scaling large models and are gaining traction in vision tasks. Despite conceptual similarities, these paradigms have evolved independently, hindering systematic comparison and the development of modules that exploit their complementary strengths. To bridge this gap, we propose OneSparse, a unified framework that reformulates MoE and memory modules under a common abstraction. This enables their systematic comparison and integration, revealing a continuous design space. Guided by this abstraction, we design the Nexus Layer, which features two key innovations: a unified routing mechanism that combines the efficiency of memory retrieval with MoE's load balancing to ensure stable and scalable token assignment, and an adaptive processing strategy where memory modules sketch coarse representations while expert modules refine critical regions. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate that our Nexus Layer establishes a new performance--efficiency frontier, surpassing representative sparse baselines across convolutional and Transformer architectures. These results validate the power of the OneSparse framework to unify and integrate complementary sparse paradigms and underscore the potential of hybrid sparse modeling in vision.

Abstract:
The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that the video generation model is the potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.

Abstract:
Multi-view multi-label classification (MVMLC) aims to utilize both consensus and complementarity information to predict potentially relevant labels for samples. Existing MVMLC approaches typically focus on either feature-level fusion, which integrates complementary features for more expressive representations, or decision-level fusion, which aggregates view-specific predictions to exploit label supervision more effectively. In fact, relying solely on feature-level fusion often underutilizes label information and limits discriminability of learned representations, whereas pure decision-level fusion pays insufficient attention to view representation expressiveness and thus constrains classification performance. To address these limitations, we propose DF^2-VB, a dual-level fusion framework that jointly exploits complementary strengths to mitigate their respective weaknesses by integrating feature- and decision-level fusion. At the feature level, a Fuzzy Dynamic Fusion (FDF) module maps consensus features into a more compatible fuzzy feature space, where essential features are identified and redundant features are suppressed to further fuse an expressive consensus representation and boost view-specific predictions for decision-level fusion. At the decision level, a View-specific Boosting (VB) strategy adaptively measures the importance of samples and view-specific predictions to strengthen the utilization of supervision for facilitating the discriminability in feature-level fusion. Complementarily, FDF and VB jointly reinforce the model expressiveness and discriminability for reliable predictions. Extensive experiments on multiple public datasets verify the superiority of our strategy over advanced MVMLC models.

Abstract:
Self-supervised learning (SSL) has achieved remarkable success across both NLP and CV domains. However, sign language translation (SLT) models still heavily rely on gloss annotations in gloss-based SLT or text annotations in gloss-free SLT (GFSLT) during pretraining, aiming to ensure that the backbone provides effective sign language (SL) features for the translation model. Such reliance restricts the scalability and generalization ability of the SLT model. One natural question arises: Can existing SSL methods be directly applied to the SL domain to train an effective sign feature extractor for downstream GFSLT tasks, eliminating the need for text annotations? In this paper, we propose a simple yet effective pretraining framework with two goals: (1) decoupling the pretraining process from gloss or text annotations, relying purely on sign frames; and (2) only global frames are required during inference for simplicity. We show that directly applying existing SSL methods yields suboptimal performance, as SL features involve subtle motion patterns and discriminative cues that are often confined to local regions. To achieve this, we introduce SignDINO, a simple yet effective sign-aware DINO training strategy that learns effective and semantically meaningful representations from global frames without any textual supervision. Specifically, a student-teacher architecture is employed, where the teacher model receives the global sign frame, while the student model learns from masked local views that preserve only the hand and facial regions. Such a simple design encourages the model to infer global semantics from discriminative local cues, allowing the teacher model to extract SL-related feature during inference solely based on global views. Extensive experiments on public SL datasets show that SignDINO achieves highly competitive performance on the GFSLT task without relying on extra cues or additional SL-related pretraining.

Abstract:
The rapid growth of the text-to-image (T2I) community has fostered a thriving online ecosystem of expert models, which are variants of pretrained diffusion models specialized for diverse generative capabilities. Yet, existing model merging methods remain limited in fully leveraging abundant online expert resources and still struggle to meet diverse in-the-wild user needs. We present DiffGraph, a novel automated agent-driven graph-based model merging framework, which automatically harnesses online experts and flexibly merges them for diverse user needs. Our DiffGraph constructs a scalable graph and organizes ever-expanding online experts within it through node registration and calibration. Then, DiffGraph dynamically activates specific subgraphs based on user needs, enabling flexible combinations of different experts to achieve user-desired generation. Extensive experiments show the efficacy of our method.

Abstract:
Lymph node metastasis diagnosis in pathological images is a highly challenging four-class classification task, comprising macrometastasis, micrometastasis, isolated tumor cells (ITC), and negative lesions.Unlike conventional classification settings, this four-class scenario simultaneously suffers from inter-class and intra-slide scarcity of minority information.Existing approaches based on CNNs or GNNs primarily emphasize node-level feature learning, making it difficult to capture high-order feature interactions and topological dependencies among cells, while also overlooking the representational insufficiency induced by class scarcity.To address these challenges, we propose a dual-level generative framework that integrates class-prompt priors with high-order structural modeling to enhance the representation capacity of minority classes.At the hypergraph level, we develop a prompt-guided hierarchical hypergraph variational autoencoder (HGVAE) capable of generating diverse and topologically consistent hypergraph representations for minority classes.At the hypernode level, we introduce an anchor-diffusion mixup strategy to enrich the minority node features of high-attention positive anchor nodes.Extensive experiments on the four-class NIMM dataset, as well as TCGA datasets, demonstrate that the proposed framework effectively alleviates feature scarcity and significantly boosts the classification performance of minority classes.

Abstract:
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

Abstract:
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4xspeedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production.

Abstract:
Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

Abstract:
Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design could reduces computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices.

Abstract:
Video snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video, and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Much of the resource is wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial-temporal aliasing introduced by coded exposure. To address this challenge, we propose DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. The inside detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Modules (MIM) for progressive feature refinement, and a Frequency Mamba (FM) module dedicated to frequency-aware query selection. MIM enhances features via multi-scale dilated convolutions and implicit representations, while FM restores discriminative details by decomposing and reweighting frequency bands. Experiments on the SportsMOT dataset show that DetectSCI achieves 80.9 Average Precision (AP), surpassing the best CNN-based detector by at least 2.8 AP and the best Transformer-based detector by at least 4.1 AP, while maintaining comparable efficiency. Code will be released.

Abstract:
Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving.

Abstract:
Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and rolling shutter effect.The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves SOTA performance on nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6% respectively. Additionally, on the Argoverse 2 validation set, we achieve a competitive mAP of 41.7%.

Abstract:
Unrestricted adversarial attacks based on generative models typically operate either directly in image space or through diffusion-style denoising and re-noising, which limits transferability and robustness against defenses. We revisit this problem through the lens of flow matching and continuous-time velocity fields, and propose AdvFM, a velocity-field attack that injects adversarial signals into the flow-matching dynamics instead of the pixel space. Given a noisy state x_t, AdvFM perturbs the reconstruction at t = 1 and converts this perturbation into a change of the velocity field, yielding a state update that amplifies the inner PGD step in the noisy space. We further introduce a lookahead variant that optimizes a two-point objective over the current and rolled-out reconstructions, reducing temporal mismatch along the ODE trajectory. From a theoretical perspective, we show that compared to diffusion-based attacks, AdvFM enjoys: (i) larger single-step increases in the black-box loss via step amplification, (ii) reduced gradient variance and stronger surrogate-target alignment due to Gaussian smoothing, enhancing its transferability, and (iii) perturbations that concentrate in robust-tangent directions, thereby aligning with robust gradients of adversarially trained models and surviving purification more effectively; the lookahead variant further lowers gradient noise for a two-point robust objective. Extensive experiments demonstrate that AdvFM achieves promising performance in both black-box transferability and a suite of adversarial training and purification defenses.

Abstract:
Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models (MLLMs) in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module (CRM) that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Grounded in advanced theoretical frameworks, including convergence guarantees for geometric alignment and spectral optimization, CogniVerse achieves robust cross-modal integration and adaptive knowledge utilization. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.

Abstract:
While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training.We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.

Abstract:
We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens to the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agonistic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.

Abstract:
Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects.To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.

Abstract:
Extracting reliable generative traces in generated images is critical for AI-generated images (AIGIs) detection. However, a fundamental challenge exists: AIGIs inherently contain generative traces with no trace-free counterpart available, making supervised extraction of these artifacts infeasible. In this work, we overcome this through a surrogate supervision framework. We design a dynamic reconstructor that simulates diverse generative traces on real images through stochastically varied architectures and parameters. The reconstruction residuals serve as supervision to train an extractor that learns to isolate traces, i.e., generative signatures (GenSign). A detector then fuses extracted GenSign with RGB features to distinguish real images from AIGIs. Our key insight is that sufficient architectural diversity in simulation enables effective transfer to real-world generators, resolving the absence of ground truth GenSign. Extensive experiments across four benchmarks demonstrate state-of-the-art generalization, confirming that our simulation-based learning paradigm is capable of extracting general and transferable forensic features.

Abstract:
Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporates an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by 1.6-1.8x with high accepted lengths, and accelerates various video understanding benchmarks by 3.36x on LLaVA-Onevision-72B and 2.42x on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.

Abstract:
Deep learning-based gaze estimation methods often exhibit significant performance degradation on unseen target domains. Through systematic frequency-domain analysis, we reveal that face images contain frequency components with distinct contributions: some facilitate cross-domain generalization while others introduce domain-specific interference that impedes it, with both components varying across datasets and constituting a key source of domain gap. Based on these observations, we propose the Frequency-Guided Adaptive Learning framework (FGAL), a novel framework enhancing domain generalization without accessing target domain data. The FGAL consists of two complementary modules: the Adaptive Interference Suppression Module (AISM) and the Spectrum Diversification Module (SDM). AISM adaptively suppresses sample-specific interfering frequency components through learnable modulation maps, while SDM diversifies frequency distribution patterns to enhance robustness against cross-domain variations. Experiments demonstrate that FGAL achieves substantial improvements, outperforming baselines by up to 28.2% and state-of-the-art methods by up to 19.5% across multiple cross-domain settings, demonstrating our framework's potential for broader domain generalization tasks.

Abstract:
Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder--DINO. To address this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.

Abstract:
High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10xmore compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and real-time autoregressive content propagation.

Abstract:
As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to provide complete scene perception based on an ultra-wide angle of view. Most PASS methods often focus on spherical geometry with RGB input or use the depth information in original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinate and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully represent 3D space in cylindrical coordinate style and the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and reduce the breakage of cylinder side surface expansion in ERP projection, using different fusion strategies to match different regions. Experimental results show that REL-SF4PASS considerably improves performance and robustness on popular benchmarks (e.g., Stanford2D3D Panoramic datasets). It gains 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.

Abstract:
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

Abstract:
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

Abstract:
Trajectory prediction in autonomous driving has traditionally been studied from a model-centric perspective. However, existing datasets exhibit a strong long-tail distribution in scenario density, where common low-density cases dominate and safety-critical high-density cases are severely underrepresented. This imbalance limits model robustness and hides failure modes when standard evaluations average errors across all scenarios. We revisit trajectory prediction from a data-centric perspective and present Den-TP, a framework for density-aware dataset curation and evaluation. Den-TP first partitions data into density-conditioned regions using agent count as a dataset-agnostic proxy for interaction complexity. It then applies a gradient-based submodular selection objective to choose representative samples within each region while explicitly rebalancing across densities. The resulting subset reduces the dataset size by 50% yet preserves overall performance and significantly improves robustness in high-density scenarios. We further introduce density-conditioned evaluation protocols that reveal long-tail failure modes overlooked by conventional metrics. Experiments on Argoverse 1 and 2 with state-of-the-art models show that robust trajectory prediction depends not only on data scale, but also on balancing scenario density.

Abstract:
When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone.We present Graph-to-Frame RAG (G2F-RAG), a training free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail.G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning.

Abstract:
Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform badly on static-only scenes. We present Wildpose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update operator by integrating a frozen, pre-trained MASt3R feature backbone and training the operator's subsequent layers on a diverse curriculum of static and dynamic data. Second, we propose a high-capacity motion mask detector that leverages rich, multi-level 3D-aware features from the same frozen backbone. Extensive experiments show Wildpose consistently outperforms prior methods across a wide variety of benchmarks, including dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) datasets.

Abstract:
Aligning Diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then perform an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220xfaster convergences than the baseline preference optimization methods, highlighting its extremely high efficiency.

Abstract:
Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal to enhance an image future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators. Dataset and code will be publicly released upon publication.

Abstract:
Despite their impressive performance in multi-label classification of chest X-ray images (CXR), deep learning models are widely plagued by two types of spurious correlations: feature confounding arising from pathological co-occurrence and shortcut learning triggered by non-pathological visual confounders. These non-causal dependencies severely undermine the interpretability and robustness of models in real-world clinical settings. To address these challenges, we propose the Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification (DARC) framework, the first to synergistically decouple both types of confounding sources from a causal mechanism perspective. At the data level, we construct CheXconf, the first pixel-level annotation dataset of non-pathological visual confounders in CXR, comprising 40,213 annotated instances across 11 categories. This provides a solid foundation for accurately modeling these confounders. At the methodological level, we design a novel dual-stream causal learning architecture. Its Global Stream leverages the back-door adjustment criterion with CheXconf to explicitly block spurious paths from non-pathological confounders. Concurrently, the Local Stream employs counterfactual reasoning, constrained by anatomical priors, to disentangle the visual coupling of co-occurring pathologies. Experiments on large-scale public benchmarks demonstrate that our method achieves significant improvements in task performance, interpretability, and robustness.

Abstract:
High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness.

Abstract:
Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naively finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.

Abstract:
Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow(RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., Pi0) , providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results have demonstrated that our RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.

Abstract:
This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed as domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

Abstract:
Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor should be the same as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev's parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.

Abstract:
Vision-based end-to-end (E2E) driving has garnered interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios paired with existing open-loop evaluation metrics that fall short in capturing the multimodal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life with an occurring frequency of less than 0.03%. Each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional distance-based metrics, RFS measures how closely a predicted trajectory matches rater-annotated trajectory preference labels. WOD-E2E includes rater preference labels for validation, and a separate held out test set is used for the 2025 WOD-E2E Challenge benchmark leaderboard.

Abstract:
Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and -24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.

Abstract:
Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotatory positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines in these downstream tasks (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: github.com/gustavochau/D-RoPE

Abstract:
Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2-3 orders of magnitude less training data using a small base VLM on modest hardware.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising progress in table reasoning from visual table inputs. Despite their ability to capture rich visual cues such as color and layout, MLLMs still underperform compared to text-only models.We argue that a major limitation lies in the pre-training process, which inadvertently weakens the model's intrinsic reasoning ability and consequently hinders the effectiveness of reinforcement fine-tuning on table reasoning tasks.In this paper, we introduce TableMix, a novel framework that tackles this challenge from a data-centric perspective. At the core of TableMix is a principled data mixing strategy. Specifically, TableMix constructs a hybrid dataset that combines: (1) multimodal table reasoning data to improve task-specific reasoning, (2) text-only mathematical reasoning data to revive the model's logical competence, and (3) simple multimodal perception data to preserve visual grounding.Recognizing the non-uniform difficulty of mixed data, we further propose a Difficulty-Aware Reward Shaping (DRS) mechanism, which enables the Group Relative Policy Optimization (GRPO) algorithm to adaptively reward concise reasoning for easy problems while encouraging more elaborate reasoning for complex ones, thereby reducing redundant computation and errors.Extensive experiments show that TableMix markedly enhances the reasoning ability of MLLMs, outperforming strong multimodal baselines and even rivaling state-of-the-art text-only models.

Abstract:
Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human-computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference-based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.

Abstract:
Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.

Abstract:
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

Abstract:
Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by 40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.

Abstract:
Recent Vision-Language Models (VLMs) have become increasingly susceptible to jailbreak attacks, where adversarial prompts exploit subtle manipulation to circumvent safety alignment.The diversity and adaptability of such jailbreakers necessitate a defense mechanism with strong generalization capability.However, fine-tuning large-scale VLMs is computationally expensive, and introducing excessive visual or textual defense prompts is impractical for preserving image realism and model usability.We propose SafeLogo, which tunes a logo-sized visual prompt into a universal shield against diverse jailbreak attacks through micro-regional adversarial training.We are the first to integrate min-max adversarial optimization into visual defense prompt generation.Specifically, in the outer loop, SafeLogo injects compact, bounded perturbations into extremely small image regions (\leq 2% pixel coverage), effectively preserving both visual fidelity and semantic consistency.Meanwhile, overcoming the limitations of prior defenses constrained to a single attack direction or fixed benign supervision, the inner loop dynamically generates and selects the strongest one from a variety of jailbreakers.Extensive experiments on LLaVA-1.5-13B,MiniGPT-4, and Qwen3-VL show that SafeLogo markedly lowers jailbreak success rates on MM-SafetyBench, VLGuard, and FigStep, while preserving benign performance on MM-Vet and MME.

Abstract:
Autoregressive (AR) models have achieved remarkable success in natural language processing, yet their application to image generation faces significant challenges. When implementing VQ-based decoders for autoregressive image generation, the generated images typically preserve semantic information but struggle with fine-grained details. Recent hybrid AR image generation approaches address these issues by integrating diffusion models as decoder heads, enabling more high-fidelity generation. However, the diffusion-based denoising process introduces significant computational overhead during inference.To accelerate hybrid AR image generation, we propose the Lookahead Decoding Strategy, which integrates the strengths of autoregressive and diffusion models by separating the process into two complementary branches: semantic prediction and detail refinement. The autoregressive branch captures high-level semantic structures while refining coarse predictions made by the parallel. Furthermore, we introduce Guided Diffusion Sampling to steer the diffusion denoising trajectory, significantly reducing the number of denoising steps. Extensive experiments demonstrate that our approach provides an effective solution for accelerating hybrid AR image generation models.

Abstract:
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, mixing softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM and evaluated on VBench, VBench2.0 and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos.

Abstract:
Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively, is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. Experiments show that DiffusionHarmonizer substantially improves perceptual realism, being chosen by 84.28% of users in our comparative study over the second best method. Furthermore, it matches the temporal coherence of state-of-the art video models while maintaining the inference efficiency of single-step image models, offering a scalable and practical solution for photorealistic simulation in both research and production settings.

Abstract:
Modern mapping tools remain fundamentally point-and-click systems, offering limited support for interactive, multimodal engagement with either map views or real-world camera inputs. This limitation becomes most apparent in the last 100 meters of navigation and during in-situ local discovery, where users naturally combine spatial reasoning over maps with egocentric perception of their surroundings. We introduce IMAIA, an interactive geospatial assistance framework that unifies three core capabilities: (1) map-centric spatial understanding, (2) camera-to-place grounding, and (3) human-centered, bearing-aware navigation. IMAIA consists of two interoperable components--Maps Plus and PAISA--coordinated by a lightweight multi-agent orchestrator. Maps Plus converts vector and satellite maps into a grid-aligned representation that enables flexible, view-conditioned spatial queries. PAISA fuses camera imagery with geospatial signals such as location, heading, and proximity to interpret egocentric scenes and surface relevant attributes of nearby places. The framework is fully modular: vision-language backends can be replaced without altering system behavior. IMAIA enables responsive, interpretable, and context-aware assistance across both map-based and real-world interactions, supporting a new class of interactive mapping experiences that are practical for real-world deployment.

Abstract:
Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES--Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos--a unified framework tailored for this task. TROPHIES features a Human Branch that models human through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.

Abstract:
Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce SIGMA, a new physics-informed dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney mask for detection and (ii) paired degraded and ground-truth image for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.

Abstract:
Medical imaging remains challenging due to acoustic shadows, motion blur, and indistinct boundaries, while adapting vision models to such domains often requires heavy task-specific fine-tuning and can degrade general-image capability. We propose DCRM-ViT, a domain-conditioned residual modulation framework for Vision Transformers that preserves general-vision knowledge while adapting to diverse medical and natural domains. DCRM-ViT keeps the backbone frozen and augments each block with lightweight Residual Modulation Blocks (RMBs), whose parameters are synthesized per sample by a Domain Router (DR) and a Parameter Synthesizer Network (PSN). The DR predicts soft domain weights from input features, and the PSN maps them to low-rank residuals that modulate selected projections and optionally add a domain-aware attention bias. We train the model with a bi-level optimization scheme, where an inner loop adapts RMBs to task supervision and an outer loop updates DR, PSN, and RMB initialization to improve cross-domain generalization. Across fine-grained classification (Food101, SUN397, Stanford Cars) and medical segmentation (ultrasound, CT, MRI), DCRM-ViT consistently outperforms strong baselines with modest trainable compute. Ablation studies confirm the contribution of the proposed components. Overall, DCRM-ViT achieves strong cross-domain performance with low overhead, requiring only 4.7 GFLOPs and 0.3 min/epoch.

Abstract:
Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes--up to 11xfaster than NeRF-based approaches--and achieves consistent 16xcompression over the original voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.

Abstract:
While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.

Abstract:
Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward & backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.

Abstract:
Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.

Abstract:
Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

Abstract:
Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

Abstract:
Forecasting fetal growth from sequential ultrasound examinations is essential for personalized prenatal care. Existing medical vision-language models (MLLMs) are limited to single-phase/organ evaluations and qualitative reasoning, neglecting longitudinal history and precise continuous biometric values. To address this gap, we introduce the novel task of multi-phase fetal growth forecasting and report generation. To support this task, we first present GrowthFetus, the interesting multi-phase, multi-organ fetal ultrasound dataset to date, containing 9,280 examinations from 2,000 fetuses. Based on this dataset, we propose F^2-Assist, a unified MLLM framework with three key components: (i) a Cross-Phase Organ Alignment module for for heterogeneous multi-organ feature fusion across phases, (ii) a History-Aware Temporal Encoding module for modeling irregular temporal dynamics, and (iii) a Growth Parameter Adapter that encodes continuous biometric values as differentiable tokens for numerically precise reasoning. Extensive experiments show that F^2-Assist achieves temporally coherent predictions and clinically consistent reports, significantly outperforming state-of-the-art MLLMs. Our study establishes a practical framework for longitudinal ultrasound analysis, bridging growth forecasting and report generation in a unified model.

Abstract:
Unsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing on fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground truth algebraic surfaces with closed form expressions, into a lightweight student UDF inside Gaussian optimization process. We design band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student training and the recovery of high-frequency geometric details. In addition, we introduce a visibility- and geometry-aware confidence weighting that modulates teacher influence, further steering the student toward accurate surfaces in ambiguous or weakly constrained regions. Extensive experiments on various datasets demonstrate that our approach consistently improves reconstruction accuracy while maintaining competitive efficiency compared to existing UDF- and SDF-based methods.

Abstract:
Despite recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, we introduce PromptLoop, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking while introducing only a practically negligible inference overhead.

Abstract:
Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

Abstract:
Numerous techniques have been proposed for generating adversarial examples under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only a few methods specifically explore perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained robust models across CIFAR-10, CIFAR-100, and ImageNet while considering visual perception metrics (e.g. SSIM, FID, LPIPS) in the perturbation budget (instead of Lp-norm). Despite relying solely on Lp-constrained based methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of 11, 0.015, and 5.7, respectively). DASH generalizes well to unseen defenses and different white-box/black-box scenarios, making it a practical and strong baseline for evaluating robustness.

Abstract:
Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.

Abstract:
Large Language Models are increasingly integrated as the cognitive core of 3D embodied agents to enable complex environmental reasoning. However, these agents tend to inherit the critical flaw of hallucination, often failing to ground their responses to their 3D view. While Visual Contrastive Decoding (VCD) is a powerful training-free method for mitigating hallucinations in 2D image-based models, it has not been adapted to the complex 3D embodied environment. In this paper, we embarked on the ambitious goal of being the very first to bridge this gap by introducing a VCD framework for 3D embodied agents. Our method operates at inference time by generating a "negative" 3D context, not by blurring an image, but by applying novel distortions directly to a 3D scene graph, such as swapping object category labels or noising positional coordinates. We evaluate our approach on standard evaluation benchmarks and find that it consistently outperforms existing models. For example, in the random category of 3D-POPE, our 3D-VCD method reduced Yes-rate from 99.9% to 75.1% while simultaneously increasing precision from 50.0% to 62.2%. These results demonstrate that our training-free approach effectively curbs hallucination, yielding 3D agents that are significantly more reliable and grounded.

Abstract:
Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.

Abstract:
Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with a end-to-end 800 ms latency on a single 5090 GPU.

Abstract:
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.

Abstract:
Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy.

Abstract:
Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.

Abstract:
Despite the significant progress of Multi-modal Large Language Models (MLLMs) across diverse tasks, hallucination, which corresponds to the generation of visually inconsistent objects, attributes, or relations, remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs; However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases.To address this, we propose KVSmooth, a training-free, plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache while dynamically quantifying the sink degree of each token through its attention distribution entropy to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination (\mathit CHAIR _ S from 41.8 -> 18.2) while improving overall performance (F_1 score from 77.5 -> 79.2), achieving higher precision and recall simultaneously, whereas prior methods often sacrifice one for the other, thereby validating the effectiveness and generality of our method.

Abstract:
The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network's focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. It is worth noting that our method achieves state-of-the-art performance on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.

Abstract:
Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

Abstract:
Pretraining on large-scale data followed by fine-tuning has become a standard paradigm for visual models. However, noise in the pretraining data can be absorbed by the model and carried into downstream tasks, causing catastrophic inheritance. Prior studies mainly link this issue to changes in the feature spectrum, arguing that noise reduces the strength of key feature components. Following this view, they aim to improve transferability by amplifying these components. However, these approaches focus only on spectral energy and implicitly assume that the feature directions remain fixed, which does not hold in practice. In this work, we revisit this view and reveal an overlooked effect: even mild pretraining noise can cause a clear rotation of the dominant feature subspace, despite negligible spectral energy degradation. To quantitatively characterize this phenomenon, we propose using the Principal Directional Angle (PDA) to measure the directional shift between the clean and noisy models. Building on this observation, we introduce the Feature Geometry Stabilization (FGS) framework, which aims to counteract the subspace rotation revealed by PDA by enhancing the geometric stability of the feature space through the synergistic interaction of perturbation consistency, variance-activation regularization, and feature consistency distillation. Experiments across multiple visual benchmarks demonstrate the effectiveness of FGS and verify the importance of stabilizing feature geometry to mitigate catastrophic inheritance.

Abstract:
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite less refined along individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks.

Abstract:
Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.

Abstract:
Gaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations.While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources.In this paper, we propose a novel Gaussian Splatting-based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it.Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the underlying unlit scene. Furthermore, instead of adjusting the illumination of each individual Gaussian as in prior work, we employ a tile-based shading scheme that operates directly on the rendered images, greatly reducing computational cost while explicitly separating illumination from intrinsic scene appearance.Additionally, we further refine the learned Gaussian representation by combining the recovered unlit scene appearance with an advanced geometric prior model, which significantly improves geometric accuracy.Experimental results demonstrate that our method achieves superior reconstruction quality in challenging environments compared to state-of-the-art techniques.

Abstract:
Infrared video acquisition inherently suffers from low spatial resolution and limited frame rates due to the physical constraints of thermal imaging sensors. These limitations make infrared video enhancement uniquely challenging, as it requires restoring spatial details and temporal continuity from highly undersampled thermal signals. To address this challenge, we propose THERIS, a unified THERmal-physics inspired framework for Infrared spatial-temporal video Super-resolution. Grounded in the physical principles of thermal diffusion, THERIS leverages heat conduction dynamics that govern the spatiotemporal evolution of infrared pixel intensities. Specifically, the proposed Thermal Diffusion Interpolation Module (TDIM) treats temporal feature sequences as one-dimensional heat fields and performs frequency-domain diffusion to synthesize temporally coherent intermediate frames. Building on this foundation, the Thermo-Aware State Space Module (TSSM) refines spatiotemporal representations through learnable spectral filtering and selective state-space modeling, while maintaining consistency guided by the thermodynamic prior inherited from TDIM. Additionally, a Temperature Field Modeling Loss is introduced to enforce adherence to the heat conduction equation, promoting temporal coherence and spatial stability in the generated results. Extensive experiments demonstrate that THERIS achieves state-of-the-art performance while producing visually coherent results. To facilitate further research in the infrared video processing domain, we also introduce IRVAL, a high-resolution dataset comprising 108,512 video frames at 512x512 resolution.

Abstract:
We present Pressure2Motion, a novel motion capture algorithm that reconstructs human motion from a ground pressure sequence and text prompt. At inference time, Pressure2Motion requires only a pressure mat, eliminating the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminacy of pressure signals with respect to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint to resolve ambiguities. Specifically, our model adopts a dual-level feature extractor to accurately interpret pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion estimation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion reconstruction, and the established MPL benchmark is the first benchmark for this novel motion capture task. Experiments show that our method generates high-fidelity, physically plausible motions, establishing a new state of the art for this task. The codes and benchmarks will be publicly released upon publication.

Abstract:
3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects.Source code will be released upon publication.

Abstract:
Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a 3.8x speedup with lossless image quality.

Abstract:
The increasing complexity of real-world deployment requires intelligent agents to effectively adapt to non-stationary data streams with stochastic increments under data scarcity. We formally define this challenge as the Few-Shot Hybrid Incremental Learning (FSHIL) paradigm, which reveals a critical stability-plasticity dilemma. Existing strategies struggle to address this dilemma: representation freezing in few-shot incremental learning can mitigate overfitting under data scarcity but leads to insufficient representation plasticity, while architecture expansion in hybrid incremental learning provides plasticity for adaptation but results in overfitting under few-shot conditions. To address this, we propose the Conditional Meta-Expanding Mixture-of-Experts (CME-MoE), which balances feature-level stability-plasticity trade-off through conditional expert reuse and meta-expansion mechanism. Furthermore, recognizing the multi-domain manifestation in the latent space, we introduce the Self-Expanding Prototype Classifier (SEPC), which on-demand expands classification to model complex domain-shifted decision boundaries. The proposed method outperforms existing state-of-the-art methods in three few-shot incremental learning settings across five mainstream datasets, effectively addressing data scarcity and task uncertainty, and providing a robust solution for real-world continual learning.

Abstract:
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming pi0 and pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs.

Abstract:
Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera's pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.

Abstract:
Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models (MGMs) remains challenging. The core factor is that policy optimization requires the probability likelihood of each step due to its multi-step iteration process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for MGMs by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute policy optimization on them. Furthermore, we design a dynamic routing sampling based on entropy to encourage the model to explore more valuable masking strategies. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.

Abstract:
Novel Class Discovery in Point Cloud Segmentation is recently proposed, aiming to leverage knowledge from known classes to automatically segment unlabeled classes within point clouds. The core of this task lies in leveraging the geometric and semantic knowledge of multiple known classes to achieve semantic understanding and segmentation of novel classes.However, existing methods overlook the high-order associations between known and novel classes, relying solely on binary associations for class assignment and novel class reasoning, which leads to less precise semantic segmentation.To address these issues, we introduce a hypergraph structure to model high-order associations among classes, enabling collaborative reasoning from known classes to novel classes, extending beyond traditional binary relations.Additionally, existing methods focus excessively on extracting semantic information when processing point cloud data, neglecting the importance of geometric features. To address this, we introduce Geometric-Aware Prototypes, enhancing the model's ability to capture geometric spatial information.By propagating geometric information through hyperedges, our method enhances the understanding of spatial distributions across classes, improving segmentation accuracy.Significant performance improvements achieved on the SemanticKITTI and SemanticPOSS datasets demonstrate the superiority of our method.

Abstract:
Most of the multi-agent video understanding frameworks adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, we adopt a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1)Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query. (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content. (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers. Moreover, we equip our CPP paradigm with Multi-Agent Reinforcement Learning (MARL). Consequently, policy agents can be jointly optimized to enhance the performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks on four tasks. Notably, on LongVideoBench, our method outperforms Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

Abstract:
Multimodal Emotion Recognition in Conversations (MERC) aims to understand emotions expressed in each utterance by effectively integrating audio, text, and visual modalities. However, in real-world scenarios, unavoidable missing modalities often degrade multimodal interpretation performance. To address this, we propose Hypergraph Diffusion and Evidence Fusion based Emotion Recognition (HyperEF), a novel framework designed to mitigate challenges arising from incomplete modalities in MERC. Specifically, to mitigate performance degradation caused by modality absence, we propose Masked Hypergraph Attention (MHGAT) conditioned diffusion model to recover latent features of missing modalities in the latent space. To ensure semantic consistency between recovered and available modalities within the same utterance, we introduce MHGAT that captures high-order semantic information from available modalities to guide the diffusion model's denoising process. Furthermore, to disentangle and model the complex uncertainties inherent in MERC, we propose Dual Channel Evidence Fusion (DCEF), which estimates uncertainty at both feature source level and discriminative level, thereby achieving adaptive evidence fusion. Extensive comparative experiments and interpretability demonstrate the superior performance of our model in emotion recognition, as well as the contribution of each module within the model.

Abstract:
Modern detectors typically deepen backbones and rely on aggressive downsampling to harvest high-level semantics. But this severely degrades low-energy infrared tiny targets via rescale-induced information loss. This work introduces InvDet, a target-aware invertible encoder that unifies information preservation and target-aware enhancement within a reconstruction-guided detection framework. An invertible pathway reconstructs the input from feature latents, exposing information loss as an optimizable quantity. To decouple detection from irrelevant reconstruction, a Target-Aware Reconstruction Modulation (TARM) module operates only in the inverse path, gating high-pass latents and applying a mild gain to low-pass features without altering the forward detection distribution. In addition, a Geometry-Content Tolerance Metric (GCTM) is proposed to focus on truly informative regions and yields a pixel-wise weight map that gently regularizes the reconstruction branch. Our method achieves competitive accuracy on five public infrared benchmarks while exhibiting strong cross-dataset generalization, providing a principled pathway toward detection-friendly representation learning for scale-challenged visual regimes.

Abstract:
The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose _FastLightGen_, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.

Abstract:
The research frontier in human pose prediction (HPP) is advancing toward continual test-time adaptation (TTA), where models must self-adapt to dynamic test distributions. To date, the homeostatic continual TTA remains the sole viable solution, which isolates the model parameters and update domain-sensitive ones. Despite mitigating full-body domain gaps, human anatomical heterogeneity (domain shifts often localize to specific regions) is ignored. This anatomical-agnostic approach forces uniform parameter adaptation across kinematically distinct segments, causing: over-adaptation of stable regions and under-adaptation of shift-prone articulations. To address it, we introduce TT-HA, a novel Test-Time Heterogeneous Adaptation that implicitly estimates domain changes for anatomical segments, and adapt the corresponding parameters. Building on human anatomy, TT-HA partitions parameters into five anatomical subsets using fisher information matrix-based parameters uncertainty analysis. During testing, TT-HA uses the instance normalization statistics and Earth Mover's Distance (EMD) to quantify segment-wise domain changes, dynamically determining which segment-specific parameters to adapt and to what extent. When substantial domain shifts are detected, TT-HA restores only affected segments to source-trained values, ensuring robust adaptation without full parameter resetting; minor shifts trigger the fine-tuning of corresponding parameters while preserving remaining ones. Experiments show TT-HA's superior full-body accuracy with greater limb error decrease than prior methods, proving its anatomically-targeted efficacy.

Abstract:
Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constraining the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more recent depth images which are usually more accurate, and progressively narrow the band to tighten the constraint on Gaussian movements. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.

Abstract:
Text-image to video generation aims to synthesize a video conditioned on the given text-image inputs. Nevertheless, existing methods generally assume that the semantic information carried in the input text and image tends to be perfectly paired and temporally aligned, occurring simultaneously in the generated video. As such, existing literature struggles with out-of-distribution (OOD) "unpaired" text-image inputs in the more universal and realistic scenario where i) the semantic information carried by the text and image may occur at different timestamps and ii) the condition image can appear at an arbitrary position rather than the first frame of the synthesized video. Video generation under this OOD setting poses an urgent need to conduct reasoning over the intrinsic connections between the given textual description and referred image, which is challenging and remains unexplored. To address the challenge, in this paper we study the problem of unpaired text-image to video generation for the first time, proposing ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff designs a VisionNarrator module to harness the powerful reasoning abilities of a multi-modal LLM to analyze the unpaired text-image inputs, producing coherent per-frame narratives that temporally align them. Building upon this VisionNarrator module, ReasonDiff further introduces a novel AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are subsequently fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff beats state-of-the-art baselines in terms of video generation quality with unpaired text-image inputs.

Abstract:
Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice-versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, where the subject refers to general pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.

Abstract:
Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.

Abstract:
Digital image forensics can ensure information credibility in tasks like camera source identification (CSI), synthetic image detection (SID), and social network provenance (SNP). These tasks typically rely on image processing history clues left by in-camera operations, post-capture editing, or synthetic generation. However, most existing forensic methods have obvious limitations: 1) they often only focus on camera-specific traces (e.g., the well-known PRNU), and 2) they demand a substantial amount of annotated training data. To address these constraints, we propose Editprint, a novel general forensic feature that captures highly diverse in- and out-camera processing history clues with minimal unlabeled training data. Ideally, we expect that any images undergoing the same imaging, editing, and transmission processes would yield identical Editprints, and vice versa. To model the in- and out-camera operations, we devise an online editing pool based on self-augmentation strategies. Requiring only minimal (e.g., 10) training data, the editing pool can simulate massive (e.g., 10^\text 7 ) editing chains and traces arising from the in-camera processing and the subsequent out-camera operations. To ensure that Editprint exhibits high discriminative capabilities across various editing chains, we propose using textual descriptions of these chains as labels and supervising their Editprints through language-guided contrastive learning. Extensive experiments show Editprint outperforms existing self-supervised forensics, particularly in non-camera applications such as SNP and SID. We hope that Editprint would inspire the forensic community and serve as a novel benchmark for self-supervised forensics.

Abstract:
Recent advances in gigapixel-level imaging have brought High-Resolution Wide shots to the forefront of research. However, these images present significant challenges: extreme sparsity of foreground, gigapixel-level resolutions and diverse target counts. This makes traditional close-up detectors inaccurate and slow as they are overwhelmed by the background. Although previous research has explored sparse backbones, their fixed sparsity patterns lack the adaptability required to handle diverse target numbers. To address this, we introduce ElasticFormer, a sparse backbone that dynamically allocates computational resources based on foreground proportion. After scoring windows based on variance, proposed ElasticSelector module will predict the foreground proportion for top-k selection. The mechanism guides the model to select target-containing windows, scaling resources in areas where objects are clustered. We introduce a novel loss function combined with the 3-phase training strategy for ElasticSelector, allowing it to function properly when bounding box annotations are missing. A WSOD study is carried on PASCAL VOC 2007 to evaluate its extensibility. Further, ElasticNet is created to verify its backbone-agnostic nature. In experiments on the PANDA gigapixel benchmark, ElasticFormer reduces backbone FLOPs by 80% while achieving a significant improvement in AP_ 50 when compared to fixed-ratio sparse methods.

Abstract:
We introduce Perception Encoder-Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Building on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects--avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. Models and code are available.

Abstract:
LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghost), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal rely on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhance downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction).

Abstract:
In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.

Abstract:
With the rise of pre-trained vision-language models such as CLIP, performing video anomaly detection (VAD) through cross-modal reasoning has become an emerging trend. However, we observe that CLIP still suffers from weak abnormality awareness: normal and abnormal descriptions are highly entangled in the text embedding space, causing video features to assign nearly indistinguishable similarity scores to both types of prompts. To address this issue, we propose Alert-CLIP, an abnormality-aware latent-enhanced tuning framework that tailors CLIP for VAD. Alert-CLIP introduces a multi-level alignment strategy: (1) video-label alignment, which reshapes the semantic space to establish a coarse-level foundation for abnormality awareness; (2) region-text alignment, which explicitly associates anomaly-related regions with their detailed descriptions to strengthen fine-grained perception; and (3) region-semantic alignment, which further contrasts anomalous regions against multiple hard negative samples, enhancing abnormality-aware discrimination.Extensive experiments on four benchmarks demonstrate that Alert-CLIP consistently surpasses vanilla CLIP across supervised, zero-shot, and open-vocabulary settings, providing a solid foundation for future CLIP-based VAD research.

Abstract:
X-ray computed tomography (CT) suffers from severe metal artifacts when high-attenuation objects such as dental fillings or orthopedic implants are present. These artifacts originate from the polychromatic nature of X-rays, where attenuation varies strongly with photon energy and material composition, breaking the monochromatic assumption used by conventional reconstruction algorithms. Recent neural rendering approaches attempt to address this mismatch through differentiable polychromatic projection models, but they still struggle with smoothness bias, loss of fine structures, and prohibitive computation when extended to large-scale cone-beam CT. We introduce a splat-based metal artifact reduction framework that incorporates a physically grounded polychromatic forward model into a continuous Gaussian representation for cone-beam CT. Each Gaussian encodes the energy-dependent attenuation of the underlying material using a compact material parameterization, which enables efficient joint optimization of geometric and material properties without relying on a metal mask. This compact attenuation formulation captures the essential variation across biological tissues and metallic implants, allowing our model to explain metal-induced nonlinearity while preserving high-frequency structure. Experiments on simulated and real cone-beam CT scans show that our method converges significantly faster and suppresses metal artifacts more effectively than existing reconstruction and neural field-based approaches.

Abstract:
Diffusion-based virtual try-on methods rely on vast high-quality garment-person pairs, which are scarce in practice due to the high cost of data collection and preprocessing, limiting their performance in real-world scenarios.To overcome this bottleneck, we propose Cycle-Consistent Virtual Try-On (CCVTON), a diffusion-based approach that enables effective training using massive in-the-wild person images. Specifically, CCVTON introduces a Cycle-Consistent Learning (CCL) strategy that just employs a single unified generative model to disentangle a garment from a person image (try-off branch) and transfer it to the same individual (try-on branch), forming a reconstruction cycle. To this end, we first warm up a Unified Diffusion Transformer (UDiT) on open-source paired data to acquire basic try-on and try-off capabilities. When adapting UDiT to in-the-wild person images, we employ a Multi-Criteria Filtering Operation to select high-quality garments disentangled from person images by the pretrained UDiT. These filtered garments are not used as inputs for CCL, but serve as soft constraints for a perceptual regularization loss, preventing the try-off branch from collapsing to trivial copying. In addition, we propose garment-aware mask generation with a two-stage refinement process to suppress garment leakage while maintaining person consistency. Extensive experiments show that CCVTON achieves state-of-the-art results.

Abstract:
Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process.Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer.Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost.

Abstract:
Egocentric Action Anticipation aims to infer future actions from videos, which is crucial for embodied AI systems. However, its advancement is hindered by the inherent stochasticity of the future, which introduces significant prediction uncertainty. Prevailing methods typically adopt an end-to-end approach to model holistic spatiotemporal contexts, yet they often lack explicit semantic reasoning capabilities, making it difficult to handle open-ended future uncertainties. To address these challenges, we propose a Prototypical Action Reasoning Framework Facilitated by Vision-Language Alignment (PAR-VLA), which leverages the semantic alignment capability of vision-language models to learn disentangled visual prototypes for verbs and nouns. These prototypes serve as robust semantic anchors, transforming the unconstrained temporal prediction problem into a conditional forecasting task guided by well-defined semantic concepts. Our multi-stage framework first extracts visually-grounded and text-aligned prototype groups from a VLM, learning multiple prototypes per category to capture intra-class diversity. Subsequently, a novel Prototypical Context Reasoning-guided Verb-Noun Encoding branch dynamically retrieves the most relevant verb and noun concepts based on visual observations and explicitly models their interactions to guide temporal anticipation. Furthermore, we introduce Dual-Stream Symbiotic Predictive Decoders to more finely capture the interdependencies between verbs and nouns during the prediction process. Experiments Results demonstrate that PAR-VLA achieves state-of-the-art performance and exhibits a strong capability in dealing with future uncertainty.

Abstract:
Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers---guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.

Abstract:
Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2x training speedup and 2.4x GPU memory reduction without compromising generative performance.

Abstract:
As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage.Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness.However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs.Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness.To better balance safety and utility, we propose \texttt NullSteer , a null-space projected activation defense framework.Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities.Extensive experiments show that \texttt NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15% on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.

Abstract:
Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO , a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.

Abstract:
3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) Undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density-- causes 3DGS overfitting these low-frequency zones, producing blur and floating artifacts. In this work, we integrate fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: Distortion increases toward the periphery, and 3DGS's original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views--a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.

Abstract:
Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS [??] introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home https://sites.google.com/view/reartgs2/home.

Abstract:
3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D--nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.

Abstract:
In recent years, face-based authentication methods are gradually replacing traditional methods across various applications, offering enhanced security and user convenience. However, these methods are threatened by the continuously evolving DeepFake techniques. In this paper, a novel Visual Speaker Authentication (VSA) approach based on dynamic lip-prints is proposed to improve system security against diverse attacks. The lip-prints are discriminative viseme segments that capture user's localized speaking habits. By leveraging these dynamic lip-prints, this approach can expand the prompt set without requiring additional user-recordings or model retraining, thereby strengthening resilience to replay attacks. Moreover, a Multi-Layer Dynamic-Enhanced Encoder is introduced to model fine-grained lip dynamics, addressing data scarcity challenges and ensuring robust feature extraction even in scenarios with short temporal spans and limited enrollment data. We have carried out extensive experiments on several datasets and the results have demonstrated the effectiveness of the proposed method in both security enhancement and prompt set scalability.

Abstract:
Multimodal semantic segmentation (MMSS) faces significant challenges in real-world applications due to incomplete, degraded, or missing sensor data. To address this, we propose RobustSeg, an efficient teacher-student framework that enhances model robustness under missing-modality conditions while maintaining strong performance in full-modality scenarios. RobustSeg adopts a feedback-based self-distillation paradigm consisting of two complementary stages. Firstly, we introduce Hybrid Prototype Distillation (HPD), which enables more reliable knowledge transfer of both cross-modal and modality-specific aspects. Concretely, combined with dominant-modality selection, HPD performs cross-modal semantic distillation with high-level semantic prototypes to reduce modality bias. Meanwhile, HPD conducts intra-class feature variation distillation for modality-specific structural details. Secondly, to enable the teacher model to gradually produce more balanced and robust modality representations, we make the student model provide feedback from the non-dominant modality to the teacher, benefiting the entire distillation process. Experiments on three datasets demonstrate that our method achieves state-of-the-art robustness (e.g., +2.40% missing-modality performance on DeLiVER) while causing almost no degradation in full-modality performance (only -0.1% mIoU). Moreover, evaluations using different backbones (AnySeg and CMNeXt) further validate the generalization ability of RobustSeg.

Abstract:
Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.

Abstract:
Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.

Abstract:
Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC)-a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D^2-Align), a novel framework that mitigates PMC by directionally correcting the reward signal.Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen.This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D^2-Align achieves superior alignment with human preference.

Abstract:
Recent advances in vision-language models have shown strong performance across diverse multimodal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the waste of substantial computational resources, especially for long documents. We observe that existing token reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DOCPRUNE, a training-free document token pruning framework designed for efficient long document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG benchmark show that DOCPRUNE improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.

Abstract:
Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss will lead to a representation in which the two modalities are separated by a global gap vector that is orthogonal to the embeddings of both modalities. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when small, semantically inconsequential changes are made to the input. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss to clean accuracy.

Abstract:
Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP's original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise predictions. As a result, the quality of attention maps--particularly those generated in the final transformer layers--plays a pivotal role in guiding inter-region interactions. However, current methods generate suboptimal representations when capturing the complex spatial hierarchies in remote sensing. We address this gap by optimizing CLIP's 197x197 attention matrix through three key modifications: (1) substituting the 196x196 patch-to-patch submatrix with intermediate-layer feature similarities to preserve spatial structures; (2) prioritizing intermediate-layer attention for global-to-local (class-to-patch) token alignment to reduce classification interference; (3) disabling the \texttt [CLS] token's self-attention to mitigate bias. Extensive experiments on eight remote sensing benchmarks and two building/road extraction datasets demonstrate that our method achieves state-of-the-art performance among existing training-free approaches.

Abstract:
This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, significantly surpass state-of-the-art UDA baselines for 3D semantic segmentation.

Abstract:
Even when told to "do nothing," modern diffusion models subtly alter their output relative to the input they are supposed to preserve. We call this effect No-Op Drift. We introduce the Drift Kernel K_M(sigma), defined as the expected perceptual deviation induced when running a diffusion model at noise strength sigma under a null instruction. Using 120,000 baseline samples (30,000 per model across SD15, SD21, SDXL, and InstructPix2Pix) and 9,600 ablation samples (four strengths, null versus strict copy prompts), we show that variance-driven diffusion models follow a quadratic form K_M(sigma) approximately equal to k_M sigma^2 plus c_M, with aggregate R^2 equal to 0.97. We derive this scaling from first principles via a Taylor expansion of the decoder, yielding k_M equal to Tr(J_D J_D^T), which depends only on the decoder Jacobian and not on prompts. To validate mechanistic structure, we construct synthetic decoders that reproduce the two regimes seen in practice: quadratic variance-driven drift and flat, high-variance edit-driven drift. We show that prompt wording has negligible effect (less than 17 percent coefficient difference), proving that drift is structural and not prompt-induced. We release NoOp-Bench, a benchmark with 10,000 inputs and full code for reproducible kernel estimation. Additional proofs, ablations, LPIPS and CLIP metrics, and extended visualizations appear in the Supplementary Material.

Abstract:
The reliability of autonomous systems in real-world environments is mainly dependent on the robustness of their visual perception.Although recent studies have advanced the handling of visual degradations, physical contaminants that adhere to the camera lens--such as mud, water droplets, and condensation--remain largely underexplored.To this end, we introduce the CLP (Contaminated Lens Protector) dataset, a real-world benchmark designed to evaluate perception performance under realistic lens-protector contamination.The CLP dataset offers degraded images across multiple types of contamination and various lens-to-protector distances, along with dense semantic segmentation masks and aligned restoration targets.This dataset enables robust segmentation and restoration studies in conditions that closely match those encountered by real-world autonomous systems.Experiments analyze strategies to improve perception under contamination with limited data, highlighting the importance of domain generalization, foundation models, data scale, and joint restoration-segmentation pipelines.

Abstract:
Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset specific and task specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

Abstract:
Multimodal sentiment analysis requires integrating language, visual, and acoustic cues, yet these modalities are often noisy, incomplete, or contradictory, making fusion unreliable. Most existing methods assume uniformly trustworthy modalities and thus degrade when signals conflict. To address this, we propose CICA, a framework that couples Confidence-Aware Pretraining with Confidence-Informed Attention. In pretraining, each modality encoder learns to estimate the reliability of its own representation, producing both embeddings and confidence scores. These scores then guide a confidence-informed attention mechanism, which strengthens contributions from reliable modalities while suppressing noisy or conflicting ones, enabling adaptive fusion under varying signal conditions. CICA achieves state-of-the-art performance across four major benchmarks on MOSI, MOSEI, CH-SIMS, and CH-SIMSv2. It achieves MAE 0.630 and Corr 0.855 on MOSI, and MAE 0.489 and Corr 0.856 on MOSEI, significantly surpassing prior methods. Consistent improvements are also observed across Acc-7, Acc-2, and F1 metrics. Under noisy and missing-modality conditions, CICA maintains significantly more stable performance, indicating improved robustness and interpretability.

Abstract:
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods.

Abstract:
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

Abstract:
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

Abstract:
Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.

Abstract:
Automatically generating clear and accurate figures for research papers remains challenging, as it requires semantic understanding, precise structure, and visual aesthetics. Existing approaches struggle to balance fidelity and quality: large language model (LLM) code-based methods (e.g., SVG, Mermaid) are structured but inflexible, while image-generation models (e.g., GPT-Image-1, Nano Banana) produce hard-to-edit and often inaccurate figures. We present Paper2Figure, a dual multi-agent system with an interactive web platform for paper-to-figure generation. Generation Agents convert text into our designed FigScript language, encoding figure semantics, styles and layout. The web system renders the FigScript into an initial image, which Refinement Agents iteratively analyze to locate issues and revise the FigScript for improved logic, alignment, aesthetics and text accuracy. Crucially, users can further refine results through an intuitive web interface, ensuring full control over the final output. To evaluate Paper2Figure, we introduce Paper2Figure Bench, a benchmark comprising 100 academic figures with paired descriptions. Experiments demonstrate that Paper2Figure markedly improves accuracy by 12%, beauty by 13.5%, and completeness by 17.0% over state-of-the-art baselines in fully automatic generation without human adjustment. By combining automated generation with interactive edit, Paper2Figure bridges the gap between AI assistance and researcher control, offering a practical solution for high-quality academic figure creation.

Abstract:
Due to the inherent ambiguity of facial expressions and subjectivity in dataset labeling, learning with noisy labels remains a critical challenge in facial expression recognition (FER). The supervisory mechanism of teacher-student network offers a promising approach for noisy-labeled FER. However, this approach is prone to noise accumulation and gradual coupling between the teacher and student parameters during training. We propose an Optimal Teacher Pool-driven dynamic label Noise Suppression framework for facial expression recognition (OTP-NS). Specifically, we construct an optimal teacher pool architecture that dynamically maintains multiple best teacher models while fusing their predictions, thereby mitigating noise accumulation and coupling of teacher-student parameters via update mechanisms. Furthermore, we develop two sample-level noise suppression parts: (1) Similarity-Aware Label Smoothing (SALS), diverging from the static smoothing strength in traditional label smoothing, automatically modulates the smoothing strength for teacher model based on prediction-label similarity, achieving fine-grained noise suppression. (2) Confidence-Weighted Logits (CWL), adaptively adjusting the classification loss of student model based on sample-to-centroid confidence metrics, alleviates the detrimental effects of noisy samples on model training. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art approaches across various noise levels, validating the effectiveness of our proposed framework in learning robust representations from noisy data.

Abstract:
What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free, zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

Abstract:
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs.

Abstract:
Video object detection has gained notable progress with the advent of transformers. While transformers excel at modeling long-range contextual dependencies, the quadratic complexity limits their efficiency in long-sequence processing. In contrast, Mamba offers greater efficiency in modeling long sequences but tends to exhibit relatively limited contextual learning capability compared with transformers, and its application to video object detection remains unexplored. To harness the complementary strengths of transformers and Mamba, we propose a hybrid Transformer-Mamba network for video object detection (TMambaDet), a pioneering framework in this domain that combines the long-range modeling power of transformers with the efficient long-sequence processing capability of Mamba. Our TMambaDet is characterized by three core components: 1) a spatial adaptive deformable transformer encoder to effectively model the long-range dependencies within each frame, enabling intra-frame feature aggregation that substantially improves the spatial feature representations of objects; 2) a temporal cascaded bidirectional Mamba encoder to efficiently capture the long-range dependencies across frames in video sequences with linear complexity, enabling inter-frame feature aggregation that effectively enhances the temporal feature representations of objects; 3) a Mamba entangled transformer decoder to fully explore the interactions between object queries and spatial-temporal features, enabling fine-grained query-feature alignment that effectively enriches the instance-level representations of object queries. We conduct experiments on the ImageNet VID and EPIC-KITCHENS-55 datasets, showing that TMambaDet achieves state-of-the-art results.

Abstract:
Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels.However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose \underline T est-time \underline A ctivated \underline N egative \underline L abels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses in the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric.Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number.Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness.Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%.Codes will be released.

Abstract:
Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an auto-regressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

Abstract:
Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head (DGSH) that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy and temporal consistency, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.

Abstract:
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments---humans and different robots---are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation---a compact 3D "trace-space" of scene-level trajectories---that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation--trace--language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without heavy pixel-space generation.

Abstract:
With the rapid rise of Large Vision Language Models (LVLMs) for image understanding, the objective of image compression is gradually shifting from human visual perception to machine-oriented semantic understanding. However, conventional learned compression techniques are optimized for pixel-level fidelity and typically operate at fixed or rigid bitrate points, misaligned with the semantic consistency and flexible bitrate control. This gap becomes critical in ultra-low-bitrate regimes, where latent representations often ignore semantic relevance and struggle to disentangle meaningful content from redundant visual details as the bitrate varies. To address these challenges, we develop a token-based flexible compression framework, Token Flow Guided Compression (TFGC), which unifies human- and machine-oriented objectives. TFGC supports variable bitrate control in ultra-low bitrate regimes and enables LVLMs to directly process compressed token without image reconstruction. Specifically, we explore token flow phenomenon in 1D token sequences and exploit it to design token flow propagation, which predicts missing tokens by propagating contextual information from unmasked tokens. Moreover, token semantic guidance aligns compressed representations with LVLM semantic space, while a progressive semantic alignment training strategy further bridges the gap between perceptual reconstruction and semantic reasoning. Experiments show that our framework achieves state-of-the-art LVLMs understanding at comparable bitrates while maintaining satisfactory perceptual quality.

Abstract:
3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Eventually, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.

Abstract:
Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.

Abstract:
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost. The code will be open-source soon.

Abstract:
Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output ("prelim") tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.

Abstract:
Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality and speed tradeoff in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation.ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8xinference speedup without any quality degradation.

Abstract:
This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.

Abstract:
Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods. Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.

Abstract:
Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves competitive performance to bi-modal models and sets a new strong baseline for the tri-modal use case.

Abstract:
As the harm caused by fake news grows, the task of detecting and grounding multi-modal media manipulation (DGM4) is gaining more attention. Existing multimodal methods overlook fine-grained semantic alignment between visual and textual modalities, thereby limiting their ability to detect sophisticated and subtle cross-modal manipulations. To address this challenge, we present MaLSF, a novel Mask-aware Local Semantic Fusion framework that explicitly bridges words and pixels via mask-label pairs, enabling the model to perform precise reasoning over fine-grained cross-modal correspondences. MaLSF captures cross-modal local semantics through two key innovations: 1) A Bidirectional Cross-modal Verification Module (BCV) that identifies semantic conflicts between masked regions and associated labels via a bidirectional query mechanism; 2) A Hierarchical Semantic Aggregation (HSA) Module that adaptively aggregates multi-granularity local semantics into decoupled features for task-specific verification. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. The proposed model is evaluated on multiple datasets and achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.

Abstract:
We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

Abstract:
Low-light degradation hampers machine understanding at night. Existing methods either overfit labeled data (paired supervision) or specific distributions (unpaired supervision), resulting in poor generalization under unseen degradations. In this paper, we propose UniPrior, a unified prior-based low-light adaptation framework that integrates the general semantic prior embedded in vision foundation models (VFMs) with illumination-invariant priors, to capture both stable and changing semantics under varied low-light degradation without any real low-light training data. In detail, the illumination-invariant prior is used as an auxiliary input, and a parallel decoder reconstructs it as a regularization target, enforcing representation consistency and reducing feature drift. Such signal constancy enables us to build a VFM-aligned semantic space via a contrastive training strategy guided by VFM self-correlation maps, enriching features with high-level cues, thereby improving adaptation to diverse low-light conditions. Beyond high-level features, we also give a joint consideration of such unified prior and low-level signal space through our machine-oriented enhancement scheme. We extend the signal prior to handle overexposure and inject VFM-guided semantic cues into the enhancement process via a CLIP-based loss. This coupling of semantic alignment and pixel correction enables sample-adaptive optimization to improve performance. Extensive experiments on multiple low-light tasks demonstrate our method's superiority and practical utility.

Abstract:
Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We continue pretraining V-JEPA, a large-scale video model, on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets - ChimpACT, PanAf500, BaboonLand, and ChimpBehave - our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate for the first time that domain-level pretraining, where pretraining is conducted on similar data but not the target dataset itself, works for video models. Our primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Dataset, code, and models are available at https://privi.eckerlab.org.

Abstract:
Radiology report generation (RRG) aims to automatically describe medical images via free-text reports. In clinical practice, comparing current and prior chest X-rays is essential for assessing disease progression, motivating the development of longitudinal RRG methods. However, most existing approaches often struggle to capture fine-grained temporal changes, as they often rely on unidirectional alignments or static reasoning pipelines, overlooking the bidirectional and asymmetric nature of disease evolution. To tackle these challenges, we propose BiOTPrompt, a novel framework for disease evolution-aware radiology report generation, which introduces a Bidirectional Optimal Transport (BiOT) mechanism to explicitly model progression dynamics between historical and current chest X-rays. By analyzing the asymmetry between bidirectional transport plans, BiOTPrompt can identify newly emerged and resolved regions, which are then used to construct dynamic prompts that guide large language models (LLMs) in generating clinically relevant diagnostic reports. Furthermore, we incorporate a vision-language consistency constraint to ensure alignment between visual evidence and textual descriptions, mitigating hallucinations and enhancing factual correctness. Extensive experiments on the Longitudinal-MIMIC dataset demonstrate that BiOTPrompt achieves state-of-the-art performance in both language metrics and clinical relevance, setting a new standard for longitudinal radiology report generation.

Abstract:
Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency-fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78x on average while maintaining neural-field fidelity and using 10x fewer primitives. Adding GroupFlow culminates in 13.71x faster rendering and 2.53x shorter training, surpassing all baselines in speed while preserving superior image quality.

Abstract:
Many audio-visual learning methods have focused on aligning audio and visual information, either through semantic or temporal correspondence. However, most of these works have utilized monaural audio, which does not contain information about the spatial location of the sound source. In contrast, humans and other animals utilize binaural hearing to perceive this spatial information. Combining spatial sound and visual perception enables powerful high-level reasoning: for example, a person looking for their phone may hear the ringing sound coming from a backpack sitting on a table, and quickly infer that the missing phone is inside the backpack. In this paper, we investigate the problem of Audio-Visual Spatial Reasoning. We design a spatial audio-visual question answering dataset to cover scenarios where semantic correspondence between audio and visual signals is absent but spatial alignment exists, as well as cases with multiple audio-visual semantic correspondences that require spatial reasoning to disambiguate. We propose a model that learns spatial comprehension across the audio and vision modalities by connecting them with a large language model and experimentally demonstrate that spatial sound perception is an essential part of our task.

Abstract:
We present QuCNet, a hybrid quantum classical network for efficient remote sensing image classification. QuCNet integrates a lightweight convolutional encoder with sixteen parallel four-qubit trainable quantum circuits (TQCs) trained under a Hybrid Cyclic Weight-Sharing (HCWS) strategy. This design enhances expressibility while keeping the parameter count extremely low 87K, 85x smaller than prior hybrid models). Guided by expressibility analysis, the proposed quantum configuration maintains stable gradients and mitigates barren plateaus on near term quantum devices. Extensive experiments across seven remote sensing benchmarks (AID, AIDER, UC Merced, NWPU-45, EuroSAT, IIITDMJ Smoke, and USTC SmokeRS) demonstrate that QuCNet consistently improves accuracy and generalization over classical CNN baselines. Furthermore, hardware only inference on IBM Quantum processors (ibm_torino, ibm_fez) confirms robustness under realistic noise and connectivity constraints. These results suggest a practical path toward scalable, hardware feasible quantum deep learning for geospatial applications.

Abstract:
We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

Abstract:
Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production.However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use.We propose B^3-Seg (Beta--Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions.Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG).This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy (1-1/e) approximation to the optimal view sampling policy.Experiments on multiple datasets show that B^3-Seg achieves competitive results to high-cost supervised methods while operating end-to-end segmentation within a few seconds.The results demonstrate that B^3-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.

Abstract:
Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.

Abstract:
Vision-Language-Action (VLA) models have enabled notable progress in general-purpose robotic manipulation, yet their learned policies often exhibit variable execution quality. We attribute this variability to the mixed-quality nature of human demonstrations, where the implicit principles that govern how actions should be carried out are only partially satisfied. To address this challenge, we introduce the LIBERO-Elegant benchmark with explicit criteria for evaluating execution quality. Using these criteria, we develop a decoupled refinement framework that improves execution quality without modifying or retraining the base VLA policy. We formalize Elegant Execution as the satisfaction of Implicit Task Constraints (ITCs) and train an Elegance Critic via offline Calibrated Q-Learning to estimate the expected quality of candidate actions. At inference time, a Just-in-Time Intervention (JITI) mechanism monitors critic confidence and intervenes only at decision-critical moments, providing selective, on-demand refinement. Experiments on LIBERO-Elegant and real-world manipulation tasks show that the learned Elegance Critic substantially improves execution quality, even on unseen tasks. The proposed model enables robotic control that values not only whether tasks succeed, but also how they are performed.

Abstract:
Cloud-Edge Continual Test-Time Adaptation (CTTA)--with edge devices processing real-time data and the cloud offering strong computing power--is a critical paradigm for models that adapt to dynamic data distributions in real-world scenarios. However, most existing frameworks assume architectural homogeneity between cloud and edge CNNs, which poses a significant performance bottleneck, particularly given the rapid emergence of Transformer-based models. Current methods fail to bridge this architectural gap, resulting in significant deficiencies in adaptation accuracy and practical applicability. To address this, we propose a novel Cross-Architecture Adaptation (CAA) framework for heterogeneous Cloud-Edge CTTA that enables effective adaptation to shifting data distributions. Specifically, CAA deploys a large Transformer-based teacher model on the cloud for robust feature extraction and prediction, and a lightweight CNN-based student model on edge devices to fit resource constraints. Based on such cloud-edge models, a synergistic edge-to-cloud communication strategy, Multi-criteria Dynamic Cross-domain Sampling, ensures only the most informative, class-balanced samples are uploaded, minimizing communication costs while guaranteeing stable, unbiased adaptation. Moreover, a Multi-level Adaptive Heterogeneous Distillation module is proposed to facilitate effective knowledge transfer across the architecturally disparate models, and improve the learning efficiency of the edge one. Experiments on several benchmarks demonstrate that CAA achieves state-of-the-art performance with low edge resource consumption and minimal edge-to-cloud communication overhead.

Abstract:
Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization. While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments. Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity. In this work, we propose CoLoR, a novel training framework tailored for large-scale SCR. First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points. Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce feature consistency. Our method achieves state-of-the-art performance on multiple challenging large-scale datasets and significantly narrows the accuracy gap with classical feature matching based approaches while retaining a compact map size.

Abstract:
Weakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training.Although the paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe the full videos, resulting in the introduction of substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to misclassifications and compromising the precision of anomaly localization.To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples, to reduce the interference of WLI in anomaly detection.Moreover, considering that anomalies typically occur in dynamic foreground regions, we further design a motion-aware feature enhancement module that extracts dynamic areas within each video segment to emphasize the representation of critical features.This not only improves the accuracy of anchors in triplets, but also enhances the discriminative power of instance features in MIL. Extensive experiments on UCF-Crime, XD-Violence, and MSAD datasets demonstrate the effectiveness of our approach.

Abstract:
Large Vision-Language Models (LVLMs) can reason from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.

Abstract:
Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages:(1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness.(2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation.Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.

Abstract:
Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

Abstract:
Controllable hand image generation aims to synthesize geometrically accurate images with consistent appearance. Recently, diffusion models have been widely applied for hand image synthesis. However, through input-level fusion or feature-level modulation, existing methods inject control signals with fixed strength across all timesteps, ignoring the progressive nature of the denoising process. In this paper, we reveal that the modulation of control signals depends on the denoising state and condition complexity. Due to distinct semantic distributions and information densities, achieving effective interaction among these heterogeneous representations remains challenging. To address this, we propose a Temporal and Content Co-Awareness Latent Diffusion method that introduces a dual-driven modulation strategy. Specifically, we design a query-based interaction mechanism to mitigate information redundancy and align semantic distributions. Leveraging cross-domain interaction, the model infers required control information to dynamically adjust pose and appearance injection strengths. Furthermore, we design a Pose-Invariant Appearance Encoder that captures both global appearance consistency and local texture details. Extensive experiments validate our superiority over state-of-the-art.

Abstract:
3D Gaussian Splatting (3DGS) has become a crucial rendering technique for many real-time applications. How- ever, the limited hardware resources on today's mobile platforms hinder these applications, as they struggle to achieve real-time performance. In this paper, we propose SEELE, a general framework designed to accelerate the 3DGS pipeline for resource-constrained mobile devices. Specifically, we propose two GPU-oriented techniques: hybrid preprocessing and contribution-aware rasterization. Hybrid preprocessing alleviates the GPU compute and memory pressure by reducing the number of irrelevant Gaussians during rendering. The key is to combine our view-dependent scene representation with online filtering. Meanwhile, contribution-aware rasterization improves the GPU utilization at the rasterization stage by prioritizing Gaussians with high contributions while reducing computations for those with low contributions. Both techniques can be seamlessly integrated into existing 3DGS pipelines with minimal fine-tuning. Collectively, our framework achieves up to 6.3x speedup and 39.1% model reduction while achieving superior rendering quality compared to existing methods. Our codes will be released upon publication.

Abstract:
Out-of-distribution (OOD) detection is a key requirement for reliable deployment in open-world environments, where a model must recognize inputs that fall outside the semantic scope of known concepts. While recent advances in vision-language models (VLMs) have achieved strong results in image-level OOD detection, most methods still assume that each image contains a single dominant object. This assumption severely limits their applicability to real-world settings where scenes are naturally composed of multiple objects that each demand independent OOD assessment. Existing object-level approaches, including the current SOTA method RUNA, remain constrained by coarse global representations and insufficient modeling of contextual dependencies between objects and their backgrounds. We propose UNI-OOD, a unified framework that performs both object- and image-level OOD detection within a single VLM, without requiring prior knowledge of which task is being addressed at inference time. The key idea is to leverage cross-context attentive modeling that captures complementary visual and textual semantics. UNI-OOD learns to attend to fine-grained spatial details within each object, aligns visual and linguistic embeddings to strengthen semantic correspondence, and models interactions between target objects and their surrounding context. By jointly reasoning over object-centric and background cues, the framework disentangles informative visual evidence from spurious correlations and enables a consistent OOD scoring mechanism across different visual granularities. Extensive experiments on object- and image-level benchmarks demonstrate that UNI-OOD achieves substantial and consistent improvements over previous approaches, establishing new SOTA performance in both object-level and image-level OOD detection.

Abstract:
We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.

Abstract:
Current collaborative perception systems have significantly improved 3D object detection performance. However, widely used LiDAR and camera systems often suffer performance degradation under adverse weather conditions. The weather-robust 4D radar provides a promising solution to address this challenge. Nevertheless, the effective fusion of sparse 4D radar measurements with degraded LiDAR data remains a significant challenge due to cross-modal corruption and information loss. In this work, we propose a novel hybrid robust collaborative perception framework (HRCP), designed to improve the collaborative perception performance under adverse weather conditions through LiDAR-4D radar fusion. Specifically, we introduce a hybrid collaboration strategy that considers their distinct physical properties and differently processes them during information transmission. Additionally, we propose a bidirectional cross-modal gating (BCMG) module that enables LiDAR and 4D radar to mutually validate feature reliability, ensuring consistent cross-modal representation, and an adaptive feature enhancement (AFE) module that enables comprehensive refinement of degraded and suppressed regions to mitigate information loss. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art approaches under adverse weather conditions.

Abstract:
Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. The two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.

Abstract:
Vision-Language Navigation (VLN) aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a close-set assumption, i.e., training and test data share the same style of the input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for close-set methods.To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent intructions.Towards this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings could generate stable policies, which strengthen their adaptation for open world. Inspired by this idea, we propose the slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework.The fast-reasoning module, an end-to-end strategy network, outputs actions via real-time input. It accumulates execution records in a history repository to build memory.The slow-reasoning module analyze the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction. By leveraging the experiences from slow reasoning, it continually improves the accuracy and generalization ability of fast decisions. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios. Extensive experiments demonstrate the superiorities of our method.

Abstract:
Recent advances in diffusion models have enabled high-fidelity, identity-preserving image generation for personalized applications such as digital avatars and virtual try-on systems. However, their reliance on sensitive facial reference images raises growing privacy concerns. Existing defense mechanisms are primarily designed for training-based personalization and struggle to generalize to emerging training-free approaches, due to fundamental differences in their identity integration paradigms. To bridge this gap, we propose IDGuardian--the first generalizable and model-agnostic identity protection framework capable of defending against both training-based and training-free personalization methods. IDGuardian abstracts the personalization process into two critical stages: identity extraction and identity injection. It then introduces crafted adversarial perturbations to simultaneously disrupt both stages. Specifically, it degrades the identity features extracted by external encoders and establishes an adversarial conceptual bridge that misdirects the generative trajectory away from the target identity. Extensive experiments show that IDGuardian effectively protects identity across various personalization pipelines and model architectures, while remaining robust to post-processing, adaptive attacks, and cross-dataset generalization.

Abstract:
Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.

Abstract:
High-quality segmentations are critical in vision tasks where boundary accuracy is important (e.g., medical diagnostics, quality control, etc.). Recently, promptable vision models have emerged as effective backbones for segmentation refinement frameworks. However, their performance not only hinges on prompt quality, they also must overcome noisy input masks and semantically ambiguous outputs from promptable models. Existing prompt-based refiners rely on fixed prompt rules, making them brittle to changing failure modes and new tasks or domains. We propose PromptMoE, a model-agnostic MoE-driven prompting refiner effective in segmentation refinement across tasks and domains. PromptMoE features three collaborative modules to refine an initial mask: our MoE-based Image-Informed Prompting framework (IIP) takes an image and coarse mask and produces a set of expert score maps to guide prompt generation, the Dynamic Expert Selector (DES) activates only the most relevant experts and fuses their maps to avoid dense evaluation and signal dilution, and the Prompt-Placement Explorer (PPE) explores the fused guidance map to place high-confidence spatially diverse point prompts. Across five benchmark datasets (BIG, VOC, DAVIS585, ECSSD, MSRA-B), PromptMoE achieves statistically significant gains over SOTA methods CascadePSP, SegRefiner, and SAMRefiner on semantic, instance, and salient tasks, with mean improvements of +6.24 IoU / +8.99 BIoU.

Abstract:
Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. GUIDE defines three tasks--(i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.

Abstract:
Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves state-of-the-art results in photorealistic relighting with physically-consistent shadows, as well as high-fidelity 3D reconstruction and intrinsic estimation from a single image. Project page: https://yuxuan-xue.com/georelight

Abstract:
4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction network. Code and data will be released.

Abstract:
Multimodal sentiment analysis (MSA) aims to identify human emotions through multimodal data. Despite considerable advances in MSA, we find that emotional class centers often overlap when integrating data from different modalities into the same representation space. In this paper, we propose a novel Multi-Metric Representation learning strategy based on clustering (MMRest) to alleviate this issue through flexible multi-metric representation learning, enabling the model to learn fine-grained sentiments. Specifically, we first design a module termed Multi-metric Multimodal learning on Clusters (MMC), which minimizes distances within similar sentiment pairs while maximizing dissimilar ones, aiming to learn a global metric and local metrics in each cluster from multimodal data. Afterwards, we develop a Projection and Decision-Level Fusion (PDLF) module, including two parts. One part utilizes the optimal global and local metrics to obtain a projection value. The other part combines the projection value with an intermediate score which is obtained through the fusion of unimodal and multimodal representations to obtain the final sentiment prediction score. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method is significantly superior to state-of-the-art methods on various evaluation indicators and parameter count, by effectively learning fine-grained emotional boundaries. The code will be made open-source if the paper is accepted.

Abstract:
Audio-visual large language models (AVLLMs) have made significant strides in understanding visual and auditory content. However, their ability to capture fine-grained temporal relationships between audio and visual streams remains insufficiently evaluated. To address this, we introduce FAVE (Fine-grained Audio-Visual Temporal Evaluation), a comprehensive benchmark targeting three core dimensions of temporal perception: cross-modal temporal alignment (FAVE-align), event temporal relationship (FAVE-low), and detailed moment captioning (FAVE-high). To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. Extensive experiments on twelve state-of-the-art multimodal LLMs, both open-source and closed-source, reveal key limitations in multimodal integration, temporal relationship and timestamp localization, especially for joint audio-visual tasks. These findings highlight the need for better temporal modeling to improve AVLLMs' understanding of real-world video content. FAVE serves as a rigorous testbed for advancing temporally aware multimodal systems, and will be publicly released upon acceptance.

Abstract:
Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. To achieve a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis, as well as the corresponding training and inference setting. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.

Abstract:
Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures.However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire.Hence, we propose the PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets.Specifically, our PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation.Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation.Our on-the-fly supervision ensures the model to learn temporally consistent and dynamic transformations.Extensive experiments demonstrate that our PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.

Abstract:
Vision-based autonomous driving has gained much attention due to its low costs and excellent performance. Compared with dense BEV (Bird's Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments in SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.

Abstract:
Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset designed for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess such answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the VinQA training split substantially improves their performance and markedly narrows this gap. Modality Encoding is initially more robust than Page Encoding for complex documents with long text, many visual elements, and diverse visual citation requirements. After training on VinQA, however, Page Encoding reaches a comparable performance level, showing that it can compete effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

Abstract:
Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios.

Abstract:
Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models (LLMs), uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning--capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.

Abstract:
Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporal aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporal controllable Foley generation, and professional mixing capabilities. Technically, FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatiotemporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate film industry-grade post-production practices. To address the lack of high-quality stere Foley datasets in film, we introduce FilmStereo, the first professional stereo Foley dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For application, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with integration validated in film industrial-grade workflows.

Abstract:
Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. Despite its effectiveness, conventional FL faces scalability challenges due to excessive computational and communication demands placed on a single central server as the number of participating devices grows. Hierarchical Federated Learning (HFL) addresses these issues by distributing model aggregation tasks across intermediate nodes (stations), thereby enhancing system scalability and robustness against single points of failure. However, HFL still suffers from a critical yet often overlooked limitation: domain shift, where data distributions vary significantly across different clients and stations, reducing model performance on unseen target domains. While Federated Domain Generalization (FedDG) methods have emerged to improve robustness to domain shifts, their integration into HFL frameworks remains largely unexplored. In this paper, we formally introduce Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures. Specifically, we propose HFedATM, a hierarchical aggregation method that first aligns the convolutional filters of models from different stations through Filter-wise Optimal Transport Alignment and subsequently merges aligned models using a Shrinkage-aware Regularized Mean Aggregation. Our extensive experimental evaluations demonstrate that HFedATM significantly boosts the performance of existing FedDG baselines across multiple datasets and maintains computational and communication efficiency. Moreover, theoretical analyses indicate that HFedATM achieves tighter generalization error bounds compared to standard hierarchical averaging, resulting in faster convergence and stable training behavior.

Abstract:
Transformers are remarkably versatile, suggesting the existence of generic inductive biases beneficial across modalities. In this work, we explore a new way to instil such biases in vision transformers (ViTs) through pretraining on procedurally generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1K, for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1K data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.

Abstract:
Open-vocabulary detectors (OvOD) inherit tightly coupled cross-modal knowledge from web-scale pretraining, creating privacy, copyright, and compliance risks. Existing machine unlearning methods face geometric entanglement interference in OvOD: forgetting updates inevitably distort preserved knowledge due to shared semantic factors in decomposable embeddings. We introduce SafeDetect, a geometrically constrained unlearning framework that constructs a null-space from preserved knowledge embeddings offline, then constrains parameter updates to this orthogonal complement, mathematically preventing interference with retained concepts. Forgetting is achieved through a one-step mean-flow objective that drives forgotten concepts toward non-detectable, while multimodal decoupling prevents cross-modal recovery. We establish UOD-Bench, the first unified benchmark for OvOD unlearning, featuring 14.7K images with 67.3K region-phrase pairs across three tasks. Extensive experiments across UOD-Bench and standard benchmarks with diverse architectures (e.g., GroundingDINO, LLM-Det) demonstrate that SafeDetect achieves superior forgetting efficacy (64.75% improvement over NPO) while maintaining stable retention performance and significantly better zero-shot generalization, with 1.5x faster convergence than iterative methods.

Abstract:
Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatial-temporal inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping, often introduce visual artifacts, fail to generalize, or incur high computational costs. We introduce WorldForge, a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply guidance, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided, reference denoising path, effectively neutralizing artifacts caused by misaligned structural inputs. Together, these components inject precise, trajectory-aligned control without model retraining, achieving accurate motion guidance and photorealistic synthesis. As a plug-and-play, model-agnostic solution, WorldForge demonstrates highly versatile generalizability. Beyond robust zero-shot 3D/4D generation, it readily empowers over a dozen diverse downstream applications, seamlessly enabling tasks like video editing, stabilization, and virtual try-on. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and inference-only baselines.

Abstract:
As publicly available data dwindles, synthetic data generation (SDG) has become a practical solution for privacy-preserving data sharing. By training generative models on private data, SDG creates samples that retain task-relevant features while obfuscating sensitive content. However, recent work shows that synthetic data can still leak private information via membership inference and reconstruction attacks. Existing defenses often degrade downstream utility. To address the privacy-utility trade-off, we formulate SDG as a bi-objective optimization problem. Yet, intractable gradients and expensive subset evaluation pose major challenges. We address this via alternate optimization over the generative model and data selection parameter, and further recast the selection step as a discrete-time optimal control problem, solved using Pontryagin's Maximum Principle. We propose PrivSynth, a framework that quantifies multiple privacy risks and integrates it into the control objective. Theoretical analysis guarantees convergence, and experiments on benchmark and medical datasets show that PrivSynth achieves better utility and stronger privacy protection than state-of-the-art methods.

Abstract:
We introduce Robowheel, a data engine that converts human hand-object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB/RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand-object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted to cross-embodiments, robot arms with simple end-effectors, dexterous hands, and humanoids, yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. The entire data pipeline forms an end-to-end pipeline from video - reconstruction - retargeting - augmentation - data acquisition.We validate the data on mainstream vision-language-action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, Robowheel is lightweight: a single monocular RGB(D) camera is sufficient to extract a universal, embodiment-agnostic motion representation that could be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.

Abstract:
Multimodal Emotion Analysis (MEA) is crucial for human-centric AI, yet current methods struggle with two core challenges: the sparse nature of emotional cues across modalities and their inherent temporal asynchrony. Existing approaches, which often rely on implicit fusion, consequently suffer from diluted salient features and entangled representations. To address this issue, we propose EmoThinker, a new framework that advances MEA through explicit, structured reasoning. Our method introduces a structural token selection mechanism to concentrate on pivotal facial regions while refining background context, enhancing visual saliency and efficiency. For audio, an audio evidence extractor aggregates critical paralinguistic features into compact, emotion-rich tokens. More importantly, we enable step-by-step reasoning by constructing a Chain-of-Emotion-Thought dataset, which provides fine-grained annotations for disentangling asynchronous cues and resolving inter-modal conflicts. By decoupling evidence acquisition from reasoning, EmoThinker achieves a more interpretable and robust emotion analysis. Extensive experiments on multiple benchmarks demonstrate that our framework achieves new state-of-the-art performance.

Abstract:
Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3 % over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.

Abstract:
Continual Visual Question Answering (Continual VQA) poses unique challenges for multimodal continual learning, requiring models to incrementally acquire new knowledge while preserving visual-semantic grounding across tasks. However, existing benchmarks hinder fair and robust evaluation of such capabilities, as they allow models to exploit dataset biases rather than demonstrate genuine continual reasoning. We identify two structural flaws in current benchmark design. First, shared answer vocabularies across tasks encourage answer memorization, inflating performance and underestimating forgetting. Second, static answer priors within each task make the training and test answer distributions nearly identical, obscuring robustness under distribution shifts. To address these issues, we introduce UCo-VQA, an Unbiased benchmark suite that enforces token-level disjoint answer spaces across tasks and introduces intra-task train-test distribution shifts, enabling fairer assessment of forgetting and generalization in multimodal continual learning. We further provide a parameter-efficient baseline that mitigates forgetting and enhances grounding through question-only replay and dual-level distillation, offering a lightweight and memory-efficient framework for continual adaptation. Extensive experiments on UCo-VQA reveal that prior methods substantially overestimate performance under biased setups, while our approach achieves state-of-the-art results, improving robustness and retention by up to 4.18% and 2.21%, respectively.

Abstract:
Recent Large Vision-Language Models (LVLMs) have shown impressive capabilities in multimodal understanding and generation. Despite this progress, they remain prone to hallucination, where model outputs conflict with the visual input due to an over-reliance on textual priors. Existing inference-time mitigation approaches frequently depend on multi-pass or contrastive decoding, which increases latency and limits their applicability in real-time settings. To address this limitation, we propose CausalLens, a training-free and single-pass intervention that directly adjusts the decoder hidden states to strengthen visual grounding. By decomposing attention heads into visual, textual, and system prompt pathways, CausalLens identifies visually reliable heads using a sensitivity measure and selectively adjusts their mid-layer hidden-state contributions. A projection-aligned correction further stabilizes these adjusted states after multi-head fusion, ensuring that the enhanced visual information is preserved throughout decoding. Extensive experiments across multiple hallucination benchmarks and LVLM architectures demonstrate that CausalLens consistently improves visual fidelity while adding negligible computational overhead. The method requires no fine-tuning or architectural changes, making it well-suited for practical, latency-sensitive applications.

Abstract:
Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

Abstract:
Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions.To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.

Abstract:
Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories.In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds.Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank.These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction.Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training.Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.

Abstract:
This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency in text-only tasks more than the vision LLMs'. We believe that one cannot make vision LLMs understand our visual world hierarchically until LLMs possess corresponding taxonomy knowledge.

Abstract:
Fine-tuning large-scale Vision-Language Models (VLMs) is crucial for specialized tasks, but their performance is often undermined by the label noise prevalent in real-world datasets. Traditional approaches to learning with noisy labels typically rely on a self-referential loop, using a model's own predictions to correct errors. While recent VLM-specific methods have begun to leverage cross-modal information to aid in noise detection, we explore an alternative direction: using the text modality not just to identify noise, but to establish a source of ground truth that is fully independent of the training data's potentially corrupt labels. To this end, we propose Text-ANchored Guided Optimization (TANGO), a framework centered on "semantic anchors"--a set of pure, immutable reference points generated from diverse text descriptions. Building upon these anchors, our approach reframes two key aspects of learning with noisy labels. First, we replace the conventional linear classifier with a parameter-free Text-Anchored Classifier, making predictions a direct, weighted consensus of the clean anchors. Second, we introduce an Anchor-Guided Refinement mechanism that validates each sample's given label against this external ground truth, providing a more robust signal for sample selection and label correction. Extensive experiments demonstrate that this approach achieves competitive and often state-of-the-art performance. Our code will be publicly available.

Abstract:
Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition.

Abstract:
Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network (CFANet), which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text-aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporates the ground-view agent as a bridge in text-aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method. The code and dataset will be publicly released.

Abstract:
Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer. Our project page is: https://ethan- li123.github.io/FlexiMMT_page/

Abstract:
Positron Emission Tomography (PET) can be used for the early diagnosis of various brain disorders. However, the annotation of PET scans requires the involvement of specialized nuclear medicine experts, making accurately annotated PET data extremely scarce. MRI-based cross-modal domain adaptation methods can improve the brain disorder classification accuracy with sparsely labeled PET data. However, existing methods fail to balance the core requirements of domain discrepancy elimination and modality-specific discriminative information retention in cross-modal tasks. Forced alignment often undermines the core pathological discriminative features of both modalities, making it difficult to meet the collaborative optimization demands of cross-modal brain disorder classification. To address this, we propose a Dual-Stream feature Disentanglement and Alignment (DSDA) framework designed for collaborative optimization of cross-modal domain adaptation and brain disorder classification. This framework first dynamically evaluates and explicitly decouples the critical brain regions relevant to the classification task from the non-critical regions that preserve brain structural integrity. It then applies differential processing to the two types of brain regions: topology-weighted feature alignment for non-critical regions and high-confidence feature fusion for critical regions. This differential processing ensures that the model effectively aligns features while preserving key discriminative information. Extensive experimental results on various datasets (e.g., ADNI, AIBL, and PPMI) demonstrate the effectiveness of DSDA which helps achieve the state-of-the-art performance.

Abstract:
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches its performance without training on or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during ongoing RL training and then uses the resulting merged model as a "free" teacher to guide subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and maintains stable training. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.

Abstract:
Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, motion inpainting, and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.

Abstract:
We introduce RAVEN, a deep learning architecture for processing frequency-modulated continuous-wave (FMCW) radar data that is designed for high computational efficiency. RAVEN reduces computation by using a learnable antenna mixer module on independent receiver state space encoders (SSM) to compress the virtual MIMO array into a compact set of learned features and by performing per-chirp inference with a calibrated early-exit rule, so the model reaches a decision using only a subset of chirps in a radar frame. These design choices yield up to 170x lower computation and 4x lower end-to-end latency than conventional frame-based radar backbones, while achieving state-of-the-art detection and BEV free-space segmentation performance on automotive radar datasets.

Abstract:
Open-Vocabulary Human-Object Interaction (OV-HOI) detection aims to recognize novel HOI categories beyond the training set. Existing OV-HOI detection approaches typically leverage CLIP to extract global visual representations and perform cross-attention between learnable queries and global features to localize human-object pairs. However, such one-stage paradigms tend to overfit seen interactions, limiting their generalization to unseen categories, while the coarse spatial awareness of CLIP also hinders the localization of fine-grained interaction cues. To address these issues, we propose a novel Semantic-Diversified and Interaction-Focused framework (SD-IF), which integrates reinforcement-guided adaptive optimization to jointly enhance semantic generalization and spatial discrimination. Specifically, we introduce a Semantic Diversification (SD) module that applies reinforcement-driven stochastic semantic perturbations and dual-level semantic exploration, expanding the semantic coverage of queries while maintaining visual coherence and effectively encouraging exploration beyond the seen semantic clusters. Furthermore, we design an Interaction Focusing (IF) module that formulates an actor-critic optimization scheme to adaptively refine attention distributions based on detection features and interaction representations, guided by a hybrid reward combining spatial focusing and semantic consistency. Extensive experiments on two standard benchmarks demonstrate the effectiveness of our SD-IF.

Abstract:
Federated Domain Generalization for Person Re-Identification (FedDG-ReID) aims to learn domain-invariant representations from decentralized data. Although Vision Transformers (ViTs) are widely adopted, their global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints---a challenge further amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), which introduces learnable visual prompts to explicitly guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) that divides prompts into two groups: Holistic Full Body Prompts suppress cross-client background noise, while Body Part Alignment Prompts capture fine-grained details robust to pose and viewpoint variations. To mitigate the high communication cost of large Transformer models, we further design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code will be released.

Abstract:
Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

Abstract:
Active 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose AREA3D, an active reconstruction agent for 3D reconstruction by leveraging feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that guides exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replica and OmniObject3D) demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, especially in sparse views.

Abstract:
Hateful meme classification aims to identify memes containing hateful content and has become increasingly important in the era of social media dominance. Large multimodal models (LMMs) have significantly enhanced the understanding of multimodal content, advancing this field. However, cognitive biases in LMMs can impede effective collaboration among models. To address this issue, we introduce GECO, a Game-theoretic multi-agEnt Collaboration framewOrk that organizes multiple LMMs into interacting agents and employs game-theoretic principles to guide them toward an optimal cooperative equilibrium. GECO integrates a mixed bonus scheme, incorporating both individual accuracy and cross-model agreement to promote convergence toward a consistent cooperative solution. In addition, we implement efficient policy learning and introduce a penalty coefficient to optimize the framework effectively and ensure training stability. Extensive experiments on five publicly available benchmarks demonstrate that our framework achieves new state-of-the-art performance.

Abstract:
Pre-training on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such global modeling is tightly coupled with the morphology, hindering transfer across domains. In contrast, despite the vast disparity in global motions, the local components exhibit similar motion patterns across different agents. Building on this insight, we propose a novel Deconstruct-Recompose Paradigm (DRP) for learning transferable local motion representations. Specifically, in the Deconstruct phase, we identify multiple local points and track their frame-wise motions, defining each as an Atomic Action. We introduce a Dual-Attention Encoder (DAE) to learn local motion representations from these Atomic Actions, capturing their spatiotemporal relationships. In the Recompose phase, we compose local motion representations with a learnable Motion Aggregation Token '[MAT]' via latent dynamics model learning. Additionally, an adapter bridges local motion and downstream action-specific dynamics to accelerate policy learning. Extensive experiments demonstrate that our method effectively transfers to diverse robotic control and manipulation tasks, significantly improving sample efficiency and performance.

Abstract:
Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking.Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.

Abstract:
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.

Abstract:
Handwritten mathematical expression recognition is hindered by a fundamental misalignment between the dual representations of LaTeX formulas: the symbolic text and the rendered visual image. This discrepancy means that textually distinct LaTeX sequences can produce visually identical outputs, while minor textual errors can cause catastrophic rendering failures. As a result, text-level reward mechanisms cannot perfectly assess the quality of model predictions, failing to effectively guide the model towards optimal performance during training. To overcome this limitation, we introduce the Image Matching Score (IMS), a lightweight yet effective reward based on the structural edit distance of column-wise image projections, which robustly quantifies the visual fidelity between rendered formulas. Leveraging IMS, we then propose Image-Matching driven Policy Optimization (IMPO), a training framework built upon Group Relative Policy Optimization (GRPO). This approach facilitates stable policy learning directly from our sequence-level visual reward, notably without the need for a separate value function network. Extensive experiments demonstrate that IMPO yields consistent performance gains across various backbone models on the challenging CROHME, HME100K, and M2E datasets. Our framework establishes new state-of-the-art results, improving the Expression Recognition Rate by an average of 1.1% and up to 1.37% over strong prior methods.

Abstract:
Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The codes are released on GitHub.

Abstract:
Robust 3D semantic occupancy is essential for legged and humanoid robots, yet most Semantic Scene Completion (SSC) systems are built for wheeled platforms with forward-facing sensors. We present OneOcc, a vision-only panoramic SSC framework tailored to severe body jitter and 360deg continuity. OneOcc integrates four complementary modules: (i) Dual-Projection fusion (DP-ER), which jointly exploits the raw annular panorama and its equirectangular unfolding to preserve true 360deg continuity while enabling grid-aligned feature extraction and seam-aware context; (ii) Bi-Grid Voxelization (BGV), which reasons in Cartesian and polar/cylindrical voxel spaces to reduce discretization bias and better align with panoramic geometry, yielding sharper free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D fusion that dynamically routes multi-scale 3D features to specialized experts, improving long-range context and occlusion handling; and (iv) a plug-and-play Gait Displacement Compensation (GDC) module that learns feature-level motion correction from gait, stabilizing representations without extra sensors. We also release two panoramic occupancy benchmarks: QuadOcc (real quadruped, first-person 360deg) and Human360Occ (H3O) (CARLA human-ego 360deg with RGB/Depth/semantic-occupancy and standardized within-/cross-city splits). OneOcc sets new SOTA: on QuadOcc it exceeds strong vision baselines and even popular LiDAR methods, and on H3O it improves within-city by +3.83 mIoU and cross-city by +8.08. The modules are lightweight, enabling deployable full-surround semantic perception for legged and humanoid robots.

Abstract:
World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning.Recent approaches leverage world models as learned simulators, but its application to decision-time planning remains computationally prohibitive for real-time control.A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive.To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning.An action-conditioned world model that occupies CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.

Abstract:
While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.

Abstract:
Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses.Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges.However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations.Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance.In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UNICAC, a large-scale benchmark for photographic cameras constructed via automatic optical design.The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation.Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects.We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations.Benchmarks, codes, and Zemax files will be available.

Abstract:
Reliable navigation and decision-making of autonomous vehicles require both accurate localization and object detection. Traditionally, these two tasks are handled separately, leading to redundant computation and limited cross-task knowledge transfer. This paper proposes TACO, the first Task-Aware COntrastive learning framework, which performs joint LiDAR localization and 3D object detection within a single, unified network. TACO leverages contrastive learning to explicitly decouple and align static geographic features for localization and object-centric features for detection. This bidirectional mutual supervision not only enhances localization robustness in dynamic environments by filtering dynamic noise but also boosts detection accuracy via effective spatial context. Additionally, we propose OxfoLD, the first dataset that provides multi-traversal LiDAR localization ground truth with rich 3D object annotations, thereby supporting task validation across various times and weather conditions. Experimental results demonstrate that TACO achieves state-of-the-art localization accuracy while maintaining competitive detection performance. The code and dataset will be released.

Abstract:
The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.

Abstract:
We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance, with our U-Net variant running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision via mouth-altered target variants as pseudo ground truth, teaching the network to localize lip edits while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-pose vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.

Abstract:
Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

Abstract:
In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bezier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30-50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30-40% compared to existing differentiable vectorization methods.

Abstract:
In Class-Incremental Learning (CIL), parameter efficient fine-tuning applied to Pre-trained Models (PTMs) remain vulnerable to catastrophic forgetting as they adapt to new tasks. The prevalent strategy to mitigate catastrophic forgetting is to constrain gradients within the orthogonal subspaces of past tasks, while rigid gradient constraints hinder plasticity. In this paper, we propose a novel CIL framework, Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation (DGS), that balances stability and plasticity via gradient fusion and maintains representation consistency through classifier and patch-token alignment. Specifically, our method introduces the Dual Gradient update strategy that first derives a base subspace projection from the PTMs and then fuses task-specific LoRA gradients with their aligned counterparts through interpolated combination. This design promotes knowledge retention without sacrificing task-specific expressiveness. Furthermore, we employ a Classifier Alignment mechanism with Semantic shift estimation which is based on the calibrated prototype statistics to mitigate classifier shift, and introduce a novel Patch-level Alignment loss to preserve feature consistency across tasks. Extensive experiments on six standard benchmarks demonstrate that our approach consistently outperforms existing CIL methods, highlighting its effectiveness and generalization capability in continual learning scenarios.

Abstract:
Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D-Perception and 4D-Reasoning. Specifically, we design: 1) CV-Aligner, which ensures Cross-View object semantic consistency via filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees Cross-Object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve Cross-Scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3x and 2.4x inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively.

Abstract:
Person re-identification (Re-ID) requires a balance between discriminative capability and computational efficiency for real-world deployment. However, even the Visual State Space Model (SSM), despite its linear complexity, suffers from redundant computation due to dense token processing. We propose SSM-aware Token-Efficient VMamba (TE-VMamba), which integrates adaptive patch pruning and merging modules to reduce redundant tokens while preserving identity-discriminative cues. The layer-adaptive pruning strategy removes low-importance tokens in shallow layers to enhance efficiency, whereas the depth-aware merging strategy consolidates semantically similar tokens in deeper layers to improve representation compactness. Learnable layer-wise thresholds dynamically balance accuracy and computational cost across the network. On the Market-1501 benchmark, TE-VMamba reduces FLOPs by over 60%, achieving significant computational savings while maintaining competitive accuracy. These results highlight the potential of structured token reduction in state-space models for efficient and powerful person re-identification.

Abstract:
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity--particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts.To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation.% Through rigorous data curation and filtering, SpatialScore surpasses several leading vision-language models (VLMs) and existing reward models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.

Abstract:
Virtual task assistants must recognize and explain users' mistakes to provide effective and corrective guidance. In this paper, we address the problem of error reasoning in long task videos, which is to detect and explain errors. Although recent Vision-Language Models (VLMs) demonstrate strong capabilities in visual question answering, they struggle to attend to the sparse spatiotemporal cues associated with errors in long task videos. We introduce an error reasoning framework, AXG-Reasoner, that leverages a frozen VLM in conjunction with a proposed Action eXecution Graph (AXG) and a temporal action segmentation (TAS) model, obtained and learned from normal (error-free) videos. To enable VLMs to attend to the sparse spatiotemporal cues associated with errors, we decompose each action segment of the video, obtained by TAS, into a sequence of fine-grained subactions by aligning it with the AXG. For each subaction segment, we query the VLM using a small number of keyframes and enhanced prompts to detect and explain errors, enabling data efficiency. To avoid costly manual subaction annotations, we develop a method to automatically construct AXG from training videos using foundation models. Extensive experiments on EgoPER and CaptainCook4D show that our method consistently improves over VLM baselines in error explanation by effectively identifying spatiotemporal cues and achieves state-of-the-art performance in error detection.

Abstract:
3D echocardiography provides superior cardiac quantification to traditional 2D echocardiography, which suffers from geometric idealizations and imaging plane misalignment. However, despite its advantages, clinical adoption of 3D echo remains limited due to logistical and visualization challenges. We propose a novel framework that reconstructs the 3D shape of the left ventricle (LV) throughout the cardiac cycle from sparse 2D echocardiographic views routinely acquired in clinical practice, without the need for external hardware or manual tracking. Our method integrates EchoPOSE, a new deep network that automatically estimates the 6D pose (position and orientation) of LV segmentations, with a graph-harmonic algorithm for 3D shape reconstruction. EchoPOSE employs a transformer-based architecture that combines local image features with global multi-view context, and introduces a geometry-aware loss to ensure spatial consistency across intersecting imaging planes. Trained and evaluated on large-scale synthetic data derived from 3D echocardiography and validated on prospectively acquired clinical echocardiograms, EchoPOSE achieves 3.78 mm and 8.65^ \circ pose errors, yielding 87.5% Dice reconstruction accuracy, 1.44% ejection fraction (EF) error, and 3.03% volume error, surpassing alternative deep learning techniques and classical clinical approaches. Notably, the framework remains robust under suboptimal imaging alignment, suggesting that EchoPOSE can reduce the sonography skills required for transducer positioning and allow minimally trained clinicians to perform echo scans.

Abstract:
Weakly supervised learning (WSL) provides a cost-effective learning paradigm for video anomaly detection (VAD) from data with video-level annotation instead of requiring costly fine-grained segment-level annotation. Although contemporary methods have shown promising results on challenging real-world surveillance videos, most of them are evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC). We reveal that a high AUROC could result in a very low recall for meaningful False Positive Rate (FPR) thresholds. Thus, these models suffer from limited practical values, especially in high-stake domains (e.g. public safety and medical diagnosis), where missing the true anomalies incur high cost. This surprising phenomenon is rooted in the interplay of weak supervision and the highly imbalanced distribution between normal and anomalous video segments. To tackle this key challenge in VAD systems, we propose a novel dual exploration strategy that combines temporal clustering with uncertainty-based segment exploration. Temporal clustering selects diverse segments based on both semantic and temporal similarity, while uncertainty-based sampling targets low-scoring segments with high model uncertainty. The main aim of exploration is to ensure that the model learns from a wide range of patterns, both diverse and ambiguous, resulting in more informed and robust decision-making, and reduction in false negatives. Meanwhile, we adopt two practical metrics to replace the commonly used AUROC score for a more effective measure for evaluation. Experiments conducted in challenging real-world videos demonstrate that our exploration strategy improves VAD performance compared to the baselines on these metrics, which justifies its improved practical value in real-world settings.

Abstract:
Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample-efficiency and high computational cost. In this work we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly with the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.

Abstract:
3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.

Abstract:
Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.

Abstract:
Video data provides a rich source beyond expensive action-labeled data for advancing robot learning. Recent approaches have demonstrated promising potential in leveraging video data by learning latent actions for policy training. The latent action tokenizer encodes latent actions between successive video frames, and the tokenizer is trained to reconstruct future frames using current frames and the encoded latent actions. However, the unique pairing of successive frames permits future frame reconstruction with little understanding of transition dynamics, hindering the learning of semantically consistent latent actions. Moreover, the tokenizer typically allocates distinct latent action subsets to individual embodiments to accommodate heterogeneous morphologies, constraining knowledge transfer. To overcome such limitations, we propose the action-centric cycle consistency, aiming to establish a unified latent action space. Our method samples latent actions from the latent action space and decodes them with video frames to generate diverse subsequent frames, then enforces cycle consistency by predicting the sampled actions from both original and generated frames. Our concise method creates a challenging task that learns corresponding latent actions from current frames and diverse generated future frames, compelling the tokenizer to develop semantically consistent action representations. Additionally, sampled latent actions can be applied to video frames from distinct embodiments, facilitating the alignment of latent actions across embodiments. Experiments demonstrate that our approach achieves a 20.1% improvement over OpenVLA on the LIBERO benchmark and increases the average length from 3.27 to 3.93 on the CALVIN benchmark. In real-world experiments, our method maintains strong performance with a 44% improvement.

Abstract:
Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASFormer achieves state-of-the-art performance. Code will be made publicly available.

Abstract:
We introduce an unsupervised approach for segmenting multiscale subcellular objects in 3D volumetric cryo-electron tomography (cryo-ET) images. To this end, we address key challenges such as lack of annotated data, large data volumes, high heterogeneity of subcellular shapes and sizes, and high inter-domain variability of cellular cryo-ET images across different experiments and contexts. Our method requires users to only select a small number of slabs from a few representative tomograms in the dataset. The core of our method is extracting features for the corresponding slabs, leveraging a Stable Diffusion foundation model pretrained on mostly natural images. The feature extraction is followed by a novel heuristic-based feature aggregation strategy, and adaptive thresholding to segment the aggregated features. The resulting masks are refined with pretrained CellPose to split composite regions, and then utilized as pseudo-ground truth for training supervised deep learning models. We validated our unsupervised foundation-model based pipeline on publicly available cryo-ET benchmark datasets, demonstrating performance that closely approximates expert human annotations. This fully automated, data-driven framework enables the mining of multi-scale subcellular patterns, paving the way for accelerated biological discoveries from large-scale cellular cryo-ET datasets.

Abstract:
In this work, we observe that for indoor 3D object detection, fundamental geometric cues induce homogeneous spatial responses across scenes, whereas scene-specific structure yields heterogeneous signatures. However, existing detectors lack effective mechanisms to jointly extract and exploit such dual properties, which imposes inherent limitations on detection performance. Guided by this insight, we propose H^2A^2, a homogeneity-aware and heterogeneity-aware feature perception network for unified indoor 3D object detection under cross-scene training paradigms.Technically, we introduce a structural-feature-aware kernel selection (SF-KS) method, which encompasses three core components:(i) task-aware linear modulation, a channel-wise affine transformation that strengthens scene-structural feature representation; (ii) kernel weight selection strategy that integrates an offset validity prior to suppress non-informative cross-scene transfer while utilizing a structural consistency posterior to capture scene-homogeneous cues. and (iii) task-aware channel gating that suppresses scene-irrelevant feature responses. Overall, SF-KS enables the precise optimization of homogeneous features while specializing in scene-specific heterogeneous ones. In addition, to stabilize cross-scene optimization, we further introduce norm-based gradient homogenization (NGH) algorithm, which normalizes and dynamically reweights per-task gradient norms to mitigate conflicts and promote consistent updates. Extensive experiments on diverse indoor benchmarks show that H^2A^2 delivers consistent gains over strong baselines and improves cross-scene generalization.

Abstract:
Hierarchical classification (HC) on degraded images presents challenges due to feature corruption, unreliable confidence estimation, and fine-grained misclassification. Existing methods often struggle to balance semantic consistency and adaptive decision paths under low-quality visual conditions. To address this, we propose HierUQ, a unified framework that integrates uncertainty quantification with adaptive granularity reconciliation. A Vision Transformer backbone extracts global features, which are fused with semantic embeddings via bilinear and semantic-guided cross-attentions. We develop a principled Hierarchical Uncertainty Quantification (HUQ) strategy based on label smoothing and proper scoring rules. When confidence is insufficient, a Confidence-Aware Path Adjustment (CAPA) mechanism adaptively rolls back predictions to higher-level nodes, mitigating overclassification and error propagation, stabilizing the learning trajectory, overcoming degradation-induced interference, and enhancing fine-grained classification accuracy. To enhance learning, we introduce a self-paced joint optimization (MLJO) over multi-level objectives with dynamic loss weighting. Experiments on degraded remote sensing and natural image benchmarks show that HierUQ achieves state-of-the-art performance with strong robustness and adaptability.

Abstract:
Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.

Abstract:
Video dataset condensation aims to reduce the immense computational cost of video processing. However, it faces a fundamental challenge regarding the inseparable interdependence between spatial appearance and temporal dynamics. Prior work follows a static/dynamic disentanglement paradigm where videos are decomposed into static content and auxiliary motion signals. This multi-stage approach often misrepresents the intrinsic coupling of real-world actions. We introduce Progressive Refinement and Insertion for Sparse Motion (PRISM), a holistic approach that treats the video as a unified and fully coupled spatiotemporal structure from the outset. To maximize representational efficiency, PRISM addresses the inherent temporal redundancy of video by avoiding fixed frame optimization. It begins with minimal temporal anchors and progressively inserts key-frames only where linear interpolation fails to capture non-linear dynamics. These critical moments are identified through gradient misalignments. Such an adaptive process ensures that representational capacity is allocated precisely where needed, minimizing storage requirements while preserving complex motion. Extensive experiments demonstrate that PRISM achieves competitive performance across standard benchmarks while providing state-of-the-art storage efficiency through its sparse and holistically learned representation.

Abstract:
Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

Abstract:
Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increase Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head.Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates C_S and C_I on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.

Abstract:
We present GP-4DGS, a novel framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for principled probabilistic modeling of dynamic scenes. While existing 4DGS methods focus on deterministic reconstruction, they are inherently limited in capturing motion ambiguity and lack mechanisms to assess prediction reliability. By leveraging the kernel-based probabilistic nature of GPs, our approach introduces three key capabilities: (i) uncertainty quantification for motion predictions, (ii) motion estimation for unobserved or sparsely sampled regions, and (iii) temporal extrapolation beyond observed training frames. To scale GPs to the large number of Gaussian primitives in 4DGS, we design spatio-temporal kernels that capture the correlation structure of deformation fields and adopt variational Gaussian Processes with inducing points for tractable inference. Our experiments show that GP-4DGS enhances reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity. By addressing these challenges, our work takes a meaningful step toward bridging probabilistic modeling and neural graphics.

Abstract:
We introduce WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained auto-regressive backbone (Janus-1.5B) from causal decoding to non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, WAM-Flow achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 88.7 PDMS and 5-step inference reaching 90.3 PDMS on NAVSIM v1 benchmark. These results establish discrete flow matching as a new promising paradigm for end-to-end autonomous driving.

Abstract:
Avoiding catastrophic forgetting for previous tasks and maintaining model plasticity to support new tasks are two critical objectives of continual learning. However, existing methods usually neglect one of the two aspects and fail to support long task sequences with satisfactory performance, especially in resource-constrained scenarios in which the size of the model is limited. This work proposes GRAPA, a parameter-efficient continual learning method that well balances stability and plasticity of the model to handle long task sequences with diverse complexities. GRAPA enhances model plasticity without sacrificing stability with two novel designs. First, a gradient-guided parameter reuse strategy is proposed to make full use of frozen parameters while ensuring that no task interference is introduced. Second, a reinforcement-learning-based parameter allocation is designed to enable the model to adapt to the current task on top of reused parameters while preserving maximal model capacity for future tasks. Experiments on multiple task sequences composed of various datasets demonstrate that GRAPA lifts mean task accuracy by up to 7.67%, with up to 14.92% gains on subsequent complex tasks, reflecting GRAPA's superior plasticity.

Abstract:
3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical solution of the PDE, which enhances both global and local constraints. Additionally, an effective Gaussian densification strategy and particle constraints are introduced to ensure fine-grained details. Extensive qualitative and quantitative experiments confirm that our method achieves state-of-the-art rendering and reconstruction quality.

Abstract:
Video data is more cost-effective than motion capture data for learning 3D character controllers, yet using it to generate realistic and physically plausible motions remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to extract 3D kinematic trajectories for physics-based imitation. However, these approaches are fundamentally limited by the physical implausibility of the estimated motions and lack a unified framework for handling non-human subjects and complex human-object interactions (HOI). We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from 2D motions extracted from videos. By formulating a 2D motion tracking objective, we train a general single-view tracking policy capable of imitating arbitrary reference motions in a physics simulation using exclusively 2D data. Furthermore, by training on 2D motions from diverse viewpoints, our view-agnostic policy implicitly learns 3D motion structures, allowing it to perform full 3D motion tracking simply by aggregating multiple views. To extend this capability, we integrate the tracking policy with an autoregressive transformer-based 2D motion generator, constructing a hierarchical control framework capable of long-horizon motion synthesis and interactive control. We show that the proposed approach is versatile and effectively synthesizes physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal locomotion, without any reliance on explicit 3D motion data.

Abstract:
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.

Abstract:
Active 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose GL-Graph, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce 4D-Reg, which identifies floaters through manifold discrepancies among three depth types (R-Depth, alpha-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiments demonstrate state-of-the-art reconstruction completeness and rendering fidelity on standard benchmarks.

Abstract:
Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues.First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs.To address these limitations, we propose Position & Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by the classifier-free guidance, amplifies visual grounding signals to enhance the model's alignment with visual evidence.Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps. Our code will be released after publication.

Abstract:
In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization.

Abstract:
In-generation watermarking for latent diffusion models has recently shown high robustness in marking generated images for easier detection and attribution. However, its application to autoregressive (AR) image models is underexplored. Autoregressive models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a VQ-VAE decoder. Inspired by KGW watermarking for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose a watermarking approach based on visual token clustering, which assigns similar tokens to the same set (red or green). We investigate token clustering in a training-free setting, as well as in combination with a more accurate fine-tuned token or cluster predictor. Overall, our experiments show that cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming a set of baselines and concurrent works. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking techniques.

Abstract:
Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances in the field through existing methods, the sample-independent modeling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space and achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.

Abstract:
We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice.We address these limitations with a sparse, edge-guided online 3DGS framework. Our method represents the scene as an edge-aligned sparse Gaussian map and estimates 6-DoF camera poses by aligning rendered 3D edges with observed 2D edges using a distance transform based objective, yielding roughly 2x faster per-iteration pose optimization than existing Gaussian-based systems while recovering clear scene geometry. We further leverage a dual-channel hybrid pixel vision sensor that outputs blur-free, high-frame-rate spatial-difference edge signals alongside RGB images, and use these signals both for robust edge-based tracking and for a mutual supervision scheme that mitigates motion blur in dense 3D reconstruction. Our system maintains stable tracking and high-fidelity geometry under extremely high-speed motion, where existing RGB-only methods fail, while remaining compatible with standard RGB cameras and achieving competitive tracking accuracy.

Abstract:
We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.

Abstract:
Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.

Abstract:
Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.

Abstract:
Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving accuracy of model on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

Abstract:
Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

Abstract:
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven perception actively with robust, viewpoint-invariant execution accordingly. To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Central to our approach is a decoupling of camera and manipulation actions, contrary to shared-action-space, and learning in a bottom-up strategy: we first train semantic camera control on our proposed large-scale dataset, then jointly optimizes both action types via hybrid data. To support this, we introduce ActiveViewPose-200K, comprising 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We further present ActiveManip-Bench, the first benchmark filling the gap to evaluate active manipulation. Extensive experiments in both simulation and real-world settings show that SaPaVe outperforms recent VLA models such as GR00T N1 and pi0, achieving up to 31.25% higher success rates in real-world tasks. Our results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation.

Abstract:
End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.

Abstract:
Nuclei instance segmentation and classification are fundamental but remain challenging in pathology due to severe class imbalance and organ- and stain-induced variability. While vision-language approaches can inject explicit semantic cues that reduce spurious contextual bias under imbalance, the absence of instance level textual annotations has limited their utility for nucleus-level analysis. We introduce an instance-level vision-language framework that derives attribute-guided textual descriptions from ground-truth masks. We then align visual representations with these semantic text anchors via contrastive learning, coupling morphology with semantics at the instance level. To capture intra-class variations while maintaining organ-consistent class semantics, we learn multiple class-specific tokens that act as prototypes representing diverse submodes within a class, summarizing morphologically similar nuclei. Our approach improves both segmentation and classification without manual text labels, indicating that language-guided instance alignment combined with prototype-based semantic feedback yields more discriminative and generalizable nuclei representations.

Abstract:
This paper introduces Nestwork, a unified latent-diffusion framework for conditional 3D furnished house layout generation using a heterogeneous graph of rooms and furniture. Designing reasonable and controllable 3D layouts that reflect the underlying semantic structure of a house is a key challenge in AI-assisted architectural design. Existing graph-based methods either produce unfurnished multi-room layouts or generate furnished scenes one room at a time, preventing joint reasoning over room structure and furniture placement. Nestwork represents an entire house as a heterogeneous graph with typed room and furniture nodes and multiple spatial relations. A single unconditional autoencoder based on a heterogeneous graph attention network embeds this graph into a compact latent space, and a low-rank relational field compensates for missing geometric edge information at test time. A diffusion denoiser is trained once using random masking, enabling the same model to operate under different conditioning strengths, from topology-only to fully annotated graphs. Multi-level conditioning combines masked node-level attention with graph-level embeddings to support flexible user control, including layouts specified through natural-language descriptions. Experiments on the 3D-FRONT dataset show that Nestwork achieves high fidelity, structural consistency, and diversity. Controlled ablations further validate the contributions of each component.

Abstract:
As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes. We will open-source our system code, data, and checkpoints upon publication.

Abstract:
Video Large Language Models (VideoLLMs) face a fundamental performance bottleneck: the token explosion intrinsic to video inputs. The resulting O(N^2) prefill cost makes conventional Transformer inference prohibitively expensive at scale. Existing attempts fall into a hard accuracy-latency dilemma: naive sparsification risks losing essential temporal-spatial context, whereas naive parallelization introduces substantial communication and memory overhead. To overcome this impasse, we propose SegMo, a unified framework based on algorithm-system co-design that jointly optimizes what to compute (sparsification) and how to compute it (parallelism). Driven by the empirical discovery of Local Cohesion in VideoLLM attention, SegMo integrates two key components: (1) Content-Aware Sparsification (CAS): A lightweight, hierarchical algorithm that first employs Query Relevance for scene-level assessment, and then uses Temporal Redundancy for intra-scene static redundancy pruning, to generate a precise, non-uniform computation load, ensuring accuracy. (2) Locally-Cohesive Segment Parallelism (LSP): A novel paradigm that leverages attention locality to partition the video at scene boundaries, using a lightweight Global Context Injection mechanism to replace the massive communication and memory overheads of global attention. SegMo was validated across LVBench, LongVideoBench, and Video-MME. Our CAS module improved accuracy by up to 12.00%. When integrated with LSP, the full system (CAS + LSP) achieved a peak prefill acceleration of 3.55x, while still maintaining a significant accuracy gain of up to 8.31%.

Abstract:
Federated Learning (FL), a distributed learning paradigm that enables local training on user-held data across decentralized devices, is vulnerable to backdoor attacks due to limited visibility into client updates. Exploiting this opacity, adversaries induce targeted misbehavior on trigger inputs without affecting overall performance, thereby compromising the trust and integrity of collaborative training in federated learning systems. Existing federated backdoor attacks mainly concentrate on benign knowledge alignment on trigger-surface design or representation guidance to evade defense mechanisms. However, trigger-surface attacks suffer from insufficient alignment, leaving malicious knowledge distinguishable from benign updates. In contrast, representation-guided attacks attempt to obscure the boundary between benign and malicious behaviors. Nevertheless, excessive incorporation of benign knowledge within a shared parameter space leads to over-alignment, ultimately degrading attack effectiveness. To overcome shared parameter space dilemma in backdoor attack, we propose Batman, a novel backdoor attack that aligns benign knowledge within the malicious null space, which effectively decouples malicious space from shared parameter space and enables benign alignment in an orthogonal direction of this space that does not interfere with the attack effectiveness. To further enhance stealthiness, we combine both clean and global models to guide the alignment perturbation within this null space to evade detection. Experiments on four benchmark datasets demonstrate that Batman consistently achieves strong backdoor performance while remaining stealthy under various defenses.

Abstract:
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition.To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.

Abstract:
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.

Abstract:
Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). Empirical evidence shows a strong correlation between CoT length and task difficulty. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility on the Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.

Abstract:
Audio-Visual Segmentation (AVS) aims to accurately segment sounding objects in video frames by leveraging audio-visual correspondence cues. However, it remains challenging due to the intrinsic semantic incompleteness within a single modality and the semantic gap between audio and visual representations. Traditional feature-fusion decoding approaches struggle to suppress fusion noise effectively, while recent methods that incorporate data-dependent priors often increase the complexity of modeling audio-visual correlations, leading to poor cross-domain generalization. To address these issues, we propose a novel adaptive contrastive and prototype learning framework, BYOAVP, for AVS. Specifically, we design a Self-Supervised Audio Enhancement (SSAE) module that introduces contrastive learning to adaptively align audio representations with gradient-blocked visual embeddings, thus narrowing the semantic gap between modalities. Furthermore, a Dynamic Prototype Constraint (DPC) module is developed to refine pixel-wise category perception via momentum-based prototype updating, while enhancing localization of sounding regions through cross-modal interaction. Extensive experiments demonstrate that our method achieves SOTA performance across two AVS benchmarks and six sub-tasks, exhibiting strong robustness and generalization ability.

Abstract:
Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pre-trained models.

Abstract:
Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where pixel-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Traditional spherical CNNs partially address this mismatch but require costly spherical harmonic transform that constrains resolution and efficiency. We present Unified Spherical Frontend (USF), a distortion-free lens-agnostic framework that transforms images from any calibrated camera onto the unit sphere via ray-direction correspondences, and performs spherical resampling, convolution, and pooling canonically in the spatial domain. USF is modular: projection, location sampling, value interpolation, and resolution control are fully decoupled. Its configurable distance-only convolution kernels offer rotation-equivariance, mirroring translation-equivariance in planar CNNs while avoiding harmonic transforms entirely. We compare multiple standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world (PANDORA, Stanford 2D-3D-S) datasets, and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF scales efficiently to high-resolution spherical imagery and maintains less than 1% performance drop under random test-time rotations without training-time rotational augmentation, and enables zero-shot generalization to any unseen (wide-FoV) lenses with minimal performance degradation.

Abstract:
Most Camouflaged Object Detection (COD) methods rely on costly pixel-level annotations. Recent studies have adopted unsupervised COD (UCOD) to eliminate labeling costs, but still suffer from two issues:1) insufficient supervision, leading to reliance on self-supervised backbone DINO and reduced model flexibility; and 2) ineffective use of pseudo-labels, which widens the performance gap with supervised methods and limits real-world applicability. In this paper, we propose a novel teacher-student framework for UCOD to address these two issues. To tackle the lack of supervision, we build a powerful teacher model by integrating Multimodal Large Language Models (MLLMs) and the Segment Anything Model (SAM) to generate high-quality pseudo-labels. However, the teacher model faces two challenges: 1) suboptimal performance of MLLMs in COD, and 2) cascading errors.To address these challenges, we first propose a Camouflaged-Aware Chain-of-Thought (CA-CoT) for MLLMs. CA-CoT guides MLLMs through step-by-step reasoning to simulate human perceptual processes, thereby enhancing their performance in COD.Subsequently, we design a Graded Mask Evaluator (GME) to mitigate cascading errors, which evaluates and grades the quality of masks generated by SAM, and then filters out the low-quality masks to provide more reliable supervision.To better leverage pseudo-labels, we propose Graded Knowledge Distillation (GKD), which adaptively enhances distillation at both image and pixel levels based on pseudo-label quality. Extensive experiments show that our method outperforms existing UCOD approaches by a large margin. Notably, our method also achieves good performance under zero-shot settings. Code will be released.

Abstract:
Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.

Abstract:
Large vision-language models (LVLMs) have achieved impressive performance across a variety of multimodal tasks, yet remain vulnerable to targeted adversarial attacks, particularly in black-box settings. In this paper, we propose VCP-Attack, a transferable targeted attack framework that combines structured contrastive supervision with subspace-guided perturbation optimization. Specifically, we employ a dynamic PCA-based projection to constrain perturbations within semantically meaningful low-dimensional subspaces, and design a multi-sample contrastive loss to align adversarial features with target semantics while pushing them away from the source semantics. Extensive experiments on seven open-source and three proprietary LVLMs--including GPT-4o, Claude, and Gemini--show that VCP-Attack achieves state-of-the-art performance in black-box targeted attacks. Under a fixed perturbation budget (epsilon = 16/255), our method achieves an average attack success rate (ASR) of 94.2% on open-source models and 83.1% on proprietary models, surpassing the strongest baselines by 23.3% and 16.8%, respectively. Notably, VCP-Attack achieves a 95.6% ASR on GPT-4o. Comprehensive ablation studies and visualizations further validate the effectiveness of the dynamic subspace projection and semantic contrastive supervision. While evaluated on image captioning, our approach is model-agnostic and exhibits strong potential for broader applications to black-box adversarial settings in vision-language tasks.

Abstract:
Deploying pretrained visual models in real-world environments often suffers from significant performance degradation due to the diversity of testing scenarios. Continuous adaptation of learning models on edge devices via unlabeled data collected from the target domain is highly effective for boosting generalization capability. However, gradient-backpropagation-based optimization of the massive parameters in deep neural networks is vastly more time-consuming than forward inference, rendering online learning infeasible on low-power edge devices. To address this critical challenge, we propose a lightweight gradient-free forward-memorizing framework, namely MemFlow, which leverages a frozen backbone and enables efficient fine-tuning of the mapping between features and predictions. Specifically, MemFlow employs randomly connected neurons to memorize feature-label associations; within the network, spiking signals are propagated, and predictions are generated by associating neuron-stored memories according to their confidence levels. More notably, MemFlow supports reinforced memorization of feature mappings using unlabeled data, thereby enabling rapid adaptation to new domains. Extensive experiments on four real-world cross-domain datasets demonstrate that MemFlow achieves performance improvements of up to 10% while consuming less than 1% of the computational time required by traditional domain adaptation methods.

Abstract:
In online incremental learning, data continuously arrives with substantial shifts in distribution, creating a significant challenge since previous samples cannot be revisited. Prior research has typically relied on either a single adaptive centroid or fixed multiple centroids to represent each class in the latent space. However, such methods struggle when class data streams are inherently multimodal and require continual centroid updates. To overcome this, we introduce an online Mixture Model learning framework grounded in Optimal Transport theory (MMOT), where centroids evolve incrementally with new data. This approach offers two main advantages: (i) it provides a more precise characterization of complex data streams, and (ii) it enables improved class similarity estimation for unseen samples during inference through MMOT-derived centroids. Furthermore, to strengthen representation learning and mitigate catastrophic forgetting, we design a Dynamic Preservation strategy that regulates the latent space and maintains class separability over time. Experimental evaluations on benchmark datasets confirm the superior effectiveness of our proposed method.

Abstract:
Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.

Abstract:
Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.

Abstract:
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence.

Abstract:
Text-to-Image (T2I) models have achieved significant progress in generating high-quality images, with compositional visual generation emerging as an important capability that enables them to synthesize coherent, natural scenes from multiple discrete concepts. However, this powerful compositionality, while enhancing creativity, also introduces new safety risks: combinations of different concepts can produce high-risk images without explicitly expressing harmful content. Motivated by this, we propose CoRA (Composable Reassembly Attack): an attack method that preserves the original semantics while bypassing safety filters. Unlike traditional compositional generation approaches that rely on modifying the sampling process, CoRA operates solely in the text space under a black-box setting, iteratively rewriting and guiding prompts through interactive steps. Specifically, CoRA decomposes a potentially harmful intent into a set of fine-grained, superficially benign but semantically complete visual elements, and then uses iterative selection and reassembly to guide the target T2I model to recombine these elements without triggering safety checks, thereby recovering the original malicious semantics. Experimental results show that CoRA significantly improves attack success rates, producing higher-risk outputs while maintaining semantic consistency. Warning: This paper contains offensive or disturbing content.

Abstract:
The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M4-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M4-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.

Abstract:
Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be made available.

Abstract:
Vision-Language Models (VLMs) often over-rely on linguistic priors even when images are provided, leading to object hallucinations. We revisit object-wise hallucination from the perspective of how visual evidence shapes the model's uncertainty. For each input, we measure decision uncertainty with and without the image, and define a Visual Evidence Sensitivity (VES) signal as the image-attributable change in entropy. Building on this signal, we introduce Visual Evidence Sensitivity Reinforcement Fine-Tuning (VES-RFT), a training-time reinforcement fine-tuning method that explicitly rewards reliance on correct visual evidence. We pair this continuous, annotation-free signal with a verifiable reward that enforces factual object correctness by automatically checking generated object mentions against the image, yielding a computable objective without human annotations. We optimize the dual objective using critic-free GRPO with KL regularization, requiring only parallel image and no-image passes during training while preserving single-pass inference. Across multiple VLM families and benchmarks, VES-RFT consistently suppresses object hallucinations and improves robustness under ambiguity. Specifically, on LLaVA-7B, VES-RFT reduces 12.8 and 1.8 on CHAIR_S/CHAIR_I of MS-COCO, and increases POPE accuracy by 4.92%. Extensive experiments indicate that turning uncertainty into a learnable reward, paired with verifiable correctness signals, provides a scalable mechanism for training-time hallucination mitigation and stronger visual grounding.

Abstract:
A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome limitation of the CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be equipped on existing methods and broad their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Codes are available at github.com/2btlFe/GLA-CLIP

Abstract:
Semi-supervised learning (SSL) has achieved remarkable progress by leveraging both limited labeled data and abundant unlabeled data. However, unlabeled datasets often contain out-of-distribution (OOD) samples from unknown classes, which can lead to performance degradation in open-set SSL scenarios. Current OOD detection methods are constrained by the absence of labeled OOD samples. Although optimal transport (OT) has proven to be effective to provide pseudo OOD scores for supervised learning, it still faces two main challenges, i.e., the unavoidable problem of finding the optimal transport plan and the unreliable OOD score caused by dense solutions. To overcome thess limitations, we propose a novel open-set OOD detection model named DREW, which leverages Dynamic REWeighting approach for OT-based OOD detection. Specifically, we start by formulating OOD detection as a semi-unbalanced optimal transport (SemiUOT) problem. The proposed DREW model can dynamically transform SemiUOT into the classical OT formula and then directly obtain the pseudo OOD score from the new source distribution weights. Contrary to existing OT-based methods, DREW provides theoretically grounded and more accurate pseudo OOD scores, while avoiding the direct computation of the transport plan. Empirical results demonstrate the superiority of DREW in terms of both accuracy and efficiency. Extensive analytical experiments are conducted to elucidate the properties of each component.

Abstract:
Adversarial patch attacks pose a significant threat to the reliability of object detection (OD) models, particularly in real-time security applications. Although several defenses have been proposed, they often suffer from two limitations: 1) reduced performance on benign images, and 2) impractical processing time for real-time OD applications. In this paper, we present AntiStyler, a novel and rapid defense against adversarial patches. Given an input image, AntiStyler identifies and masks pixels that exhibit a "random" style associated with adversarial attacks and uses a series of spatial filters to enhance the mask and remove unwanted noise, efficiently masking adversarial patches. AntiStyler features model-, patch-, and attack-agnostic capabilities and does not require any training, making it a fully agnostic zero-shot defense against adversarial patch attacks. Our evaluation on the COCO, INRIA, Superstore, and APRICOT datasets, with both digital and physical attacks, demonstrates AntiStyler's state-of-the-art robustness (improving adversarial performance by 8-15 mAP%) without compromising the original performance on benign images. Additionally, unlike most existing defenses, AntiStyler can process 10-12 frames per second (FPS), making it efficient and relevant for real-time OD applications.

Abstract:
Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher-Student bootstrapping mechanism that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.

Abstract:
Federated Learning (FL) facilitates decentralized model training while preserving data privacy. However, achieving both robust generalization and effective personalization simultaneously in heterogeneous (non-IID) environments remains a formidable challenge. Furthermore, the widespread adoption of proprietary Foundation Models (FMs) introduces a critical requirement for dual privacy: (a) protecting sensitive client data and (b) securing the server's valuable intellectual property. This mandates strictly black-box access to the FM. To address these multifaceted challenges, we introduce FedOT, a novel FL framework optimized for black-box FMs. FedOT employs a shared global task-dependent classifier while facilitating local adaptation through client-specific orthogonal transformations applied externally to the FM embeddings. This architecture inherently guarantees that the FM's internal parameters remain inaccessible and unmodified. By enforcing orthogonality, FedOT effectively mitigates gradient conflicts across diverse clients, which is theoretically bounded, preserves the semantic integrity of the FM representations, and achieves robust performance under significant data heterogeneity. The synergy of global and local parameters optimally balances generalization and personalization, markedly outperforming baseline FL methods across diverse benchmarks. Extensive empirical analysis, including rigorous multi-seed validation and scalability assessments, substantiates the robustness, efficiency, and superior performance of FedOT.

Abstract:
Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving generalization in VLM-to-VLA transfer.

Abstract:
Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs.

Abstract:
Existing methods for deepfake detection aim to develop generalizable detectors. Although "generalizable" could be the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity, advancement, and vast volume of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications.However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC).In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a two-stage, model-agnostic framework termed DevDet (\underline Dev eloper for \underline Det ector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in effectively predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.

Abstract:
Resting-state functional MRI (rs-fMRI) provides rich information for modeling brain connectivity in disease diagnosis. However, most existing brain graph learning methods rely solely on imaging data, leading to limited biological interpretability and poor integration of external medical knowledge. To address these challenges, we propose an Interpretability-Enhanced Brain Graph Learning (IEBGL) framework that anchors brain network modeling in large-scale medical knowledge. Our framework introduces two complementary modules: LLM-Instructed Topological Reconstruction (LITR) and Literature-Augmented Semantic Aggregation (LASA). LITR employs large language model (LLM) reasoning to refine brain connectivity and construct topological structure. LASA augments node representations by aggregating semantic information from biomedical literature, ensuring the model's interpretability and relevance to clinical disease knowledge. Finally, the framework is trained with the Graph Bi-directional Mamba Network (GBMN) for disease diagnosis. Extensive experiments on the REST-meta-MDD and ABIDE datasets, together with 35,133 depression-related and 32,617 autism-related publications, demonstrate that IEBGL outperforms state-of-the-art methods in classification performance. Further analyses show that the LITR module reveals biologically meaningful alterations in brain connectivity, while the LASA module establishes interpretable associations between these regions and disease-related biomedical literature. Together, these mechanisms help IEBGL explain abnormal brain connections and their links to disease-related knowledge.

Abstract:
A faithful decision-making process requires models to ground human-understandable concepts both spatially (where they appear in the image) and causally (how they influence the prediction). Recent advances in Vision-Language Models (VLMs) enable concept-level alignment and have inspired Concept Bottleneck Models (CBMs), which explain predictions by mapping image representations to human-understandable concepts, allowing users to trace decisions through explicit semantic reasoning. However, existing CBMs suffer from two key inconsistencies. First, semantic inconsistency: VLMs often fail to localize fine-grained part-attribute concepts, producing noisy or incomplete masks. Second, object inconsistency: object-agnostic concepts such as "head: streamlined front profile" may describe multiple categories (e.g., fish or human); without enforcing object identity, non-targeted regions can introduce spurious evidence that corrupts the bottleneck representation. To address these challenges, we propose a new Object-Aware Concept Bottleneck Model (OA-CBM) that jointly enforces semantic- and object-level consistency. Specifically, (1) we redefine concepts as part-attribute pairs to enhance VLM robustness at the semantic level, and (2) introduce class-agnostic object clustering to suppress irrelevant visual evidence. We further annotate two grounding datasets with part-attribute descriptions and conduct extensive experiments. Results demonstrate that OA-CBM produces more faithful and robust explanations while maintaining competitive predictive performance.

Abstract:
We propose an integrated learning scheme of Video Super-Resolution and Enhancement in Low-Light environment, named VSRELL, which aims to recover Well-Illuminated High-Resolution (WIHR) sequence from Low-Light Low-Resolution (LLLR) counterparts. Due to the complex coupling of multiple degradations, this joint task has received relatively little attention. Our approach jointly models illumination enhancement and spatial-temporal super-resolution to disentangle intertwined degradations. Specifically, we introduce an Illumination-Noise Co-Optimization (INCO) network that employs a dynamic window partitioning strategy to explicitly model physical priors of illumination variations and noise distributions within individual frames of a long-term sequence. This effectively suppresses cross-frame noise accumulation and illumination flickering, achieving simultaneous optimization of motion compensation and brightness correction.Additionally, an Illumination-Sensitive Feature Propagation (ISFP) mechanism is introduced, which utilizes hierarchical illumination-sensing gating unit to adaptively modulate feature channel responses. By adjusting feature propagation intensity and using memory feature attenuation strategy, it can enhance the weighting of high-quality features and suppress error accumulation propagation and strengthen transmission efficiency. The experiments show that VSRELL can explicitly strengthen the brightness continuity and texture fidelity of the restored output, maintaining temporal consistency across the video.

Abstract:
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationships, allowing the system to robustly handle no-target cases through verification-aware abstention. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. VIRO also shows high reliability with a program failure rate of at most 0.3%, efficient per-query runtime, and scalability through decoupled program generation and execution.

Abstract:
The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recogni- tion. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propen- sity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accu- racy, this paper introduces PP-OCRv5, a meticulously op- timized, lightweight OCR system with merely 5 million pa- rameters. We demonstrate that PP-OCRv5 achieves per- formance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior lo- calization precision and reduced hallucinations. The cor- nerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dis- sect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diver- sity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two- stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https: //github.com/PaddlePaddle/PaddleOCR.

Abstract:
Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.

Abstract:
Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation and 3D object detection, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene understanding. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on synthetic and real-world benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only 7.5% additional parameters.

Abstract:
Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focus merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

Abstract:
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but they perform suboptimally on remote sensing imagery (RSI) due to severe domain shifts and the scarcity of dense annotations. To address this limitation, we propose a point-supervised, self-prompting framework that adapts SAM to RSI using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, in which coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned with Soft Semantic Alignment (SSA) to mitigate error propagation. (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, WHU, HRSID, and NWPU VHR-10, demonstrating that it consistently outperforms pretrained SAM and recent point-supervised segmentation methods. Compared to the fully supervised model, our approach reduces the performance gap to 1.3% (WHU), 4.9% (HRSID), and 8.5% (NWPU) while relying only on 1-point annotations. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.

Abstract:
Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25--200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms.Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost.

Abstract:
3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature detection or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, angle misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 40% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.

Abstract:
Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, which establish correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research.To our best knowledge, this work presents the first systematic investigation of the Noisy Correspondence in Cross-View Geo-Localization (NC-CVGL) problem, specifically addressing the practical scenario where a significant portion of training pairs exhibit spatial misalignment due to GPS inaccuracies. To this end, we propose PAUL (Partition and Augmentation by Uncertainty Learning), a framework that achieves noise-robust learning through three coordinated mechanisms: Co-partition separates noisy from clean samples using data uncertainty and loss patterns; Co-augmentation enhances high-confidence regions via local assessment; and Co-training refines learning through mutual supervision between dual networks.Unlike conventional noise-handling methods that filter or relabel noisy samples, PAUL effectively utilizes noisy data through this triple collaborative mechanism. Comprehensive experiments validate the effectiveness of individual components in PAUL, which consistently achieves superior performance over other competitive noisy-correspondence-driven methods in various noise ratios.

Abstract:
Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning and inference-time alignment techniques aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)--the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.

Abstract:
The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks.To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.

Abstract:
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of initial sampling group to maximize trajectory diversity, then it applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.

Abstract:
The temporal evolution patterns of surface spatial structures constitute a central concern within the field of intelligent remote sensing interpretation.However, constrained by the availability of only two temporal phases, modeling sparse spatio-temporal change processes to effectively interpret surface alterations remains a core challenge in intelligent remote sensing analysis. To address this, this paper proposes SpikeAdapter, a lightweight enhancement framework. This framework comprises Geo-Spike Interpolation (GSI-P), an spiking neural network (SNN) feature extractor, and the spatio-temporal fusion module STSpikeFuse. Inspired by the brain's perceptual response to new and fading stimuli, the core GSI-P module transforms bi-temporal radiometric differences into sparse spike sequences with time-to-first-spike characteristics.Then we use a feature extractor of SNN to capture dynamic variations of land-surface targets. The STSpkeFuse module employs a learnable temporal decay mechanism to adaptively fuse the SNN features with the semantic representations. This representations are generated by a traditional artificial neural network (ANN) backbone.Extensive experiments on change detection datasets demonstrate that SpikeAdapter effectively enhances temporal awareness and interpretability.

Abstract:
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax--improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.

Abstract:
We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.

Abstract:
Radiology report generators often produce fluent text yet miss crucial details, leading to local semantic conflicts or flipped findings that require stronger penalties. Cross-entropy (CE) merely increases the probability of the ground-truth token y^ without directly suppressing the model's current wrong choice \hat y , and treats all positions uniformly, so corrections are not prioritized. We introduce a self-adaptive optimization framework that dynamically adjusts token-level gradients based on semantic discrepancy cues derived from a frozen LLM referee. The LLM itself is not the contribution--it merely provides weak supervision to trigger the adaptive learning process. Within this framework, (i) semantic conflicts between the predicted and reference reports are automatically localized and tagged with "..." (used only during training), and (ii) adaptive, stronger penalties are applied within these sparse but critical spans. Updates follow a push-pull scheme: error spans are pushed down, while non-error tokens are reinforced. The update strength is governed by two complementary signals--normalized entropy (for uncertainty calibration) and focal-style confidence (for handling over- and under-confident predictions). On MIMIC-CXR and IU-Xray, our framework consistently improves both language metrics (BLEU-4, ROUGE-L, CIDEr) and clinical metrics (RadGraph F1, CheXbert), and remains robust to noisy or imperfect error tags.

Abstract:
Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, enable real-time rendering with high visual fidelity on sufficiently powerful graphics hardware. However, efficient online transmission and rendering across diverse platforms requires drastic model simplification, reducing the number of primitives by several orders of magnitude. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured set) of a small number of triangles with neural textures and binary opacity. We show that this binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without a mollifier (i.e., smooth rasterization). DiffSoup can be rasterized using standard depth testing, enabling seamless integration into traditional graphics pipelines and interactive rendering on consumer-grade laptops and mobile devices.

Abstract:
Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search.

Abstract:
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.

Abstract:
Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without offline retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions, driven by a unified discrepancy metric. Our key innovation lies in discrepancy-driven coupling, which unifies removal and refinement through a shared metric and automatically balances both strategies based on feature-level shift severity. This establishes channel-wise adaptive balancing that enables differentiated treatment for heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

Abstract:
Semi-supervised medical image segmentation (SSMIS) aims to alleviate annotation scarcity, but general methods, often developed on few-class datasets, suffer performance degradation in class-imbalanced multi-organ scenarios. Existing class-imbalanced SSMIS methods also struggle, as their single-decoder architecture is forced to handle vastly different scales with shared parameters. This process is easily dominated by majority classes, fundamentally limiting tail-class segmentation capability. To address this, we propose a "Divide, Conquer, and Aggregate" (DCA) framework, featuring a unified encoder, three expert decoders, and an aggregation decoder. First, we Divide by applying a Logarithmic Gap Analysis to statically partition foreground classes into stable Head, Medium, and Tail sets, which aligns with anatomical priors. Then, we Conquer by training the three architecturally asymmetric experts independently using a label-split strategy. This fundamentally alleviates the burden on a single decoder. The experts' predictions on unlabeled data are fused via logit stitching to generate high-quality pseudo-labels. Finally, we Aggregate using an aggregation decoder with a Dynamic Feature Aggregation Module (DFAM), which dynamically fuses priors from all three experts to achieve unbiased predictions and fully leverage unlabeled data. Experiments demonstrate that our DCA framework significantly outperforms state-of-the-art general and class-imbalanced SSMIS methods.

Abstract:
Existing Test-Time Adaptation (TTA) methods for Vision-Language Models (VLMs), focusing on designing efficient adaptation parameters (eg. prompts or residual prototypes), predominantly rely on high-confidence samples obtained via entropy-based filtering. However, this prevailing paradigm implicitly inherits the VLM's class-wise prediction biases and leads to insufficient coverage of the test distribution, rendering the adaptation process biased and insufficiently exploratory.To overcome these limitations, we propose Dynamic Logits Adjustment and Exploration (DLAE), a novel framework that integrates Dynamic Logit Adjustment (DLA) with a Consistency-Guided Exploratory Cache (CGEC). DLA dynamically recalibrates model logits based on test prediction statistics, thereby mitigating class-wise prediction inconsistencies. Different from traditional cache mechanisms, our CGEC actively identifies additional samples near decision boundaries whose predicted labels are sensitive to the logit adjustment, thereby exploring beyond only high-confidence samples. By enforcing semantic and temporal consistency, the cache preserves the reliability of selected samples while enabling cautious yet effective exploration of low-confidence regions, ultimately yielding stable and reliable adaptation.Extensive experiments across multiple vision-language benchmarks demonstrate that our approach consistently surpasses state-of-the-art TTA methods, showing superior stability, adaptability, and generalization.

Abstract:
Owing to the weak stereo geometry of satellite images, Planar Block Adjustment (PBA) is a predominant technique for correcting geometric distortions in satellite images, which treats elevation as a known constraint and primarily optimizes planar coordinates. Existing PBA methods mainly rely on explicit tie points, suffering from parallax caused by inaccurate elevation (e.g., near high buildings) and irreversible error accumulation, which severely degrades adjustment accuracy. In this paper, a "Beyond Tie Points" paradigm for satellite image adjustment is proposed. A pretrained feature extractor is employed to extract robust dense features and a parallax-aware confidence map from each image. A gridded coarse-to-fine optimization framework then directly solves for the adjustment parameters basing on confidence-weighted feature consistency. Experiments conducted on multiview satellite image datasets covering Beijing, Guangzhou and San Jose demonstrate that the proposed method is significantly superior to traditional approaches in both accuracy and robustness, reducing the average error by up to 75.43% compared to traditional PBA.

Abstract:
Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images,which is essential for applications such as storytelling and character design.Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences.In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner.To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm.The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing.It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons.The second component, PaCo-GRPO,leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost,alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization.Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency,and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability.Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation.We will release the code, dataset, and models.

Abstract:
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

Abstract:
Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community.Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline.However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context.To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework.Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity.We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks.Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area.Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserves true physical proportions across diverse object types, highlighting its potential for practical robotic applications.

Abstract:
Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

Abstract:
Incremental learning (IL) arises from the need to continuously update models under limited data and computational resources. Most existing IL studies focus on data-scarce settings. They often develop complex methods that rely on heavy computation, while overlooking the computational resource constraints common in real-world scenarios. This motivates us to formalize the problem of Computational Resource-Aware Incremental Learning, which explicitly considers the computational budget during model training. To tackle this problem, we propose Smart Replay, an efficient memory rehearsal algorithm that adaptively allocates resources by scheduling the replay ratio across mini-batches. We cast replay-ratio optimization into an optimal control formulation that jointly minimizes new-task and memory losses. We further propose a heuristic Q-function to guide ratio adjustments, adaptively balancing short-term efficiency and long-term stability. Finally, we develop a practical algorithm that periodically updates the replay ratio during training. Experiments on multiple benchmarks validate that Smart Replay consistently outperforms fixed-replay baselines, achieving higher accuracy and lower forgetting under the same computational budget.

Abstract:
Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction that real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernible distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness.

Abstract:
Generalist Anomaly Detection (GAD) seeks to overcome the domain-specific limitations of traditional anomaly detection by training a unified model that can generalize to unseen classes. A promising GAD strategy involves using residual features to create a class-invariant space. However, existing methods that directly model the distribution of residuals face unpredictable risks: there is inconsistency between residual and instance features, i.e., subtle defects may yield small residuals (false negatives), or normal feature residuals could be large due to the diversity of normality (false positives). To address these limitations, we propose a novel residual-based learning framework that re-purposes residuals as a guide to learn instance-level normality, rather than modeling their distribution directly. Our framework features two new attention-based modules: Residual Feature Learning (RFL), which uses learnable proxies to capture diverse patterns from the residual features, and Normality Learning from Support (NLS), which leverages these residual proxies to aggregate query-related normality proxies from the support instance features. These dynamically generated normality proxies are then used to hunt for normality within the query patch features, enabling accurate anomaly localization. Extensive experiments on GAD benchmarks demonstrate the effectiveness of our method. Code is available at HNQ-GAD.

Abstract:
Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference.In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances the robustness by minimizing the energy of the input samples.Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense .

Abstract:
Existing appliance assets suffer from poor rendering, incomplete mechanisms, and misalignment with manuals, leading to simulation-reality gaps that hinder appliance manipulation development. In this work, we introduce the RealAppliance dataset, comprising 100 high-fidelity appliances with complete physical, electronic mechanisms, and program logic aligned with their manuals. Based on these assets, we propose the RealAppliance-Bench benchmark, which evaluates multimodal large language models and embodied manipulation planning models across key tasks in appliance manipulation planning: manual page retrieval, appliance part grounding, open-loop manipulation planning, and closed-loop planning adjustment. Our analysis of model performances on RealAppliance-Bench provides insights for advancing appliance manipulation research

Abstract:
Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint.

Abstract:
Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.

Abstract:
Constructing a 4D language field that supports open-vocabulary queries is essential for semantic perception and interaction in dynamic environments. Existing 4D Gaussian-based approaches face two major challenges. First, the assumption of a static identity per Gaussian leads to semantic inconsistency, as motion fields warp Gaussians across object boundaries over time, causing oscillating identity assignments. Second, current methods typically model dynamic semantics as a set of discrete, predefined state prototypes, which fail to capture dynamic continuity and delineate fine-grained temporal boundaries. To address these issues, we propose LangField4D, a novel 4D Gaussian framework that jointly models spatio-temporal identity and semantics in a unified and continuous representation. We introduce an Identity-Adaptive Gaussian Grouping module that assigns each Gaussian a learnable adaptation feature to dynamically capture its object affiliation, ensuring consistent semantic tracking across time. Building upon this affiliation structure, we further design a Continuous Spatio-Temporal Semantic Learning mechanism based on a Tetraplane representation, which encodes both time-invariant and time-varying semantics within a continuous latent space. Extensive experiments on dynamic scene benchmarks demonstrate that we achieve state-of-the-art performance on both time-agnostic and time-sensitive open-vocabulary query tasks.

Abstract:
Novel view synthesis for dynamic scenes remains challenging due to complex motion variations. Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of static and dynamic Gaussian primitives still limits performance. We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis. To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic-static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints. Experiments show that our method achieves state-of-the-art rendering quality and real-time performance.

Abstract:
As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.

Abstract:
Charts are high-density visual carriers of complex data and medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.

Abstract:
Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code will be made publicly accessible upon acceptance.

Abstract:
We introduce PhysGaia, a novel physics-aware benchmark for Dynamic Novel View Synthesis (DyNVS) that encompasses both structured objects and unstructured physical phenomena. While existing datasets primarily focus on photorealistic appearance, PhysGaia is specifically designed to support physics-consistent dynamic reconstruction. Our benchmark features complex scenarios with rich multi-body interactions, where objects realistically collide and exchange forces. Furthermore, it incorporates a diverse range of materials, including liquid, gas, textile, and rheological substance, moving beyond the rigid-body assumptions prevalent in prior work. To ensure physical fidelity, all scenes in PhysGaia are generated using material-specific physics solvers that strictly adhere to fundamental physical laws. We provide comprehensive ground-truth information, including 3D particle trajectories and physical parameters (e.g., viscosity), enabling the quantitative evaluation of physical modeling. To facilitate research adoption, we also provide integration pipelines for recent 4D Gaussian Splatting models along with our dataset and their results. By addressing the critical shortage of physics-aware benchmarks, PhysGaia can significantly advance research in dynamic view synthesis, physics-based scene understanding, and the integration of deep learning with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.

Abstract:
Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a produce-refine-evaluate isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

Abstract:
Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 (4x), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

Abstract:
Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing modalities (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework consists of two complementary modules. The Feature Calibration Module (FCM) operates at the representation level, recalibrating unimodal features through global contextual information to build a shared representation basis across heterogeneous missing patterns. The Gradient Rebalancing Module (GRM) works at the optimization level, equalizing learning dynamics across modalities by modulating gradient magnitudes and directions from distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings.

Abstract:
Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, an Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

Abstract:
By 2026, up to 90% of online content may be synthetically generated, raising urgent concerns about the proliferation of AI-driven disinformation. In response, policymakers and technology companies are turning to watermarking as a safeguard: California's Bill AB 321 mandates watermarking of AI-generated media, while firms like Meta and Google have begun deploying watermarking systems to mitigate misuse. However, current watermarking methods remain brittle and vulnerable to attack. In this work, we introduce the visual paraphrase attack, a generative method that removes both visible and invisible watermarks from AI-generated images. The attack proceeds in two steps: (1) generating a descriptive caption from the original image, and (2) feeding this caption into a diffusion-based text-to-image model to produce a visually similar, watermark-free image. Our experiments show that this attack reliably removes watermarks while preserving semantic content of the original image, exposing a critical flaw in existing watermarking strategies. To counter this, we propose PECCAVI, the first watermarking technique explicitly designed to resist visual paraphrasing. PECCAVI embeds robust, high-fidelity watermarks (PSNR > 30 dB) in semantically stable image regions termed Non-Melting Points (NMPs) using multi-channel frequency domain encoding and noisy burnishing to obfuscate watermark locations and hinder reverse engineering. The method is model-agnostic and significantly more resilient than current alternatives. We also release the first benchmark dataset for visual paraphrasing attacks and open-source all code and resources, providing a foundation for future research on robust watermarking in the era of generative AI

Abstract:
Multimodal Large Language Models (MLLMs) with explicit step-by-step reasoning have achieved strong performance on complex tasks. However, such reasoning is unnecessary for many simple queries and introduces substantial computational overhead. To address this inefficiency, we present R-4B, an auto-thinking MLLM that dynamically determines whether to invoke the reasoning process based on input complexity.Our key idea is to equip a single model with both thinking and non-thinking capabilities and train it to select the appropriate mode. We first introduce bi-mode annealing, a unified training paradigm that constructs a model competent in both reasoning-intensive and direct-answer settings without requiring explicit complexity annotations. Building on this foundation, we propose Bi-mode Policy Optimization (BPO), a lightweight reinforcement learning algorithm that employs a dual-rollout mechanism: for each input, the model generates both thinking and non-thinking responses. This prevents mode collapse and enables robust learning of an adaptive reasoning policy using only simple, rule-based rewards.Extensive experiments across 25 benchmarks show that R-4B achieves state-of-the-art performance among models of similar scale. It consistently surpasses Qwen2.5-VL-7B and matches or exceeds larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive tasks, while reducing computational cost by avoiding redundant reasoning. Our results demonstrate that adaptive auto-thinking offers an effective and scalable pathway toward more efficient multimodal reasoning models.

Abstract:
Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an l_1 deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text to vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

Abstract:
Despite significant progress has been made in image deraining, we note that most existing methods are often developed for only specific types of rain degradation and fail to generalize across diverse real-world rainy scenes. How to effectively model different rain degradations within a universal framework is important for real-world image deraining. In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streak and raindrop under both daytime and nighttime conditions. To better enhance unified model generalization, we construct an intelligent retrieval augmented generation (RAG)-based dataset distillation pipeline that selects high-quality training samples from all public deraining datasets for better mixed training. Furthermore, we incorporate a simple yet effective multi-objective reweighted optimization strategy into the asymmetric mixture-of-experts (MoE) architecture to facilitate consistent performance and improve robustness across diverse scenes. Extensive experiments show that our framework performs favorably against the state-of-the-art models on our proposed benchmarks and multiple public datasets.

Abstract:
Spike camera is a neuromorphic vision sensor with ultra-high temporal resolution, capable of capturing fast-moving scenes by firing a stream of binary spikes. However, its relatively low spatial resolution limits the acquisition of fine-grained visual details, motivating research on spike camera super resolution (SCSR). Existing SCSR methods typically operate on fixed-length spike sequences, where the accessible information is restricted to a local temporal neighborhood. Moreover, spike fluctuations hinder reliable intensity extraction. Both factors affect the performance of SCSR. To address these issues, we propose a hierarchical recurrent network named Spk2VidNet to reconstruct high-fidelity high resolution image sequences from low resolution spike data. To mitigate fluctuations, Spk2VidNet progressively exploits temporal correlations within spike stream to enhance feature representation by hierarchically enlarging temporal receptive fields. Within the recurrent block, we introduce an alignment module that leverages the motion consistency among multiple frames to jointly estimate and mutually refine inter-frame motions, achieving more accurate temporal alignment. In addition, we develop a fusion module to adaptively integrate neighboring aligned features based on multi-scale similarity for robust feature aggregation. We further propose a segment-wise training with state transfer strategy to efficiently model long-term dependencies with limited GPU memory. Experiments on synthetic and real-captured spike data demonstrate that Spk2VidNet achieves state-of-the-art performance.

Abstract:
Vision-Language-Action (VLA) models are emerging as an important technology in autonomous driving, recognized for their sophisticated reasoning and interpretability. However, traditional VLA models often rely on image-to-text with Chain-of-Thought (CoT) reasoning, which converts sequential visual scenes into textual symbols, thereby under-utilizing spatial context in visual information. Existing autonomous driving systems using VLA models predict only a single sequence of waypoints as a trajectory considering a given command and multiple aspects. However, we suggest evaluating each sequence of waypoints to reveal the importance of the corresponding aspect. We introduce HybridDriveVLA, a VLA model that integrates visual Chain-of-Thought (V-Cot) reasoning and a proposed Tree-of-Thought (ToT)-inspired waypoint evaluation (ToT-evaluation). V-Cot reasoning anticipates future scenes, which serve as goals for ToT-evaluation. The ToT-evaluation generates and scores waypoints based each on safety, progress, and comfort aspects. The highest cumulated score of the waypoints based on the three aspects is optimal. To the best of our knowledge, we are the first to propose a unified method integrating both a CoT and a ToT approach in a VLA model. Experimental results demonstrated that HybridDriveVLA achieved strong performance on comfort, progress, and safety metrics with and average collision rate of 0.17% on the nuScenes benchmark, outperforming traditional VLA models.

Abstract:
Few-shot Class-Incremental Learning (FSCIL) requires learning new classes from very limited data while preventing catastrophic forgetting. Existing methods rely mainly on visual features and are prone to overfitting, while recent vision-language models (VLMs) offer better transferability but suppress fine-grained information due to contrastive feature decorrelation. Moreover, current FSCIL approaches often use static or fully optimizable prompts, making them either rigid or susceptible to semantic drift in incremental sessions. We introduce QR-Prompt, a residual-driven framework that leverages the visual-textual feature residual of VLMs to recover discriminative fine-grained cues missing from the contrastive space. To ensure stability, we propose Discriminative Subspace Quantization (DSQ), which builds a discrete memory of residual subspaces. To enable plasticity, a Hierarchical Prompt Encoder (HPE) and Prompt Composer (PC) transform these discrete codes into continuous, class-adaptive prompts for novel classes. We derive bounds relating DSQ codebook size to generalization and classification margin, and achieve consistent improvements over state-of-the-art FSCIL methods on CUB200, CIFAR100, and miniImageNet. Our results show that residual-based quantization combined with hierarchical prompt composition yields stable and expressive VLM adaptation for FSCIL.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible.To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

Abstract:
Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended inference-time thinking. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can often degrade performance as models progressively lose attention to visual tokens, increasingly relying on textual priors alone. To address this, prior works used reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of inference-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process through re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context yet diverse and globally representative of the image for more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that under fixed inference-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.

Abstract:
Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.Seg improves performance in complex visual scenarios while maintaining strong generalization. Code, data, and models will be publicly released upon acceptance.

Abstract:
Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce deterministic predictions that implicitly average over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaWorld, a generative VFM-based world model that efficiently generates diverse plausible futures. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes the feature difference between consecutive frames into a single continuous "delta" token, reducing video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence. For example, this yields a 1,024x token reduction with 512x512 frames. Delta tokens enable efficient and effective multi-hypothesis training, where many diverse futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Project page: https://deltatok.github.io.

Abstract:
The Vision Transformer (ViT) has surpassed Convolutional Neural Networks (CNNs) in performance, becoming the de facto architecture in modern computer vision. However, despite its superior representational capacity, research on the adversarial robustness of ViTs remains limited, with most studies still biased toward CNN-based models. This work aims to address this architectural bias and conduct an in-depth analysis of the interaction between ViTs and adversarial training (AT).We first show that ViTs can identify semantic components of objects through their class attention maps, indicating that adversarially trained ViTs inherently encode strong semantic priors. Next, using the proposed Gradient Path Masking (GPM) analysis, we examine the internal information flow of ViTs and verify that the residual path serves as a major bottleneck that provides advantageous information to adversaries. Furthermore, our inter-patch relation analysis reveals that adversarially trained ViTs tend to rely more on global than local relationships in early layers--a novel observation suggesting a fundamental potential incompatibility between ViTs and hybrid architectures that inject CNN-style inductive biases.Building upon these findings, we design a simple yet effective two-stage AT scheme to mitigate this structural incompatibility, achieving simultaneous improvements in robustness and generalization across various ViT variants and training methods. The proposed method is compatible with a wide range of AT frameworks and models.

Abstract:
As the third generation of neural networks, Spiking Neural Networks (SNNs) have demonstrated remarkable potential across diverse applications owing to their unique temporal dynamics. In recent years, analyzing the robustness of SNNs from a temporal perspective has become an emerging research focus. However, most existing works examine only the overall temporal behavior of SNNs, typically applying adversarial attacks that rely on time-averaged gradients. In this study, we revisit SNN robustness through the lens of temporal granularity, emphasizing the distinct behaviors that occur at individual time steps. We first introduce a Temporal Granularity Attack (TG-Attack), which selectively perturbs gradients at specific time steps. This approach enables a finer-grained evaluation of SNN robustness across time and demonstrates higher attack success rates than traditional gradient-averaging methods. Furthermore, we theoretically show that the robustness of SNNs at a given time step is determined by the Hessian of the input-output gradient at that step, which we define as Temporal Sensitivity (TS). By calculating the Temporal Sensitivity Value (TSV) for each time step, robustness can be effectively estimated without generating adversarial examples. Finally, we propose a Temporal Granularity Regularization (TG-Reg) term that constrains the TSV across all time steps, thereby improving the model's overall robustness. Experimental evaluations confirm that our framework consistently outperforms existing state-of-the-art methods.

Abstract:
Continual Learning (CL) provides an effective paradigm for acquiring new knowledge, and the principle of learning without retaining past samples has led to exemplar-free CL that better matches practical conditions. However, a key challenge is the semantic shift, which requires reliable activation of past class representations to align with the current feature space. While drift compensation acts as the activator, it commonly assumes uniform semantic distributions and shifts, which is unrealistic for random data streams. For this, we propose the Dual-Estimator (Dual-E) to decouple global and local semantic shifts, addressing both issues of non-uniformity. Specifically, to address intra-task non-uniform semantic distributions that limit effective compensation for low-frequency semantics, Dual-E incorporates a mixture-of-experts estimator comprising multiple networks that model semantic shifts across diverse local representation spaces. For inter-task non-uniformity in semantic shifts, where uniform full-scale compensation potentially overlooks the varying degrees of semantic change across classes, Dual-E employs a low-rank estimator with an embedded low-rank network that prioritizes global semantic trends for classes exhibiting larger shifts. Dual-E leverages analytical solutions to update within a few epochs, enabling efficient plug-in integration with existing exemplar-free methods. Extensive experiments on diverse datasets demonstrate the advantages of Dual-E over state-of-the-art approaches.

Abstract:
We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of >55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

Abstract:
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

Abstract:
Video Frame Interpolation (VFI) is a crucial task in video processing. Flow-based methods, despite their success, are constrained by a fundamental dilemma: forward warping is efficient but prone to artifacts, while backward warping yields higher quality at a significant computational cost, especially for multi-frame interpolation. This trade-off is a major bottleneck. To overcome this, we introduce "One-Shot Flow, Any-Time Frame," a novel framework for Event-based VFI (E-VFI) that achieves both high efficiency and superior quality for arbitrary-time interpolation. Our framework uniquely computes a comprehensive motion trajectory representation in a single pass using a Bidirectional Flow Estimation Block (BiFEB), leveraging the high temporal resolution of event data. Subsequently, our Flow Query (FQ) module can instantly retrieve the bidirectional optical flow for any timestamp, enabling the generation of any number of frames without repeated computation. Finally, a novel Bidirectional Warping (BiW) mechanism intelligently fuses the strengths of both warping directions, effectively mitigating artifacts and producing high-fidelity results. Extensive experiments show that our approach consistently surpasses state-of-the-art E-VFI methods in both reconstruction quality and inference efficiency, representing a substantial advance in efficient and high-quality event-based video interpolation. The code will be released after acceptance.

Abstract:
Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.

Abstract:
Progress in object detection benchmarks is stagnating. It is limited not by architectures but by the inability to distinguish model improvements from label noise. To restore trust in benchmarking the field requires rigorous quantification of annotation consistency to ensure the reliability of evaluation data. However, standard statistical metrics fail to handle the instance correspondence problem inherent to vision tasks. Furthermore, validating new agreement metrics remains circular because no objective ground truth for agreement exists. This forces reliance on unverifiable heuristics. We propose K\alphaLOS, a unified meta-algorithm that generalizes the "Localization First" principle to standardize dataset quality evaluation. By resolving spatial correspondence before assessing agreement, our framework transforms complex spatio-categorical problems into nominal reliability matrices. Unlike prior heuristic implementations, K\alphaLOS employs a principled, data-driven configuration; by statistically calibrating the localization parameters to the inherent agreement distribution, it generalizes to diverse tasks ranging from bounding boxes to volumetric segmentation or pose estimation. This standardization enables granular diagnostics beyond a single score. These include annotator vitality, collaboration clustering, and localization sensitivity. To validate this approach, we introduce a novel and empirically derived noise generator. Where prior validations relied on uniform error assumptions, our controllable testbed models complex and non-isotropic human variability. This provides evidence of the metric's properties and establishes K\alphaLOS as a robust standard for distinguishing signal from noise in modern computer vision benchmarks.

Abstract:
The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model's foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Dataset and code will be released.

Abstract:
We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird's-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.

Abstract:
Autofocus in dynamic environments remains challenging for conventional frame-based sensors, which often fail under fast motion, low light, or high dynamic range conditions. Event cameras, with microsecond temporal resolution and asynchronous brightness detection, offer a promising alternative. However, typical event-based autofocus methods assume that the sharpest focus corresponds to the maximum event rate. In this paper, we reveal a counterintuitive yet consistent phenomenon: the true focus actually corresponds to a local minimum in the event-rate curve. We theoretically derive this behavior from the physics of event generation and show that as defocus blur increases, the event rate first rises and then declines, forming a dual-peak-valley structure across focal distances. Based on this insight, we propose an Event Structural Valley-based Autofocus (ESVA) framework that identifies the valley between two dominant peaks as the true focal position. ESVA integrates structural smoothing, consistency filtering, and a dual-peak constraint to robustly recover the valley under noise and motion disturbances. Extensive experiments on multiple synthetic and real-world datasets demonstrate that ESVA delivers accurate and stable focus estimation, consistently outperforming existing event-only autofocus methods without requiring image reconstruction or supervision.

Abstract:
Diagrams are widely used in daily life. However, offline diagrams typically exist in the form of images, lacking structured data representation, which significantly limits their reusability and editability. Current research mainly focuses on supporting basic query tasks for online diagrams and does not meet the semantic understanding and interaction requirements for complex offline diagrams. Although large language models (LLMs) possess powerful reasoning and knowledge integration capabilities, their performance in processing offline diagrams is unsatisfactory due to the inability to accurately understand the structure and content of offline diagrams. To address these issues,we propose DiagramDiff, a framework consisting of a high-precision diagram reconstruction model and an instance-level diagram element recognition model. The framework converts offline diagrams into standardized data structures, enabling LLMs to transition from being unable to understand offline diagrams to becoming intelligent assistants capable of performing tasks such as semantic reasoning, logical validation, and efficient diagram editing. We have constructed a dataset containing diagrams and their corresponding question and answering(Q&A) and editing tasks. Experiments demonstrate that DiagramDiff achieves state-of-the-art performance in diagram reconstruction and recognition tasks, significantly enhancing LLMs' understanding and interaction capabilities with offline diagrams.

Abstract:
This work studies how latent space structure impacts the performance of Latent Diffusion Models (LDMs). We show that effective generation requires a latent space that is simultaneously locally smooth, enabling stable and reliable reconstruction, and globally dispersive, allowing the model to draw diverse and meaningful samples without collapsing into narrow regions of the latent space. However, existing approaches often emphasize smoothness, which may lead to concentrated latent regions and limited exploration of the broader space. To address these limitations, we propose Self-Organized Representation Learning (SORL), a bottom-up training paradigm inspired by self-organization in complex systems, where global structure emerges naturally from simple local interactions. The critical latent properties of smoothness and maximal dispersity are not explicitly imposed. Instead, SORL promotes these properties through two complementary local mechanisms: local attraction, which encourages coherent reconstructions among nearby latent codes, and local repulsion, which prevents latent codes from collapsing into dense clusters. Through their interaction, SORL induces a latent manifold that maintains both local smoothness and global dispersity, leading to improved reconstruction and generation.

Abstract:
While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge.In this study, we focus on video understanding by MLLMs.This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention have quadratic complexity with sequence length.In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations.A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames.These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding.To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation.Furthermore, our model compresses these features in a way that scales linearly with sequence length.We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks.ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU. Our project page is available at https://remora-v1rcm.kinsta.page/.

Abstract:
Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.

Abstract:
Federated Unlearning (FUL) aims to remove specific participants' data contributions from a trained Federated Learning model, thereby ensuring data privacy and compliance with regulatory requirements. Despite its potential, progress in FUL has been limited due to several challenges, including the cross-client knowledge inaccessibility and high computational and communication costs. To overcome these challenges, we propose Federated On-server Unlearning (FOUL), a novel framework that comprises two key stages. The learning-to-unlearn stage serves as a preparatory learning phase, during which the model identifies and encodes the key features associated with the forget clients. This stage is communication-efficient and establishes the basis for the subsequent unlearning process. Subsequently, on-server knowledge aggregation phase aims to perform the unlearning process at the server without requiring access to client data, thereby preserving both efficiency and privacy. We introduce a new data setting for FUL, which enables a more transparent and rigorous evaluation of unlearning. To highlight the effectiveness of our approach, we propose a novel evaluation metric termed time-to-forget, which measures how quickly the model achieves optimal unlearning performance. Extensive experiments conducted on three datasets under various unlearning scenarios demonstrate that FOUL outperforms the Retraining in FUL. Moreover, FOUL achieves competitive or superior results with significantly reduced time-to-forget, while maintaining low communication and computation costs.

Abstract:
Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints--either egocentric (observer-centered) or allocentric (object-centered).While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene.In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well.By leveraging four key factors--projection, abstraction, bipartition, and localization--SymPL converts allocentric questions into structured symbolic-layout representations.Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains.These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.

Abstract:
High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead.To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.

Abstract:
Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

Abstract:
Full-body motion estimation from sparse VR observations is an inherently under-constrained problem, with only three 6-DoF trackers (HMD and controllers) available to infer a full skeletal pose. To address this ambiguity, we introduce a probabilistic framework that models joint orientations as distributions on SO(3) using the Matrix Fisher distribution. Instead of predicting a single deterministic pose, our network outputs a distribution for each joint, whose mode and concentration directly quantify prediction uncertainty on the rotation manifold. This enables likelihood-based training and principled uncertainty propagation. At the core of our model is a causal Transformer encoder that fuses sparse observations with motion history. We further propose region-wise tokens for the torso, arms, and legs, obtained via attention pooling over local joint features and semantic VR anchors. These tokens guide compact, per-region Fisher regression. To ensure kinematic coherence efficiently, we employ a limb refinement module, where each child joint's Fisher parameters are conditioned on its parent's distribution and the regional context, propagating pose and uncertainty hierarchically. Extensive experiments on standard sparse-VR benchmarks show that our approach achieves state-of-the-art performance, while providing well-calibrated joint-wise uncertainty.

Abstract:
In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.

Abstract:
Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.

Abstract:
Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. Previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them. In this work, we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and diversity.

Abstract:
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit the texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining the compact semantic latent, which is extracted from pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100xfaster convergence than original DiT without guidance. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.

Abstract:
Whole-slide images (WSIs) in computational pathology contain billions of pixels, making end-to-end training of feature extractors and multi-instance learning (MIL) networks infeasible on a single commodity GPU.Existing methods often freeze the feature extractor and train MIL networks on the resulting frozen features, which introduces a semantic gap that limits downstream performance. To address this issue, we propose FBTA, a Feature Bridging and Translation Alignment framework for WSI classification. FBTA is the first end-to-end MIL framework trainable on a single 24\,GB GPU, leveraging three complementary feature-bag views: end-to-end features enable joint optimization, frozen features stabilize training, and translated features support practical inference.Experiments on diverse datasets, including TCGA-NSCLC (Shot20/50/100) and TCGA-STAD, demonstrate the effectiveness and generality of FBTA, which consistently improves performance across three MIL architectures and two extractors. For example, with ResNet-50 as the extractor, FBTA improves the accuracy of the classic ABMIL by 13.1% and 15.8% on the NSCLC-Shot50 and TCGA-STAD datasets, respectively, and further enhances the state-of-the-art MambaMIL by 4.1% and 9.2% on the same datasets. Moreover, FBTA yields additional gains for MIL models that incorporate self-supervised pretraining strategies and data augmentation techniques.These results suggest FBTA is a feasible and scalable framework for end-to-end MIL on gigapixel WSIs. The code will be available.

Abstract:
Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.

Abstract:
Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modality context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

Abstract:
Recently, the community has witnessed significant progress in human modeling from single or multi-view inputs. However, these approaches often rely on guessing the occluded regions through either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies using only sparse multi-view inputs. We propose SMVRT, an end-to-end implicit 3D reconstruction framework for sparse multi-view human modeling. Our key contribution lies in the fusion blocks at three stages of the network. First, local and global features alternative fusion modules are designed to enhance 2D features. Second, attentional fusion is performed on warped multi-view and multi-level 2D features to form 3D feature grid. The feature grid aggregates spatially coherent multi-view features by 3D regularization. Third, attentional 2D-3D feature aggregation generates the enhanced latent embeddings for query points to decode occupancies. Experiments on the THUman 2.0/2.1, MultiGarment and MultiHuman datasets demonstrate that our system significantly outperforms state-of-the-art methods both qualitatively and quantitatively.

Abstract:
Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberate and subtle injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack vectors. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor attack scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, while preserving the visual fidelity of the original model.

Abstract:
Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.

Abstract:
Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on a single type of graph topology representation (GTR) of graphs, such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or overly lengthy responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.

Abstract:
Recent advances in diffusion models and vision-language models (VLMs) have significantly enhanced the controllability of image editing. Methods like FlowEdit enable step-by-step editing along a visible, noise-free trajectory, where each intermediate result is a clear image, eliminating the need for full noise inversion. However, these approaches still operate in pixel space or VAE latent space, where intermediate outputs often suffer from visual artifacts, distortions, or unrealistic details--making reliable semantic evaluation difficult. Furthermore, they remain open-loop systems, applying static edits without feedback to guide or correct the editing process adaptively. We propose UniEdit-I, the first training-free, closed-loop image editing framework that operates entirely within the semantic latent space of a unified VLM by introducing an Understanding-Editing-Verifying (UEV) loop:(1) Understanding: parses the source image and editing instruction into a structured source prompt and a minimal target specification;(2) Editing: applies dynamic semantic offsets, with a configurable feedback weighting mechanism that adaptively modulates editing intensity based on real-time alignment feedback;(3) Verifying: leverages the VLM's own multimodal reasoning capability to evaluate the intermediate output along multiple semantic dimensions and trigger early stopping or refinement. By transforming the VLM from a post-hoc evaluator into an in-process conductor, UniEdit-I establishes the first semantics-driven, self-correcting closed-loop image editing pipeline. Evaluated on GEdit-Bench, UniEdit-I achieves state-of-the-art performance without any fine-tuning or architectural modifications, and even surpasses several large-scale pre-trained editors.

Abstract:
Dyadic interactive head generation aims to synthesize realistic head motions that respond both verbally and non-verbally to an interlocutor in real-time conversation. The existing works often focus on offline scenarios, and struggle with a shallow understanding of the multimodal conversational context while also lacking long-term coherence. To address these limitations, we propose MimicTalker, a novel method for producing real-time, contextually-aware, and long-term consistent interactive head motions. To this end, we propose a Multimodal Interactive Context Extraction (MICE) module to capture both instantaneous and long-term multimodal interactive information from the interlocutor. To enhance in-depth conversational understanding, we propose a Semantic-enhanced Dynamic Interaction (SDI) module to integrate the intentions and topics of the conversation, which are automatically extracted through an LLM-based analyzer. Further, we propose a semantic-guided Motion Style Memory (MSM) mechanism, enabling the long-term motion consistency throughout the conversation. We conduct experiments on both short conversational segments (25 seconds) and extended dialogues (6 minutes), and the comprehensive experiments demonstrate that our method significantly outperforms existing approaches.

Abstract:
Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) Encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) The frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in the video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs a spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs.

Abstract:
The performance of Vision-Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions using incomplete information. Existing missing-aware prompt methods help reduce this degradation, but they still rely on conventional prediction heads (e.g., a Fully-Connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality cases (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information actually present. This adaptive design allows DPL to handle inputs with missing modalities more effectively while remaining fully compatible with existing prompt-based frameworks. Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes demonstrate that DPL outperforms state-of-the-art approaches across all widely used multimodal image-text datasets and various missing cases.

Abstract:
Major state-of-the-art unified tokenizers predominantly adopt pixel reconstruction and feature alignment as pretext tasks, they leave key domains largely unexplored such as architecture, supervised objectives and tasks interaction, potentially resulting in limited performance. We systematically investigate the critical factors of a unified visual tokenizer and propose a novel framework that strengthens synergy between understanding and generation in various aspects. Our initial analysis focus on properties of frontier vision models, confirming inherent conflict in contrastive learning style models for unifying generation and understanding, and demonstrate distinct convergence behavior of codebooks. To address the above bottleneck, we hierarchically decouple the conflicting proxy tasks, enriching the diversity of semantic features supervision to enhance thesemantic and low-level capabilities. Subsequently, we further introduce attention-prioritized mapping strategy, which guides fine-grained generation with powerful semantic prior. Our method achieves rFID of 0.33 and zero-shot accuracy of 80.9% on ImageNet at 256x256 resolution, surpassing VILA-U by 7.6% and outperforms continuous embedding of SigLIP. When applied to discrete unified MLLMs, our 7B model exceeds TokenFlow-13B by 3.1% in understanding and achieve SOTA performance in GenAI-Bench and MJHQ-30K.

Abstract:
To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update it to progressively learn to segment an increasing number of surgical instruments over time. However, prior works constantly overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks respectively.

Abstract:
Learning temporal transformations, that is, how visual objects evolve across frames, is a fundamental challenge in video representation learning. Frame-to-frame dynamics involve complex, non-linear, and non-local changes that go far beyond conventional spatial augmentations. We propose TimeBridge, a self-supervised method that combines the joint embedding for video representation with learning temporal transformations by reconstructing in-between frames from only the start and end frames. This formulation encourages the model to infer the temporal evolution bridging the two endpoints, rather than merely encoding static frame representations. Unlike joint-embedding methods that lack explicit transformation modelling or future-prediction objectives that rely on unconstrained extrapolation, TimeBridge learns concrete frame-to-frame dynamics by promoting temporal consistency. We realise this through cross-concatenated class tokens and lightweight decoders, which recombine features from the start and end frames to reconstruct intermediates. TimeBridge achieves new state-of-the-art performance on multiple dense video prediction benchmarks, including 73.5 J&F on DAVIS 2017 video object segmentation, 47.5 mIoU on VIP semantic part propagation.

Abstract:
We begin by developing a mathematical framework of neural differentiation, formulated at the level of individual neurons. This framework formalizes the principle that each neuron should acquire a distinct representational role within the network, thereby avoiding redundancy and maximizing collective expressivity. Differentiation is quantified through the Neural Differentiation Index (NDI), a loss-aware measure that characterizes neuron significance from geometric, informational, and curvature-based perspectives within a unified framework. The NDI enables a rigorous characterization of how strongly a neuron diverges from its peers in both function and importance, and supports theoretical guarantees: we establish formal bounds on the error increase under NDI-guided elimination, thereby providing provable safety margins for network compression. Building on this foundation, we introduce Neural Differentiation Pruning (NDP) as a practical instantiation. NDP leverages NDI to perform adaptive, training-time neuron sparsification, followed by targeted fine-tuning, guiding networks toward compact yet highly differentiated backbones. Although the terminology draws loose intuition from biological differentiation, the framework is fully mathematical and architecture-agnostic. Experiments on modern vision benchmarks and architectures show that NDP achieves substantial structured sparsity while maintaining--or even improving--accuracy and robustness, underscoring the practical impact of the differentiation framework.

Abstract:
Text-to-motion generation is a fundamental task in computer vision, aiming to synthesize 3D human motion sequences from natural language descriptions. However, due to the limited scale and diversity of existing datasets, models trained to directly map raw text to motion often struggle to generalize to out-of-domain textual inputs. We observe that although high-level motion semantics vary widely, many motions share a common set of underlying atomic motions--that is, simple, reusable body-part movements. Building on this insight, we introduce an Atomic Motion Decomposition and Recomposition framework for open-vocabulary text-to-motion generation. Our approach consists of two key components: a Textual Decomposition module that parses out-of-domain descriptions into atomic motion units, and an Atomic Recomposition module that integrates these units to produce the final motion sequence. Our model achieves a competitive performance on the in-domain HumanML3D dataset, and extensive experiments on two out-of-domain datasets (IDEA400 and Mixamo) demonstrate that our method substantially outperforms state-of-the-art approaches in open-vocabulary motion generation.

Abstract:
Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions.In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from capturing multi-scale scene details. Furthermore, we propose a Global-to-Local Optimization strategy to refine the reconstruction of under-optimized regions resulting from imbalanced view distributions. Experiments across diverse urban scene datasets demonstrate that Urban-GS significantly outperforms the state-of-the-art method in novel-view rendering quality, while simultaneously reducing storage overhead by an average of 41%.

Abstract:
Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.

Abstract:
Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs like Gemini-2.5-Pro (66.58%) outperform open-source and medical MLLMs yet lag far behind physicians (91.35%), while medical MLLMs show wide variance in closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all MLLMs fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard to assess MLLMs in brain imaging analysis, highlighting the gaps against physicians. We publicly release our benchmark and code at link.

Abstract:
Unsupervised domain adaptation (UDA) for videos and 1D time-series data faces significant challenges due to domain shifts in terms of both temporal dynamics and feature distributions. Existing UDA approaches for time-series data often address temporal alignment and uncertainty mitigation as separate objectives, leading to unstable training, noisy pseudo-labels, and incomplete feature transfer. This disjoint treatment fails to capture inter-channel causal dependencies and also overlooks the impact of prediction uncertainty on adaptation quality. This limits the transferability of learned representations and results in suboptimal adaptation. To address the aforementioned limitations, we propose a novel UDA framework, named Causally-Regularized Optimal Transport (in short Causal-OT), that preserves domain-invariant causal mechanisms by embedding causal graph regularization into robust OT alignment process. First we estimate inter-channel causal graphs in both source and target domains and learn a transport plan that not only aligns feature distributions but also improves interpretability and minimizes the discrepancy between causal structures of the Granger graphs. However, pseudo-labeling may still prone to error propagation allowing incorrect target predictions during self-training, degrading the model stability and transfer quality across domains. To mitigate this, we further introduce a causality-aware pseudo-labeling strategy that selects high-confidence target samples based on both entropy and structural consistency with the causal graph of the source domain. This enhances robustness against pseudo-label noise.Extensive experiments on six time-series benchmarks achieving 4.5% gain in accuracy and a 3.8% improvement in F1-score. We conduct experiments on four benchmark video datasets that achieve a 2.5% gain in accuracy.

Abstract:
Weakly supervised video anomaly detection (WS-VAD) aims to localize frame-level anomalies using only video-level labels. This task is typically formulated within a multiple instance learning (MIL) paradigm, where each video is treated as a bag of snippets, achieving robust performance without requiring additional information.However, existing methods often struggle with noisy supervision signals. Normal snippets within abnormal bags are frequently misclassified as anomalies due to inaccurate anomaly scores. These misclassified instances act as noisy samples, introducing false supervision that hinders the learning of true anomaly patterns.In this work, we introduce D^ 2 MIL, a Denoising-Debiasing framework within the Multiple Instance Learning paradigm designed to suppress noise and improve anomaly discrimination. Our approach integrates two key components:(1) Denoising Module: We introduce a dynamic drop rate to adaptively filter out suspected noisy samples during training, based on the observation that noisy samples incur higher training losses. (2) Debiasing Module: We leverage a vision-language model to re-evaluate the discarded samples. This recovers potentially valuable abnormal instances that were mistakenly removed, as they are similar to noisy samples but difficult for the model to recognize. D^ 2 MIL is a general purpose denoising strategy that can be integrated into any MIL-based method. Our extensive experiments on the three benchmark datasets (ShanghaiTech, UCF-Crime, and MSAD) demonstrate that D^ 2 MIL is compatible with diverse MIL frameworks and consistently enhances their performance.

Abstract:
Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.

Abstract:
Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM's segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM's robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success on diverse visual-language tasks. However, fixed-resolution models face challenges in perceiving fine-grained visual details, particularly due to distracted attention and blurry vision. To address these issues, we propose SLoFo, a training-free and self-guided inference framework that mimics the human "Scan-Locate-Focus" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, and the Structure branch estimates visual token uniqueness offering complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crop critical regions. During inference, with additional cropped sub-image, SLoFo applies a progressive visual token pruning strategy to improve attention focus on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.58% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without training or external modules.

Abstract:
Existing eyeglasses removal methods primarily focus on opaque or fully transparent lenses. However, when dealing with semi-transparent sunglasses, these methods often corrupt the visible facial details beneath the lenses, thereby degrading the performance of downstream vision tasks. To address this issue, we propose Diff-SemiER, a novel diffusion-based framework for semi-transparent eyeglasses removal that leverages generative priors and transparency-aware adaptive fusion. The proposed framework fully utilizes the visible eye-region information beneath the lenses while retaining sufficient generative flexibility, thereby striking a balance between generation and restoration within semi-transparent regions. Specifically, Diff-SemiER comprises two diffusion branches: the Generative Prior Diffusion Model (GPDM) generates high-quality eyeglass-free facial images via image inpainting, which provides global semantic guidance for highly occluded scenarios. The Transparency-Aware Adaptive Fusion Diffusion Model (TAFDM) employs a Soft Mask-Aware Adaptive Fusion (SMAF) mechanism to adaptively merge generative and restorative features across multiple scales, enabling dynamic trade-offs between generative capability and fine-detail preservation under varying occlusion levels. Furthermore, we design a transmittance-based data synthesis method to construct a large-scale, high-quality dataset of faces with semi-transparent eyeglasses for model training and evaluation. Extensive experimental results demonstrate that Diff-SemiER significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.

Abstract:
Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields 500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans 110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we introduce a standardized evaluation protocol for road surface irregularities and a stereo-guided depth completion variant that achieves leading performance on CARD. Moreover, we benchmark state-of-the-art depth estimation models to establish strong baselines. We host CARD on Hugging Face with an open source SDK and standardized splits to enable public leaderboards and reproducible evaluation. The CARD dataset is hosted on (https://huggingface.co/CARD-Data).

Abstract:
Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods typically constrained to modeling direct associations, such as Perturbation -> RNA or Perturbation -> Morphology, overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome-phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.

Abstract:
Compound activity modeling is critical for drug discovery, where accurate in silico predictions can significantly reduce reliance on expensive, time-consuming target-specific experimental assays. Traditional machine learning approaches for compound activity modeling typically rely on either chemoproteomics-centric molecular data or phenotype-centric imaging screens, limiting their ability to capture complementary biological signals. While multimodal approaches show promise, they often fail to capture the interplay between molecular mechanisms and cellular responses. In this paper, we present BiGMINT, a Biologically Guided Multimodal framework that hierarchically INTegrates chemoproteomic and high-content imaging (HCI) data, introducing chemoproteomics-guided phenotypic aggregation, task-aware cross-modal fusion, and protein-protein interaction priors for modeling activities. On two large-scale in-house datasets, with 99K and 40K compound-HCI pairs from U2OS and iNeuron, BiGMINT improves mean AUCROC by up to 10.0% and 4.2%, and high-performing task coverage by up to 103% and 56% over best unimodal and multimodal methods. Thorough analysis revealed mechanistic insights, showing these gains stem from modality complementarity, and protein-protein priors enhance modeling of challenging activities.

Abstract:
Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map topologies. We propose metrics based on evaluation subsets that control for geographical proximity and topological similarity between training and validation scenes. We introduce Frechet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: an input-feature overfitting score quantifying the performance drop when geographic cues disappear, and a topology overfitting score measuring degradation as scenes become topologically novel. Beyond models, we analyze dataset biases and contribute topology-aware diagnostics: A minimum-spanning-tree (MST) diversity metric for training sets and a symmetric coverage metric to quantify topological similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that topology-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and topology-centric dataset design for deployable online mapping.

Abstract:
Vision-Language-Action Models (VLAs) have demonstrated significant promise in generalizing to complex, long-horizon robotic manipulation tasks. However, their performance remains brittle, as they are typically trained on trajectory-monotonic, failure-free demonstrations. This reliance on "perfect" data leaves them unable to recover from common execution errors, such as a missed grasp, a dropped object, or an unexpected collision. In this paper, we propose FLARE, a novel framework that endows VLAs with robust error recovery capabilities through a "Retry" and "Reset" paradigm. First, we introduce a "Retry" mechanism by injecting perturbation and bridging segments that decouple robot pose from environment state into demonstrations, enabling the policy to autonomously handle execution deviations. Second, to address critical, state-breaking (OOD) failures, we introduce a "Reset" pipeline. We leverage an MLLM for offline failure analysis to automatically identify OOD states from execution videos. This analysis enables the efficient, targeted collection of a small library of object-centric "Reset" skills, which are trained to restore the environment to a task-valid state. Our full framework integrates these learned policies. At inference, an online MLLM monitor arbitrates between task execution and "Reset" skills. Experiments on challenging, contact-rich manipulation tasks show our approach significantly improves task success and robustness.

Abstract:
Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies.To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the "last mile" problem in zero-shot navigation -- determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.

Abstract:
Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.

Abstract:
Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

Abstract:
Mamba-based models have emerged as a promising paradigm for medical image segmentation due to their linear-complexity state-space modeling. However, their effectiveness is still limited by the mismatch between anatomical geometry and tissue-specific semantics, as well as by spatially entangled diagnostic cues. To address these limitations, we propose GeoSemba, a Mamba-based segmentation framework that jointly models cross-level geometric-semantic interactions and cross-dimensional spatial-channel dependencies within a single scan. To this end, GeoSemba comprises two complementary components. The Semantic-guided State Refiner (SSR) derives semantically discriminative region representatives and leverages geometry-conditioned inter-region dependencies to enable coherent semantic propagation across structurally related regions. The Cross-dimensional Affinity Refiner (CAR) adopts a coarse-to-fine strategy of macro-perception and micro-focus to selectively enhance informative spatial-channel interactions while suppressing weak and noisy correlations. Extensive experiments on benchmark datasets spanning six medical imaging modalities show that GeoSemba consistently delivers superior segmentation accuracy while maintaining high computational efficiency.

Abstract:
Zero-shot 3D Anomaly Detection (ZS3DAD) is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In the stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In the stage 2, we introduce Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection and segmentation. Code will be released upon acceptance.

Abstract:
Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation difference. Considering abundant and hierarchical information in VLM's layers from low-level details to high-level semantics, we propose ParaUni. It extracts features from variants VLM's layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in Unified multimodal model. Concretely, visual features from all VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) to facilitate multiple reward improvements that aligns the hierarchical properties of these layers using RL. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages.

Abstract:
Federated Prototype Learning (FedCL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedCL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedCL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.

Abstract:
Zero-Shot Temporal Action Localization (ZS-TAL) aims to classify and localize actions in untrimmed videos that are unseen during training. Existing training-based ZS-TAL methods typically rely on fine-tuning models on large-scale annotated training data. This can be impractical in real-world applications and damage its generalization. As a result, Training-Free ZS-TAL has gained attention, which directly leveraging Vision-Language Models (VLM) enables action localization without any additional training. However, current techniques perform test-time adaptation independently on each video, neglecting the potential benefit of accumulating knowledge from historical test videos. To address this, we propose a learnable lookup table (LLT) framework. During testing, we continuously update lookup table by incorporating high-confidence, diverse lookup candidates to construct action-positive lookup item. Additionally, we introduce a learnable residual module to adapt the corresponding lookup item to the current video context features. Finally, we employ refined activation scores to select accurate video frames and further adjust the text prototypes. This simple yet effective text-visual collaboration enables training-free ZS-TAL to harness historical videos. Extensive experiments show our method significantly outperforms state-of-the-art zero-shot VLM baselines, validating the effectiveness of our framework.

Abstract:
Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.

Abstract:
Most event-based algorithms typically split the event stream into fixed groups (e.g., fixed time or fixed count) for downstream processing, lacking adaptivity to scene dynamics. Several adaptive partitioning strategies have been proposed, but they are unable to cope well with heterogeneous velocity scenarios (HVS) involving both fast- and slow-moving objects. To address this issue, we propose Adaptive Spatial-Temporal Window (ASTW) strategy, which simultaneously achieves temporal adaptivity and spatial locality in event partitioning. Based on the principle of maximum entropy, we derive a patch-level time window determination criterion and efficiently implement it based on event density and vectorized calculations. Experiments on publicly available event-based object detection and tracking datasets demonstrate that ASTW significantly outperforms existing state-of-the-art partitioning strategies. We also construct HetVel, the first RGB-event dual-modality dataset for HVS, and further highlight the advantages of ASTW on this challenging benchmark. We believe that our ASTW strategy and the constructed HetVel dataset will advance the field of neuromorphic vision.

Abstract:
We introduce a fundamentally new paradigm in video sensing, 1-bit computational video, that redefines the limits of imaging efficiency and performance. Instead of the conventional high-bit-depth capture, we show that one bit measurements captured by time-varying thresholding can be used to reconstruct full-bit-depth videos, eliminating the need for power-hungry, high-precision analog-to-digital conversion (ADC) at the sensor as well as reducing the energy consumption in data transmission. We propose thresholding strategies to effectively capture spatiotemporal dependencies in video streams. Despite the significant data compression at acquisition, we recover full-bit-depth videos with high fidelity through neural video reconstruction. Our method unlocks significant gains in memory efficiency, power savings, and data throughput reduction at the sensor, making it ideal for imaging systems with ultra-low-power requirements or high-speed video capture. We validate our framework on video recovery from simulated 1-bit measurements. Our work redefines the camera pipeline, potentially paving the way for gigapixel, kilohertz imaging systems on low-power sensor hardware.

Abstract:
Model merging offers a promising paradigm for consolidating multiple expert models into a single multitask architecture. However, its effectiveness is often hindered by task interference, where conflicting parameter updates from different tasks degrade performance. While dynamic, Mixture-of-Experts based methods have improved adaptability, they are fundamentally limited by constructing their expert pools from task vectors in isolation, failing to resolve underlying structural conflicts across tasks. In this paper, we introduce DuetMerging, a novel framework that synergistically mitigates task interference from both dynamic and static perspectives. Dynamically, we apply Tucker decomposition to a unified tensor of task vectors, creating a harmonized expert pool derived from a shared core tensor that structurally enhances synergies and suppresses conflicts. Statically, we introduce a neuron-based sparsification technique that leverages task-specific neuron activation patterns to create a precise mask. This allows us to selectively preserve critical information from the decomposition's residual while suppressing functionally irrelevant or conflicting parameters. Comprehensive experiments demonstrate that DuetMerging outperforms existing methods, establishing a new state-of-the-art in both task performance and parameter efficiency.

Abstract:
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks.We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring.This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models.We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification).ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks.Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

Abstract:
Unified interpreting and synthesizing human behaviors within 3D environments is vital for advancing spatial intelligence and humanoid robotics. Despite recent advancements (e.g., HSI-GPT), two fundamental capabilities expected of a unified model--understanding and generation--still lag behind specialist models. This is primarily due to 1) single-granularity codebook overemphasizes low-frequency motion details while neglecting motion semantics, 2) limited decoding capacity of the motion detokenizer which restricts the fidelity of human-scene interactions, 3) only relying on supervised fine-tuning (SFT) failing to capture high-level motion semantics and logical reasoning with an end-to-end mapping. To this issue, we develop HSI-GPT2--a reasoning-enhanced, dual-granularity motion-representational large Scene-Motion-Language model, powered by reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning. First, HSI-GPT2 introduces a Dual-granularity Motion Tokenizer, DMoTok, which jointly preserves both fine-grained motion details and text-aligned motion semantics for various HSI-related tasks. Further, a motion diffusion decoder functions as a motion detokenizer, translating deep semantics and detailed features of LLMs into physically grounded human motions. Finally, we curate a Motion Chain-of-Thought (MoCoT) data engine and extend a Group Relative Policy Optimization (GRPO) paradigm to execute long-horizon and compositionallyrich commands. Results on standard HSI benchmarks confirm the clear superiority of HSI-GPT2 in enhancing interaction quality, semantic alignment, behavioral diversity, and generalization to unseen 3D scenes.

Abstract:
Procedural sequence generation aims to create intermediate images through multi-step processes, which is applied in industrial design, educational tutorial, and book illustration. However, existing methods often focus on a specific domain or initialize several expert networks for different domains, which face three challenges. First, the poor generalization to unseen domains. Second, the parameter redundancy due to multiple expert networks.Third, the difficulty in adaptively determining the number of generation steps for different processes.To address these challenges, we propose ProcessMaker, a novel framework that harnesses the inherent generalization capabilities in Diffusion Transformers (DiTs) for procedural sequence generation. Concretely, we introduce three key innovations: (1) Self-supervised Representation Alignment to explore the generalized ability for unseen processes. (2) Sparse Masks for different domains without additional expert networks. (3) A sliding window strategy, which dynamically accommodates the generation steps based on the process complexity. Extensive experiments validate that our ProcessMaker achieves procedural sequence generation with generalization ability and adaptive steps, while using only 7.3% trainable parameters compared with the state-of-the-art method.

Abstract:
While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose a Cognition-Inspired Meta-Action Framework (CINEMA), which decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, explicitly modeling the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 56k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.

Abstract:
6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation for unseen objects that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric-depth supervision to reduce floaters and appearance overfitting under sparse reference views. Next, in the pose estimation stage, we apply a two-stage learning-guided ICP initializer that exploits geometric features to obtain a stable initial pose. Finally, we introduce a 3DGS-based iterative pose refiner that aligns rendered and query images in both appearance and geometry, further improving pose estimation accuracy. Experiments on LINEMOD, GenMOP, and our real-world datasets show that PoseGaussian achieves significant improvements over baseline methods under model-free and sparse-view settings, demonstrating strong generalization to unseen objects and robustness to view sparsity.

Abstract:
Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human-computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs. In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. Together, these modules introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy. Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, ResiHMR improves reconstruction quality under both SMPLify-X and HSMR backbones, reducing intact-joint 2D MPJPE from 41.32 to 37.40 with SMPLify-X and residual-limb 2D MPJPE from 73.61 to 23.19 with HSMR.

Abstract:
Visual Autoregressive (VAR) modeling introduces a new paradigm for image generation by extending autoregressive mechanisms from next-token prediction to next-scale prediction, achieving remarkable performance. However, as the number of tokens increases rapidly with scale, processing full token maps at high resolution becomes computationally expensive. In addition, the inherently sequential nature of autoregressive modeling prevents parallel inference across scales, which further increases latency.To address these challenges, we propose LazyVAR, a training-free and plug-and-play acceleration method for VAR models. Our key observation is that the similarity of aggregated latent features between adjacent scales progressively increases with the scale index, reaching particularly higher values at larger scales. We treat this similarity as a Scale-wise Update Index, which serves as the pruning criterion. Consequently, more tokens can be pruned at larger scales to improve efficiency. Furthermore, we propose Parallel Group Decoding, which leverages this high similarity at larger scales to decode tokens from different scales in parallel, further accelerating inference. Experimental results show that the proposed LazyVAR achieves up to a 2.94x speedup of recent VAR models with negligible performance degradation, allowing the Infinity-2B text-to-image model to generate 1024x1024 resolution images within 0.5 seconds on a single RTX 4090 GPU.

Abstract:
Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.

Abstract:
Infrared pedestrian detection is crucial in safety-critical systems but remains vulnerable to adversarial attacks. Existing physical attacks often rely on fixed, static patterns. However, they often lack robustness across scales, as their hand-crafted or uniformly generated structures are fundamentally limited by a fixed receptive field and fail to adapt to varying distances and scene contexts. In light of this, we propose AdvFractal, a black-box attack that exploits the innate self-similarity and structural richness of fractal geometry to naturally generate multi-scale, physically realizable adversarial perturbations. By modeling perturbations with H-type fractals and optimizing parameters via Particle Swarm Optimization, AdvFractal seamlessly coordinates attacks across scales, progressively disrupting detector features from local textures to global shapes. Experiments show AdvFractal achieves an attack success rate (ASR) of 97.54% in the physical domain and 99.16% cross-dataset, significantly outperforming state-of-the-art methods. The perturbations are highly effective in the infrared spectrum while remaining stealthy in visible light, offering a novel approach for evaluating and understanding the security of infrared detection systems.

Abstract:
Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems.

Abstract:
Chain-of-Thought (CoT) is a key technique for enhancing the reasoning capabilities of Vision Language Models (VLMs). Existing methods often employ Reinforcement Learning (RL) with external constraints to align the model's reasoning process with human cognitive patterns. However, we argue that the model's intrinsic reasoning paths may differ from human cognition, and that forcing such alignment can constrain the model's potential and even degrade its performance. To address this, we propose leveraging the model's intrinsic self-evaluation to guide its optimization. We hypothesize that a model's self-generated confidence scores are effective indicators of its reasoning quality. Based on this evaluation metric, we design two novel reward functions: (1) Sequential Confidence Rigorous Evaluation (SCRE) for challenging problems that demand strict logical reasoning, and (2) Intra-group Score Re-ranking (IGSR) for general-purpose, open-ended scenarios. We name our method Video-RAISE (Reasoning Alignment through Intrinsic Self-Evaluation). Comprehensive experiments on six video understanding benchmarks demonstrate that Video-RAISE achieves state-of-the-art (SOTA) performance, significantly outperforming previous methods and even proprietary models, e,g. GPT-4o and Gemini 1.5 pro. For instance, on the VideoMMMU benchmark, our Video-RAISE achieves a new SOTA accuracy of 52.8%, outperforming the previous best model by a significant 3.0%. In addition, our method achieves a reasoning path consistency of 90%, which is double that of the Qwen2.5-VL-Instruct and even surpasses the performance of supervised fine-tuning.

Abstract:
Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation.

Abstract:
A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.

Abstract:
Recognizing visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing such implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with less than 5 initial data samples of each class.

Abstract:
Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method achieves strong results on the FPR95 metric--critical for safety-sensitive applications--across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.

Abstract:
Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.

Abstract:
Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant.To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG.Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives--format, refuse-IoU, explain, and query correction--to improve both relevance discrimination and fine-grained semantic reasoning.In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers.We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models.

Abstract:
Recent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition.To overcome this limitation, we propose Neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy that selects mismatch primitives for pruning and cloning based on frequency energy.Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NerRF360 and high-frequency surface datasets (e.g., checkered patterns), supported by comprehensive ablation studies.

Abstract:
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.

Abstract:
Vision-Language models (e.g., CLIP) facilitate the development of open-vocabulary camouflaged object segmentation (OVCOS), but existing methods still rely on mask annotations for fully-supervised training. In contrast, the training-free paradigm can rapidly process unseen data, representing a highly promising solution. However, in camouflage scenarios, existing training-free methods utilize sparse textual prompts and ignore the category similarity between visual patches, leading to inadequate object binding capability. To alleviate these issues, we propose a fine-grained object binding and adaptive hybrid prompt framework for training-free OVCOS. The framework first employs multimodal large language models (MLLMs) to explicitly model fine-grained textual descriptions of camouflaged objects and background. Building on this, we construct a semantic probe to decouple object and background features and explicitly model category similarity between visual patches via semantic consistency ranking, thereby achieving accurate object binding. Subsequently, we propose an entropy-guided text embedding adjustment strategy to adjust textual embeddings, aiming to further enhance fine-grained object binding. Finally, we utilize an adaptive hybrid prompt generation strategy to generate hybrid prompts, assisting SAM in accurately segmenting camouflaged objects. Experimental results on the OVCamo benchmark demonstrate that our method achieves excellent performance, significantly surpassing the advanced training-free ResCLIP.

Abstract:
Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.

Abstract:
HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.

Abstract:
Capturing scenes with both high dynamic range (HDR) and high-speed motion remains challenging for conventional cameras. Existing alternating-exposure approaches exacerbate temporal resolution loss, making them unsuitable for high-speed scenes. Consequently, current solutions typically compromise either spatial resolution through fixed spatial-varying attenuation levels or employ multi-sensor configurations to maintain temporal resolution. In this paper, we leverage an ultra-high speed spike camera to enable spatial and temporal attenuation of incident light, thereby reconstructing high frame rate (HFR) and HDR video with a single sensor. We achieve this by placing a rapidly rotating spoke-pattern neutral density (SpokeND) filter in front of the spike camera, enabling each pixel to periodically capture multi-attenuated spikes while maintaining full spatial resolution. Building on these multi-attenuated spikes, we propose ReST-Net, which comprises the ReGain and ReFine modules. The ReGain module reconstructs spatially consistent frames by learning to recover relative gain from the multi-attenuated spikes, and the ReFine module removes temporal fluctuations to produce temporally consistent HDR videos. Extensive experiments on synthetic and real-world data demonstrate that our method can reconstruct HDR video at up to 2000 FPS.

Abstract:
Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing.

Abstract:
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters--matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.

Abstract:
Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods--such as deep ensembles and MC dropout--are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks, under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines, while preserving accuracy and adding only minimal computational overhead.

Abstract:
The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges.We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss to constrain object motion, effectively mitigating overfitting. To ensure robust reconstruction, we employ an off-the-shelf model for depth correction and apply noise regularization terms in the final stage. We demonstrate robust results on both new synthetic and real-world datasets, highlighting our framework's ability to provide a simplified, event-only solution for high-fidelity 4D reconstruction in dynamic scenes.

Abstract:
Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking.Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates--bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light.To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions.Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. We will release the code and dataset upon acceptance to support future research.

Abstract:
Embodied Question Answering (EQA) requires agents to navigate 3D environments, accumulate visual evidence, and reason over partial observations to answer questions. However, current agents struggle to maintain coherent, long-horizon behavior: planning remains reactive, causing inconsistent actions, while monolithic memories entangle all observations, hindering retrieval of the sparse but crucial evidence. We address these issues by reframing EQA through the lens of predictive processing, in which coherent behavior emerges from a prediction-correction loop grounded in stable priors. Guided by this perspective, we propose Predict Before You Explore (Pred-EQA), an architecture that integrates predictive planning with specialized memory. A high-level planner predicts where question-relevant evidence is likely to appear and generates a compact set of actionable exploration branches encoding long-horizon intent. A low-level executor then reduces uncertainty within these branches, revising predictions when they fail. A dual-memory system complements this process by separating slowly evolving structural priors from compact, question-relevant visual evidence, enabling consistent planning and efficient evidence accumulation. Through this prediction-guided exploration, Pred-EQA achieves coherent trajectories under partial observability. Experiments on OpenEQA and Express-Bench show that Pred-EQA achieves state-of-the-art results in both accuracy and exploration efficiency, demonstrating the benefits of prediction-driven embodied reasoning.

Abstract:
Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.

Abstract:
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning.However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec).Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

Abstract:
The rapid evolution of deepfake technologies demands robust and reliable face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery clues is also important for enhancing model explainability and building user trust. To address this dual challenge, we introduce DiffusionFF, a diffusion-based framework that simultaneously performs face forgery detection and fine-grained artifact localization. Our key idea is to establish a novel encoder-decoder architecture: a pretrained forgery detector serves as a powerful "artifact encoder", and a denoising diffusion model is repurposed as an "artifact decoder". Conditioned on multi-scale forgery-related features extracted by the encoder, the decoder progressively synthesizes a detailed artifact localization map. We then fuse this fine-grained localization map with high-level semantic features from the forgery detector, leading to substantial improvements in detection capability. Extensive experiments show that DiffusionFF achieves state-of-the-art (SOTA) performance across multiple benchmarks, underscoring its superior effectiveness and explainability.

Abstract:
Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements---with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively---without increasing inference complexity, since the IVT network is used only during training.

Abstract:
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub.

Abstract:
Vision-and-Language Navigation in Continuous Environment (VLN-CE) requires an agent to follow language instructions to navigate the target destination. With the advancement of large language models (LLMs), recent efforts have explored adapting them for zero-shot VLN-CE, offering a promising solution in addressing the drawbacks of poor generalization in the training-based paradigm. However, existing LLM-based works primarily perform naive reasoning for decision-making and lack feedback, e.g., reviewing historical errors and predicting future potentials. Consequently, it may suffer from continuous failure for those initial error tasks. In this paper, we rethink LLM-based zero-shot VLN-CE and propose a new paradigm, named EvoNav, to improve the agent's decision-making with future thought and history experience via Future Chain-of-Thought (F-CoT) and History Chain-of-Experience (H-CoE). F-CoT predicts future actions and landmarks as thoughts to assist navigation progress estimation and direction selection, while H-CoE summarizes historical trajectories and scenes as experience to improve navigation decision reliability. Both F-CoT and H-CoE cooperatively evolve the agent's decision-making. Extensive experiments in both the simulator and real-world environments demonstrate the effectiveness of our EvoNav.

Abstract:
Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic and context-aware architectures can facilitate a far more dangerous and emergent threat.This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a formidable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER effectively addresses the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.

Abstract:
Recent advances in fMRI pretraining have significantly improved visual decoding accuracy by leveraging cross-subject neuroimaging datasets. A prevailing strategy aligns individual fMRI signals into a shared feature space using subject-specific adapters, followed by a shared decoder. However, this unstructured feature space overlooks the redundancy and functional correlations among voxels and fails to incorporate the brain's intrinsic functional architecture centered on regions of interest (ROIs). To address these limitations, we propose ROITok, an ROI-guided fMRI pretraining framework. Our method introduces Sparse ROI Context Fusion to learn ROI-level visual representations and captures functional synergy between ROIs from cross-subject data. Inspired by Matryoshka Representation Learning (MRL), we design an embedding compression scheme that prioritizes the most informative visual components first, with later tokens adding progressively finer but still useful details. ROITok achieves strong transfer learning performance on the NSD and GOD datasets and shows strong resilience against high-level additive noises, while offering better interpretability. It allows for quantitative assessment of each brain region's contribution to decoding tasks. Our analysis shows that ROI-based pretraining can automatically learn the brain's visual hierarchy. Different ROIs can provide complementary contexts for decoding tasks; combining them improves decoding robustness.

Abstract:
Referring video object segmentation (RVOS) aims to segment objects within a video according to natural language expressions. Unlike earlier works focusing on static single-object scenarios, recent studies address more complex motion scenes. Previous methods typically adopt a query-based, logically multi-stage pipeline to handle these scenarios. However, this paradigm learns trajectory consistency modeling and multimodal fusion from scratch, which often leads to trajectory inconsistencies and insufficient multimodal understanding. To address these limitations, we propose DeRVOS, a framework that decouples RVOS into two key branches: consistent trajectory generation and multimodal understanding. We extract temporally consistent object representations using a powerful pretrained instance trajectory generation model and perform cross-modal alignment via a unified multimodal encoder, enabling upstream modeling of trajectory consistency and vision-language understanding. This design reduces RVOS to the task of modeling the relationship between referring expressions and instance trajectories. To connect the two branches and enable efficient motion-aware semantic understanding, we introduce the Trajectory Alignment and Implicit Selection (TAIS) module, which progressively performs cross-frame multimodal alignment and motion-guided implicit trajectory selection. Extensive experiments demonstrate that DeRVOS achieves state-of-the-art results on both traditional RVOS benchmarks and the challenging MeViS dataset, surpassing LVLM-based methods by 4.7%.

Abstract:
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global query and localized query. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.

Abstract:
Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, remaining beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (Average 4.48), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline. It involves heuristically constructing outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.

Abstract:
Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network, and encompasses tasks like Parameter-Efficient FineTuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt to new settings. However, PEFT methods often can still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource constrained deployments like edge devices, where they take a larger portion of the compressed model's parameters. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (\method ), a general approach that can address multiple PR tasks within the same framework, which can enable seamless integration. It accomplishes this by using a factorization process that decomposes pretrained weights into basis matrices and their component projections. Sharing these basis matrices across layers and adjusting its size enables us to perform MC, whereas the small size of the projection weights (fewer than 200 in some experiments) enables \method support PEFT. Experiments on ViT models show \method outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state-of-the-art in PEFT by 1.5% and PEFT+MC combinations by almost 1%.

Abstract:
Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

Abstract:
3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.

Abstract:
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.

Abstract:
Although the temporal spike dynamics of spiking neural networks (SNNs) enable low-power temporal capture capabilities, they also incur inherent inconsistencies that severely compromise representation. In this paper, we perform dual consistency optimization via Stable Spike to mitigate this problem, thereby improving the recognition performance of SNNs. With the hardware-friendly "AND" bit operation, we efficiently decouple the stable spike skeleton from the multi-timestep spike maps, which captures critical semantics and reduces the inconsistency from variable noise spikes. Enforcing the unstable spike maps to converge to the stable spike skeleton significantly improves the inherent consistency across timesteps. Furthermore, we inject amplitude-aware spike noise into the stable spike skeleton to diversify the representations while preserving consistent semantics. The SNN is encouraged to produce perturbation-consistent predictions, thereby contributing to generalization. Extensive experiments across multiple architectures and datasets validate the effectiveness and versatility of our method. In particular, our method significantly advances neuromorphic object recognition under ultra-low latency, improving accuracy by up to 8.33%. This will help unlock the full power consumption and speed potential of SNNs.

Abstract:
We present the first scalable framework for training sound source localization (SSL) models using synthetic data from text-to-X models. Although SSL has made notable progress, existing models remain constrained by limited-scale, uncurated real-world datasets that often suffer from semantic misalignment. Furthermore, the introduction of new SSL tasks and benchmarks has increased the need for more generalizable models. To address these challenges, we leverage synthetic data to create synthetic clones of the VGGSound dataset, enabling both fully synthetic and hybrid real-synthetic training. We demonstrate that synthetic data can effectively replace, refine, and scale real training datasets. Extensive experiments across multiple benchmarks show that synthetic data not only matches real data in performance but also enables significant improvements when combined with real samples. Our findings provide the first systematic evidence that synthetic data can serve as a scalable and effective approach for advancing SSL models.

Abstract:
Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions.To address these challenges, we introduce a novel Semantic Transition and Transportation in Collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score.Extensive experimentsdemonstrate that our method can be general, effective, and beneficial for many CIR tasks.The code is attached in the supplementary material.

Abstract:
3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.

Abstract:
Reliable long-horizon trajectory prediction requires both high positional accuracy and physically plausible temporal motion consistency. However, existing methods suffer from two fundamental limitations. First, they overlook the inherent difference in prediction logic: near-future trajectories are primarily governed by historical dynamics, whereas distant-future behaviors are driven by high-level semantic context. Yet, most methods employ a unified decoding pathway that blurs the temporal distinction.Second, although the near future is relatively easier to predict, existing methods lack mechanisms for coherent trajectory propagation across time horizons, often resulting in kinematically implausible predictions with inconsistent heading evolution and degraded long-horizon performance. To address these challenges, we propose NDPNet, a dual-stage architecture that decouples near- and distant-horizon modeling into specialized pathways, with a dedicated transition module ensuring smooth temporal bridging. Furthermore, we introduce a novel motion-aware coherence loss that explicitly embeds kinematic priors to enforce trajectory consistency. Extensive experiments show that NDPNet achieves SOTA performance on Argoverse 2 and WOMD. Notably, on WOMD, it ranks 1^ \text st in both minFDE _6 and minADE _6 across all standard horizons (3s, 5s, 8s) without ensemble learning or NMS post-processing, and is the first to achieve sub-1.75 minFDE _6 for 8s prediction, surpassing prior methods by a large margin. The code will be released subsequently.

Abstract:
Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement.Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are attached in the supplementary.

Abstract:
Recent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism to eliminate viewpoint redundant rendering stages. Furthermore, it refactors rasterization tasks with a dedicated kernel to mitigate tile load imbalance, significantly boosting GPU utilization. Extensive experiments demonstrate that CaT-GS achieves a speedup of up to 10 times over the original 3DGS and up to 70% over previous state-of-the-art methods, establishing a new benchmark for high-fidelity, real-time rendering of large-scale scenes.

Abstract:
In unsupervised cross-modal hashing, real world multimodal data often exhibit partial alignment and semantic ambiguity. Dominant modalities can easily bias the fusion process, while semantically related samples may be mistakenly treated as negatives in contrastive learning, leading to unstable optimization. To address these issues, we propose Unsupervised Weighted Masked Contrastive Hashing (UWMCH). UWMCH introduces masking before multimodal fusion to construct partially observed interactions, encouraging the model to learn complementary semantics and reducing over-reliance on dominant modality cues. We further develop a semantic affinity guided weighted contrastive objective to reduce the influence of false negatives by combining instance level consistency with a cluster consensus prior. In addition, the global and local semantic geometries of the fused space are stabilized via Cluster-Centroid Agreement (CCA) and Semantic Structure Regularization (SSR). Extensive experiments on three benchmark datasets demonstrate the effectiveness and robustness of the proposed method.

Abstract:
Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and inpainting only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.

Abstract:
Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused by imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or \phi-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new \phi-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable \phi-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed \phi-DPO achieves State-of-the-Art performance across multiple benchmarks, outperforming prior continual learning methods of LMMs.

Abstract:
Recent advancements in reasoning language models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual and video reasoning, the lack of transparent and reproducible data curation and training pipelines remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874k-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74k-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 9.5% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.

Abstract:
Recent advancements in deep neural networks (DNNs) have significantly improved visual quality of camera captures under low-light (<10lx) conditions, yet visual quality in extreme low-light (<1lx) remains inadequate. Existing DNN models are computationally intensive and suffer from large processing times, making them impractical for real-time enhancement of high-resolution video. Consequently, Ultra HD (UHD) videos (4K/8K) captured in extreme low-light environments exhibit elevated noise and diminished detail. Developing DNN-based solutions for UHD video enhancement faces challenges including paired dataset creation, temporal consistency, and efficient deployment under strict latency (<33ms) and power constraints (<250mA for 30fps video).We present a comprehensive methodology for developing a real-time raw to raw denoising solution for UHD video in extreme low-light, designed for seamless integration into existing ISP pipelines. Unlike ISP-replacement approaches, our solution enhances commercial camera stacks across sensor platforms. Our framework comprises: (1) Diverse dataset creation methodology; (2) A low-complexity model architecture optimized for mobile compute elements; (3) Efficient training and post-training optimizations (reparameterization, restructuring, quantization) to meet latency constraints while ensuring high-quality output. The result is a power-efficient real-time raw to raw video denoiser that improves extreme low-light video quality while preserving downstream ISP behavior.

Abstract:
Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning.PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

Abstract:
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audio-visual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.

Abstract:
Infrared and visible image fusion (IVIF) aims to synthesise complementary information from the two source modalities while preserving natural textures and salient thermal signatures simultaneously. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting the generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100xlarger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies.

Abstract:
Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. The dataset and codes will be released after acceptance.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation--enabling more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones-Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B--spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.

Abstract:
Cryo-electron tomography (cryo-ET) enables in-situ visualization of cellular ultrastructure, but reconstructions are severely degraded by extremely low SNR and missing-wedge artifacts due to dose limits and restricted tilt angles. Existing learning-based approaches are further constrained by inaccurate noise modeling and the lack of reliable ground truth, limiting restoration quality. We propose cryoDeRec, a multi-task framework that jointly performs denoising and missing wedge recovery in a fully supervised manner. Our key contribution is a dual-objective training strategy that leverages synthetic noisy tomograms and their corresponding clean tomograms, encouraging structural fidelity while recovering missing information. To support training, we introduce an imaging simulation pipeline that captures authentic noise distributions and incorporates isotropic structural priors by simulating tomograms from real EMDB structures. Experiments on four realistic cryo-ET datasets and two extremely low-SNR simulated datasets (all reconstructed via WBP) show that cryoDeRec restores high-quality tomograms directly from raw inputs without preprocessing, consistently outperforming prior state of the art. Our findings show that training on a comprehensive simulated dataset, which captures realistic noise and structure, enables models to generalize effectively to real cryo-ET tomograms.

Abstract:
Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture.This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability.SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI - CT) and 78.51% (CT - MRI) on cardiac data, and 87.48% (MRI - CT) and 86.89% (CT - MRI) on abdominal data.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual rotary positional encoding space, while augmenting it with a jointly trained and sampled 3DGS decoder; and (3) a surrogate task that enhances feed-forward editing capabilities. Extensive experiments demonstrate that MLLMSplat delivers state-of-the-art performance across 3DGS understanding, generation, and editing.

Abstract:
Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.

Abstract:
We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as a canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhance identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency, detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet processing long visual token sequences remains computationally expensive. Existing approaches mitigate this cost by reducing image tokens, either by discarding them after the visual encoder or by pruning them in the early Transformer layers of the LLM. While these strategies improve efficiency, they inevitably discard informative visual content and risk degrading downstream reasoning performance. To address this challenge, we introduce Hi-Lo Prune, a training-free pruning strategy that is built on a simple principle: look at what you will lose before you remove it. Instead of directly dropping tokens, Hi-Lo Prune first identifies which tokens are safe to prune through a coarse-to-fine selection process, and then encourages the model to absorb their information before pruning occurs. Specifically, the framework consists of three stages: (1) Hierarchical Pruning Token Selection. After visual encoding, we apply a coarse-to-fine process that identifies tokens to retain and selects a critical set of pruning candidates from redundant ones. (2) Attention-Guided Candidate Token Merge. Before removing selected tokens, an attention mechanism is applied to the early LLM layers, which explicitly transfers information from these candidates to the retained tokens. (3) Low-informative Candidate Token Removal. At a designated Transformer layer, the pruned tokens are removed, reducing computation for all subsequent layers. This design enables aggressive early-layer pruning while preserving critical visual cues. Experiments on Qwen2-VL, Qwen2.5-VL, and Qwen3-VL demonstrate that Hi-Lo Prune consistently outperforms existing pruning methods across multiple benchmarks, achieving strong performance even under high pruning ratios without any fine-tuning.

Abstract:
Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layerwise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixelwise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state-of-the-art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

Abstract:
One-shot Federated Learning (OSFL) has emerged as a promising paradigm to mitigate the high communication overhead of traditional federated learning. However, its effectiveness is often hindered by data heterogeneity across clients. While recent methods leverage pre-trained diffusion models to generate data for OSFL, they often struggle with some practical limitations, including a lack of semantic fidelity in capturing the fine-grained characteristics of local data, and insufficient diversity in the generated data, which collectively degrade the performance of the global model. To address these challenges, we propose Espresso, a novel framework that enhances both the fidelity and diversity of synthetic data in OSFL. Espresso consists of two main components: (1) Fine-Grained Condition Learning, which learns fine-grained conditional embeddings to improve semantic fidelity and diversity by modeling intra-category patterns, and (2) Semantics-Preserving Sampling, which diversifies the generated data by modeling the distribution of latent noises and applying a self-reflection sampling strategy. Extensive experiments on benchmark datasets demonstrate that Espresso can improve the semantic fidelity and diversity of the synthetic data, leading to an enhancement in the performance of the global model compared to state-of-the-art OSFL methods under data heterogeneity.

Abstract:
Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a training-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically bridges the training-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data, we establish a synthetic data pipeline integrated with our training strategy to enhance precision. Qualitative and quantitative results demonstrate a positive correlation between performance and training data volume, confirming the scalability.

Abstract:
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G^2VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G^2VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations.Experimental results demonstrate G^2VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks.By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G^2VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

Abstract:
Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher's statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA--a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, especially reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior.

Abstract:
Current advancements in human motion understanding are strongly reliant on video data. Nevertheless, privacy regulations and operational constraints increasingly restrict the use of visual data in real-world scenarios. Inferring posture through wearable sensors, such as instrumented insoles measuring plantar activation, presents itself as a promising alternative. However, the absence of large-scale multimodal datasets hinders the rigorous benchmarking of these methodologies. We introduce HUMAPS-4D, a novel multimodal dataset designed for human motion analysis, effectively bridging computer vision and biomechanics. This dataset integrates synchronized motion capture, multi-view video, IMUs, plantar pressure signals, sEMG activation patterns, and high-level semantic annotations. The data was collected from 32 subjects performing 30 actions over a total duration of 14 hours. Participants demonstrate substantial anthropometric variability (age, body proportions, and morphology), which supports robust generalization across diverse body types. Distinct from existing resources, this collection offers a unique pairing of low-level physiological signals and high-level human motor descriptors. This capability enables the development of generative and inference models conditioned by both physical and semantic constraints, while simultaneously reducing the reliance on personally identifiable visual data. We establish benchmark tasks specifically targeting posture reconstruction from plantar pressure, semantic motion segmentation, physics-informed motricity analysis, and multimodal fusion under privacy-preserving conditions. The dataset, along with its associated annotation tools and visualization utilities, is scheduled for online release soon.

Abstract:
Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occurs over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID COlouR-based Video reID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state of the art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.

Abstract:
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable.However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness.We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

Abstract:
Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows.End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios.To address these issues, we propose TDATR (Table Detail-Aware Table Recognition) improves end-to-end TR through table detail-aware learning and cell-level visual alignment.TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness.The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data.Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cell and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR.We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

Abstract:
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.

Abstract:
Vision Transformers (ViTs) have achieved state-of-the-art results across various vision tasks. To enable a practical deployment of ViTs on modern hardware systems, post-training quantization (PTQ) has been actively studied in recent years. In particular, Hessian-based block reconstruction approaches have demonstrated promising results in quantizing ViT models to ultra-low bitwidths (e.g., 4-bit). However, finding a representative approximate Hessian, a fundamental step in recent approaches such as APHQ-ViT and FIMA-Q, remains underexplored in terms of the quantization-induced error and estimation cost. To address these shortcomings, we first reveal that the sample independence assumption used in recent works, which ignores the covariance term, can lead to a significant approximation error, especially for sub-four-bits. Inspired by least-squares regression, we propose LS-ViT, a block reconstruction framework that effectively estimates a representative Hessian by explicitly minimizing this approximation error across all samples. Extensive experiments with various ViT models across different vision tasks demonstrate that LS-ViT achieves new state-of-the-art performance. In addition, LS-ViT reduces quantization time compared to prior work, enabling a practical, plug-and-play, quantization-aware deployment for ViTs. The source code is available.

Abstract:
This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

Abstract:
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.

Abstract:
Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives. The code and project page will be publicly released.

Abstract:
Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D Human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences.We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure provides the first benchmark for fine-grained, per-category evaluation, revealing model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

Abstract:
Timely and accurate forecasts of severe weather events are essential for early warning and for constraining downstream analysis and decision-making. Since severe weather events prediction still depends on subjective, time-consuming expert interpretation, end-to-end "AI weather station" systems are emerging but face three major challenges: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) current multimodal language models cannot effectively process high-dimensional meteorological inputs or capture their complex spatiotemporal dependencies. To address these challenges, we introduce MP-Bench, the first large-scale multimodal dataset for severe weather events prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text caption, covering a wide range of severe weather scenarios. On top of this dataset, we develop a Meteorology Multimodal Large Model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench show that MMLM achieves strong performance across multiple tasks, demonstrating effective severe weather understanding and representing a key step toward automated, AI-driven severe weather events forecasting systems.

Abstract:
Contrastive learning is widely used for generating multimodal data representations by aligning embeddings of different modalities of the same data samples. This alignment is achieved through a loss function that treats matched and unmatched modality pairs as positive and negative samples within a data batch. However, when extending contrastive learning to scenarios involving more than two modalities, existing approaches either rely solely on fully unmatched modalities as negative samples, or fail to distinguish between partially and fully unmatched modalities, thereby overlooking the fine-grained contrasting relationships. To address this limitation, we propose Easy2Hard, a novel framework that explicitly separates partially and fully unmatched modalities. To learn from negative samples at improved granularity, Easy2Hard further introduces a sigmoid weighting curriculum that smoothly transitions the learning process from easy (partially unmatched) to hard (fully unmatched) contrasts. Comprehensive evaluations on five multimodal datasets across diverse domains demonstrate the superiority of our approach.

Abstract:
Machine Unlearning (MU), erasing undesirable content from Artificial Intelligence (AI) models, plays an essential role in developing safe and trustworthy AI systems.Despite notable advances, the baseline MU methods rely on retraining from scratch without the data to be removed, which is computationally expensive and financially prohibitive.To address this challenge, we propose a simple yet efficient training-free and retain-set-free MU algorithm designed explicitly as a diagnostic baseline: Machine Unlearning pH-Test (MUpHT).It is designed to serve as a practical evaluation reference for future MU methods.Our method eliminates the low dimensional subspaces associated with undesirable concepts from the space spanned by the model's weight vectors, thereby rendering the model "blind" to these undesirable contents. Additionally, we extend our retain-aware variant to handle entangled features by leveraging a generalized Rayleigh quotient over the undesirable and retain sets, enabling an efficient tradeoff between preserving retained knowledge and suppressing undesirable knowledge.Our method enables evaluation of MU across diverse visual tasks, including concept erasure for classification, image generation, and multimodal applications.By producing an unlearned model instantly from only a few samples, our method serves as a quick litmus test for MU.

Abstract:
This paper focuses on the alignment of flow-matching models with human preference. A promising way is fine-tuning by directly backpropagating reward signals through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

Abstract:
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data--model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval@4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and--with a LLM prompt refiner--matches or surpasses the proprietary Seedream 4.0.

Abstract:
Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing competitors.

Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has given rise to sophisticated autonomous agents capable of performing complex, human-like tasks across the web. However, this also introduces significant security risks, particularly from stealthy MLLM agents that can evade conventional detection mechanisms by mimicking human behavior. In this paper, we propose DualMirage, a novel CAPTCHA framework that proactively counters and identifies stealthy agents by exploiting fundamental disparities between human and machine perception. DualMirage employs a dual-pronged strategy: (1) Contour Illusions, which utilize cognitive principles to generate illusory contours that humans perceive effortlessly yet pose interpretation challenges for MLLMs; and (2) Adversarial Illusions, which embed human-imperceptible perturbations optimized to mislead the visual encoders of target MLLMs and thereby elicit characteristic, identifiable model responses. Evaluations on five state-of-the-art MLLMs demonstrate that DualMirage achieves an average 95.8% human success rate while blocking MLLM agents (up to 100% agent blocking rate), outperforming existing CAPTCHAs. Furthermore, DualMirage induces models to expose identities actively, achieving 58.8% white-box and 21.9% black-box attack success rates, proving effective against stealthy multimodal agents.

Abstract:
Person search, which aims to detect and re-identify individuals in unconstrained scenes, faces an inherent conflict in one-stage models: pedestrian detection focuses on shared human features, while person re-identification requires identity-specific representations. Existing approaches, such as feature decoupling and loss re-weighting, primarily address this issue in later network stages but fail to resolve early-stage feature entanglement. To overcome this limitation, we propose FSLoRA, a Freq-Spatial Low-Rank Adapter that progressively decouples task-specific features at the backbone level. FSLoRA consists of a Spatial-Level Module (SLM), which employs LoRA and a mixture-of-experts to dynamically activate task-relevant spatial features, and a Frequency-Level Module (FLM), which transforms features into the frequency domain to selectively enhance task-relevant frequency components while suppressing task-irrelevant noise. By integrating both spatial and frequency-based adaptations, FSLoRA reduces feature interference, enabling more effective joint optimization. Extensive experiments on CUHK-SYSU, PRW, and Posetrack21 demonstrate that FSLoRA not only achieves state-of-the-art performance but also serves as a plug-and-play module adaptable to various person search frameworks, offering a unified and generalizable solution for one-stage person search.

Abstract:
High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.

Abstract:
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors.We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task.

Abstract:
Many recent human motion prediction methods adopt a multi-stage refinement framework, where each stage produces an initial guess of future poses for the next stage. These guesses are progressively refined towards the target prediction through a sequence of spatial-temporal reasoning stages.However, such a cascaded design incurs large computation and memory overheads that grow at least linearly with network depth, and lack an explicit stopping criteria.In this paper, we propose MotionDEQ, a deep equilibrium motion predictor that reformulates progressive guessing paradigm as a fixed point problem within an implicit layer. This formulation is conceptually equivalent to performing infinitely many refinement steps, but requires only O(1) training memory and can be solved efficiently through any black-box solvers. We carefully design this implicit refinement process by integrating Euclidean geometric transformations into equilibrium learning, allowing the entire network to be equivariant. We also find DEQs naturally fit the real-world scenario where motion data comes streamingly: the converged fixed point can be reused as a warm initial guess, to help recycle the redundant inference computation when making subsequent predictions.Our experiments demonstrate that MotionDEQ achieves the state-of-the-art prediction performances with superior memory efficiency, using fewer than 300K parameters with 55.3mm prediction error at 400ms on the Human3.6M dataset.

Abstract:
Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.

Abstract:
The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning.

Abstract:
Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.

Abstract:
We study concept-level forgetting in pretrained vision models: removing an entire semantic category so the system no longer recognizes that object in unseen images and contexts, rather than merely forgetting specific training examples. Prior work either applies blunt global projections or fine-tunes parameters, which can introduce collateral damage to unrelated features, add compute, and become unstable as forgetting strength increases. We introduce Contrastive Subnet Erasure (CSE), a training-free, encoder-centric edit that targets a compact set of channels most responsible for the class and attenuates them in a calibrated manner. The modification is algebraically folded into the subsequent layer, yielding no inference-time overhead and leaving task heads unchanged. To evaluate whether forgetting generalizes beyond the data used to specify the class, we introduce a cross dataset protocol in which the class is defined on a source dataset and performance is measured on a disjoint target dataset drawn from a different distribution with no shared images. This setup tests whether the model still fails to recognize the object when it looks different or appears in new scenes, and it helps avoid overfitting to patterns in the source dataset. Across CIFAR 10, CIFAR 100, and ImageNet under this protocol, CSE achieves stronger forgetting of the target class while better preserving non target utility than existing baselines in both single class and multi class settings. Overall, CSE provides a simple, stable, and deployment-ready mechanism for class-level unlearning in vision.

Abstract:
Decomposing a scene into its reflectance and shading is a challenge due to the lack of extensive ground-truth data for real-world scenes. We introduce a novel physics-based approach for intrinsic image decomposition using a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities (or relative magnitudes) between visible and thermal image intensities to the ordinalities of shading and reflectance. The ordinalities enable dense self-supervision of an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse scenes. The results demonstrate superior performance over both physics-based and recent learning-based methods, providing a path toward scalable real-world data curation with supervision.

Abstract:
Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance (beyond camera pose) is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.

Abstract:
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework.Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.

Abstract:
In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at https://HumanNOVA.github.io.

Abstract:
Detecting small aerial targets (SATs) in long-range LiDAR is challenging because motion causes extreme variations in point density, breaking fixed-voxel and static-threshold assumptions in standard 3D detection and tracking. To address the challenges, we introduce A3PRL, an RL-driven adaptive perception framework that closes the loop between LiDAR sensing and tracking. A3PRL uses a sparsity-aware proposal stage with Temporal Dispersion Signatures and velocity-change cues, and a lightweight 5D policy that jointly adjusts voxel resolution, detection sensitivity, and association gating from label-free statistics of sparsity, foreground acceptance, and track stability. The policy is trained with privileged ground-truth trajectories to optimize a reward balancing geometric accuracy, temporal stability, and regularized acceptance, but runs fully label-free at test time. On the public MMAUD benchmark, training on V1 and evaluating on unseen V2/V3, A3PRL reduces 3D localization error by about 19% over its non-RL counterpart and consistently outperforms LiDAR-only and multimodal baselines under both day and night conditions. The same policy is able to transfer to other SAT datasets with heterogeneous scan patterns, maintaining accurate and stable trajectories.

Abstract:
Current research in autonomous driving heavily relies on large-scale driving datasets for model fitting or trial-and-error learning strategies in simulation environments. However, these approaches suffer from limited behavioral diversity and fail to cover complex edge-case interactions. To address these limitations, we model the driving environment as an Active Markov Game (AMG) and introduce a multi-agent co-evolutionary training framework for more realistic interactive learning. The AMG formulation extends traditional Markov games by explicitly making state transitions and rewards dependent on the evolving strategies of the agents, thus capturing the interactive dynamics and strategic coupling between the ego vehicle and surrounding agents. Building on this, our multi-agent co-evolutionary training mechanism jointly optimizes the ego vehicle's policy and a diverse pool of opponent strategies, allowing all agents to adapt to each other's behaviors during training. This game-theoretic approach produces a robust ego agent capable of handling diverse, non-stationary driving strategies, overcoming the "non-responsive opponent" limitation found in prior methods. In CARLA simulations of unsignalized intersections and long-tail scenarios, our method performs exceptionally well, achieving near-perfect success rates (98%), and significantly outperforming state-of-the-art baselines such as PPO, DDPG, and IPPO in terms of generalization, safety margins, and control smoothness. These results demonstrate that our approach substantially enhances the robustness, safety, and strategic adaptability of autonomous driving in complex multi-agent environments.

Abstract:
Recent advancements in remote sensing object detection have predominantly focused on oriented bounding box design and small object feature enhancement, while often overlooking the intrinsic geometric properties of remote sensing images, such as rotation invariance and structural symmetry. Many aerial objects appear in multiple orientations and exhibit clear symmetrical patterns, which, if not explicitly modeled, can lead to detection failures and inaccurate localization under geometric variation or partial occlusion. To address this, we propose the Rotation Invariant and Symmetry Aware Pixel Difference Network (RIS-PiDiNet), which introduces a novel convolutional operator called Rotation Invariant and Symmetry Aware Pixel Difference Convolution (RIS-PDC). This operator replaces traditional convolution with a mathematically grounded formulation that encodes rotation group priors and symmetrical constraints. RIS-PDC utilizes pixel differences and symmetry-guided aggregation in the polar harmonic space, enabling the network to infer partially visible structures and deduce occluded symmetrical parts. Besides improving detection accuracy, RIS-PDC enhances model interpretability by embedding geometric principles into the network design. Feature visualizations demonstrate rotation-consistent activations and symmetry-complete responses, revealing how the network captures underlying object structure even under partial visibility or orientation changes. This yields geometrically interpretable detection decisions. To our knowledge, RIS-PiDiNet is the first remote sensing object detection framework that jointly incorporates rotation invariance and symmetry modeling within a unified architecture. Extensive evaluations on standard benchmarks validate its effectiveness, achieving state-of-the-art performance on DOTA-v1.0 (78.53% mAP single-scale, 81.81% multi-scale), HRSC2016 (98.60% mAP), and DIOR-R (67.28% mAP), all with acceptable computational overhead and no increase in parameter count.

Abstract:
Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video.To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks.Built upon it, we further propose Hand Intent Tokens (HINT), which encode tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent.We show that our model outperforms others in different backbones and model sizes.In particular, HINT-14Bachieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%.To further facilitate the open research, we will release the code, model, and dataset.

Abstract:
Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics.In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose PHANTOM, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, PHANTOM jointly predicts latent physical dynamics and generates future video frames.PHANTOM leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, PHANTOM produces video sequences that are both visually realistic and physically consistent.Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that PHANTOM not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

Abstract:
Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods. Code, pretrained models, and reannotations will be released.

Abstract:
The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS is primarily caused due to the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converges to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.

Abstract:
Crafting adversarial examples can be formulated as an optimization problem. While sign-based optimizers such as I-FGSM and MI-FGSM have become the de facto standard for the induced optimization problems, there still exist several unsolved problems in theoretical grounding and practical reliability especially in non-convergence and instability, which inevitably influences their transferability. Contrary to the expectation, we observe that the attack success rate may degrade sharply when more number of iterations are conducted. In this paper, we address these issues from an optimization perspective. By reformulating the sign-based optimizer as a specific coordinate-wise gradient descent, we argue that one cause for non-convergence and instability is their non-decaying step-size scheduling. Based upon this viewpoint, we propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes (MDCS) within sign-based optimizers. Typically, we further provide theoretical guarantees proving that MDCS-MI attains an optimal convergence rate of O(1/\sqrt T ), where T is the number of iterations. Extensive experiments on image classification and cross-modal retrieval tasks demonstrate that our approach not only significantly improves transferability but also enhances attack stability compared to state-of-the-art sign-based methods.

Abstract:
Existing Incremental Learning (IL) methods are primarily evaluated under either a single-domain class-incremental setting, or a multi-domain task-incremental setting with known task identifiers. However, these assumptions often fail to hold in real-world applications. To bridge this gap, we introduce Heterogeneous Incremental Learning (HIL), a new setting for evaluating IL methods under realistic and challenging conditions, where task boundaries are ambiguous or unknown, class distributions shift dynamically across environments, and training data is limited. Model editing is inherently well-suited for this challenging HIL, as it allows for the efficient integration of new knowledge while preserving model capabilities. Thus, we propose a novel Sparse and Anchored Model Editing (SAME) for addressing HIL. Specifically, SAME sparsely and selectively updates task-relevant model parameters to extract compact, task-specific key-value knowledge pairs from limited data. Using these task knowledge pairs, the model performs knowledge injection for new tasks under double-anchor constraints. The knowledge anchor aligns the updated and original model features, while the parameter anchor constrains parameter magnitudes, ensuring stable and consistent knowledge injection. Our method can efficiently solve HIL using only a few labeled examples, without introducing additional model parameters. Extensive experiments on 11 diverse visual-language datasets across 22 sequential tasks show that our method outperforms existing continual learning approaches by 6.8% in average accuracy, while retaining 95.8% of the oracle model performance, demonstrating strong stability and cross-domain generalization.

Abstract:
Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3xspeedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.

Abstract:
Large-scale text-to-image (T2I) diffusion models deliver remarkable visual fidelity but pose safety risks due to their capacity to reproduce undesirable content, such as copyrighted ones. Concept erasure has emerged as a mitigation strategy, yet existing approaches struggle to balance scalability, precision, and robustness, which restricts their applicability to erasing only a few hundred concepts. To address these limitations, we present Erasing Thousands of Concepts (ETC), a scalable framework capable of erasing thousands of concepts while preserving generation quality. Our method first models low-rank concept distributions via a Student's t-distribution Mixture Model (tMM). It enables pin-point erasure of target concepts via affine optimal transport while preserving others by anchoring the boundaries of target concept distributions without pre-defined anchor concepts. We then train a Mixture-of-Experts (MoE)-based module, termed MoEraser, which removes target embeddings while preserving the anchor embeddings. By injecting noise into the text embedding projector and fine-tuning MoEraser for recovery, our framework achieves robustness to white-box attack such as module removal. Extensive experiments on over 2,000 concepts across heterogeneous domains and diffusion models demonstrate state-of-the-art scalability and precision in large-scale concept erasure.

Abstract:
Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number (k_\text cat ), Michaelis constant (K_\text m ), and inhibition constant (K_\text i ) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. Experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.

Abstract:
Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.

Abstract:
Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.

Abstract:
Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning.On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing, unconditional generation, few-shot learning, transfer, and robustness benchmarks. It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines.Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content. Code will be released upon acceptance.

Abstract:
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

Abstract:
High-quality pixel-level responses remain a major bottleneck for multimodal large language models (MLLMs) in regional perception. Existing approaches generally attach regression decoders to MLLM features, achieving strong grounding performance but compromising end-to-end design and increasing training costs. Researchers have applied parameter and data scaling to improve pure MLLMs' ability to generate pixel coordinates in natural language, yet the performance gains on grounding tasks remain markedly weaker than those in standard QA tasks. Our analysis shows the primary bottleneck is that conventional scaling fails to effectively enhance the key reasoning stage required for pixel-level regional perception. To address this, we propose R-Ground, a reasoning framework for MLLM-based grounding built upon a multimodal Monte Carlo Tree Search algorithm. R-Ground leverages structured reasoning actions, multimodal feature alignment scoring, and regional feature weighted voting to perform scaling at the designated reasoning stage. Extensive experiments demonstrate that R-Ground achieves effective reasoning scaling, enabling a 7B MLLM to match or even surpass a 72B model on the grounding task.

Abstract:
We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a "Match-then-Fuse" probabilistic cold-fusion scheme that combines two independent correspondence posteriors, that of the generated-RGB branch with that of the raw geometric branch. This principled fusion preserves each modality's inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where imagery is unavailable.

Abstract:
Color grading is central to High Dynamic Range (HDR) video production, shaping the perceptual tone, contrast, and luminance of content across diverse displays. However, evaluating HDR color grading quality is particularly difficult due to its semantic, content-dependent nature and the lack of large-scale annotated data. While pre-trained Vision-Language Models (VLMs) offer strong semantic priors and generalization ability, their exposure is limited to Standard Dynamic Range (SDR) data, making them poorly equipped to handle HDR photometry and perceptual nuances. We propose HDR-VLM, the first method to adapt a VLM to the HDR domain for perceptual quality assessment. Specifically, HDR-VLM employs a two-stage design: it first bridges the domain gap using a unified HLG-based encoding and progressive adaptation; then it aligns model assessments with noisy, multi-scale human preferences via reinforcement learning with curriculum-inspired rewards. Experiments on a real-world, production-sourced HDR dataset show that HDR-VLM not only outperforms existing quality assessment methods but also produces interpretable attribution rationales. These rationales offer actionable guidance for content creators, enhancing the reliability and transparency of automated HDR quality evaluation.

Abstract:
Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference.

Abstract:
Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Abstract:
Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90% reduction in Gaussian-count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods.

Abstract:
Spiking Neural Networks (SNNs) naturally process visual inputs across multiple timesteps, offering rich temporal dynamics and energy-efficient computation. However, the temporally invariant supervision commonly used in training tends to reinforce the same dominant response patterns across timesteps, leading to redundant representations and limiting temporal discriminability. To overcome this limitation, we introduce Temporal Representation Enhancement(TRE), a novel learning-to-forget paradigm that encourages more diverse and complementary temporal representations. TRE identifies high-contribution semantic patterns through class-specific contribution estimation and temporal accumulation, and selectively suppresses them using a dynamic modulation strategy. By redirecting the model's attention toward alternative yet informative semantic cues, TRE promotes the learning of complementary features across timesteps. This approach not only strengthens the temporal discriminative capacity of SNNs but also enables more effective multi-timestep learning by leveraging richer semantic information. Extensive experiments on both static image datasets and dynamic neuromorphic datasets demonstrate that TRE consistently improves classification accuracy and feature diversity across different SNN backbones.

Abstract:
Recent research on medical MLLMs has shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require joint fine-tuning of the MLLM and external pixel decoders, increasing the risk of catastrophic forgetting and limiting out-of-domain generalization. Second, most methods rely on single-pass reasoning and lack the ability to iteratively refine segmentation results. To overcome these limitations, we propose IBISAgent--a novel agentic MLLM that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. We also design a two-stage training framework consisting of cold-start SFT and agentic RL with tailored, fine-grained rewards. Through iterative multi-turn visual reasoning, IBISAgent naturally facilitates mask refinement and enhances robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods.

Abstract:
Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference of miss. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we enhance the graph network through heterogeneous weight matrices, enhancing its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.

Abstract:
Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modalities, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code are avaliable at GitHub.

Abstract:
Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, a semantic feature remix applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel's Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.

Abstract:
While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance. Code will be made available upon acceptance.

Abstract:
To effectively manage the complexities of real-world dynamic environments, continual learning must incrementally acquire, update, and accumulate knowledge from a stream of tasks of different nature - without suffering from catastrophic forgetting of prior knowledge. While this capability is innate to human cognition, it remains a significant challenge for modern deep learning systems. At the heart of this challenge lies the stability-plasticity dilemma: the need to balance leveraging prior knowledge, integrating novel information, and allocating model capacity adaptively based on task complexity and synergy. In this paper, we propose a novel exemplar-free class-incremental continual learning (ExfCCL) framework that addresses these issues through a Hierarchical Exploration-Exploitation (HEE) approach. The core of our method is a HEE-guided efficient neural architecture search (HEE-NAS) that enables a learning-to-adapt backbone via four primitive operations - reuse, new, adapt, and skip - thereby serving as an internal memory that dynamically updates selected components across streaming tasks. To address the task ID inference problem in ExfCCL, we exploit an external memory of task centroids proposed in the prior art. We term our method CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). CHEEM is evaluated on the challenging MTIL and VDD benchmarks using both Tiny and Base Vision Transformers and a proposed holistic Figure-of-Merit (FoM) metric. It significantly outperforms state-of-the-art prompting-based continual learning methods, closely approaching full fine-tuning upper bounds. Furthermore, it learns adaptive model structures tailored to individual tasks in a semantically meaningful way.

Abstract:
3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential in human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from the stronger teacher to the student model, so that the student enhances performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, overlooking the visual-inertial inherent semantic mismatch and information density difference leads to difficulties for students to learn coupled priors. In this paper, we propose a Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D HPE from Sparse IMUs (MGDHand). We first pre-train a MANO-IMU fusion model as a teacher to encode static geometric morphology prior, dynamic kinematic prior and temporal motion prior. Then, a Multi-Granularity Decoupled Distillation (MGDistill) scheme is proposed to bridge the semantic gap. MGDistill includes a Static Shape Distillation module to transfer time-invariant hand shape priors, and a Dynamic Pose Distillation module to transfer complex joint kinematics and dense pose priors. Additionally, a Temporal Motion Distillation module transfers the fast-changing motion priors (velocity and acceleration). Extensive experiments on public dataset demonstrate that our method significantly outperforms state-of-the-art approaches under sparse IMUs.

Abstract:
Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego->Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components.

Abstract:
Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While fine-tuning foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel fine-tuning framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.

Abstract:
We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.

Abstract:
In-Context Learning (ICL) has shown great effectiveness in developing generalist image segmentation models. Its significant advantage over text-based descriptions is the ability to convey intricate visual appearance details through simple reference images. However, finding a perfectly matching single example for real-world rare and complex concepts is difficult. Moreover, existing methods are largely confined to semantic or instance-level understanding of the reference image, struggling to express more precise segmentation needs through the input. To address this, we propose CDICS, a novel framework that leverages Compositional prompts and phased task Decoupling to achieve compositional prompt-controlled In-Context Segmentation. Our method introduces compositional prompts derived from reference prompts, combining semantic, part and color images to dynamically define segmentation targets. To effectively fuse this control information, ensure synergy while suppressing interference, and mitigate feature coupling risks, our decoupled two-stage architecture firstly performs coarse-grained semantic localization, then refines the result using compositional appearance prompts to precisely match the specified attributes. This design extends traditional in-context segmentation, enabling it to support compositional prompts. Additionally, we reconstructed two datasets and their benchmarks to acquire data with part-color-specific attributes. Our method demonstrates superior performance on the compositional prompt-controlled in-context segmentation task. It also extends the capabilities of existing in-context segmentation, and makes an attempt toward real-world fine-grained segmentation.

Abstract:
With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.

Abstract:
Video highlight detection aims to identify the most engaging segments in long-form videos, supporting content editing and recommendation, especially for movies and TV dramas. However, existing methods are ill-suited to cinematic content due to its narrative complexity, while the scarcity of annotated data and the high cost of manual labeling further hinder progress. To bridge this gap, we introduce TVHighlights, the first large-scale dataset tailored for video highlight detection in movies and TV dramas, with 1,721 carefully curated videos covering diverse genres. Built on community-driven behaviors, it provides realistic and diverse annotations without human labeling. Based on TVHighlights, we propose LTV-HD: a LLM-guided, human-free collaborative training framework for video highlight detection in cinematic content. LTV-HD operates in two stages: (1) weakly supervised pre-training of a lightweight model using video-level labels, followed by (2) iterative refinement through collaboration between large language models (LLMs) and the lightweight model. LLMs generate noisy clip-level pseudo-labels, which the lightweight model learns from under a noise-robust strategy, and its high-confidence predictions are then fed back to guide the LLM in distilling genre-specific highlight patterns through a self-improving loop. Experiments demonstrate that LTV-HD achieves state-of-the-art performance on TVHighlights, validating its effectiveness in real-world, annotation-free scenarios.

Abstract:
The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment--particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

Abstract:
Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low level distortions and lack the functional criteria required for e-commerce design. It is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build the first dataset E-comIQ-18k to feature multi dimensional scores and expert calibrated Chain of Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show our E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this area.

Abstract:
The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images but costly to generate and impractical for inference. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry snapshot, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive forecasting can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. From forecasted dynamics, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. The separation between forecasting and inference provides modular flexibility, interpretability, and robust uncertainty quantification. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A and M87, by testing on simulated frames blurred to EHT resolution and real EHT images of M87. Ultimately, our methodology establishes a scalable paradigm for solving inverse problems, demonstrating the potential of learned dynamics to unlock insights from resolution-limited scientific data.

Abstract:
Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure---requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs such as GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.

Abstract:
Foundation models have achieved success in computational pathology, demonstrating generalization across histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they fail to capture the coherent tissue architecture beyond patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.

Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.

Abstract:
Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes.However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes.Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfA and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts.We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfA's while keeping rendering quality, based on an assigned level of feature-variance.To effectively evaluate our model's capability to handle real-world long-range 4D motion, we newly compose long-range 4D motion-contained dataset, called SelfCap_ \text LR . It has larger average dynamic motion magnitude, captured at spatially wider spaces, compared to previous dynamic video datasets.Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations. The code and project page will be publicly released.

Abstract:
Self-supervised monocular depth estimation has achieved remarkable progress in recent years, yet frequency aliasing and the lack of fine-grained cross-frame motion modeling still lead to blurred depth boundaries and suboptimal camera motion estimation.To address these challenges, we propose a progressive self-supervised framework that integrates a Frequency-Guided Depth Network (FGDepth) and a PoseQuery Network (PQNet). FGDepth incorporates a plug-and-play Frequency-Guided Sampling module that explicitly enhances high-frequency details and suppresses aliasing artifacts, producing depth maps with sharper boundaries. PQNet employs channel-aligned attention to model fine-grained cross-frame motion features, enabling more accurate and robust camera motion estimation. Furthermore, we design a progressive three-stage decoupled training strategy that effectively leverages the complementarity between depth and pose estimation, further improving overall performance.Extensive experiments on the KITTI benchmark demonstrate state-of-the-art performance, achieving a 4.1% reduction in Sq Rel over strong baselines, and our method also exhibits excellent cross-dataset generalization on Make3D. Ablation studies further validate the effectiveness of each proposed component.

Abstract:
We propose MIRA (Multimodal Imagination for Reasoning Assessment), a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional Chain-of-thought (CoT) methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images --- such as sketches, structural diagrams, or path drawings --- to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone (e.g., tracking a die's movement on a board and summing the face-down values after each roll). To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT (Text-CoT) input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models (MLLMs), including strongest private models (e.g., GPT-5, o3, Gemini 2.5 Pro) as well as strong open-weight models (e.g., Qwen2.5-VL, GLM 4.5V), perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.

Abstract:
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

Abstract:
Neural networks are often treated as monolithic black boxes that process all inputs uniformly through all layers. However, researchers intuitively wonder: do simple images require all 50 layers of ResNet-50, or is the prediction effectively decided much earlier? We investigate when pretrained models make up their minds during a forward pass by training linear probes at each layer of ResNet variants on ImageNet, without modifying the base model. Our findings reveal substantial computational heterogeneity across architectures. We discover pronounced bimodal patterns with distinct populations of early and late deciders: in deeper ResNets, 33--39% of samples stabilize within the first third of the network, while 39--42% are assigned to the final 30% of depth; across all three ResNets, the late fraction ranges from 39% to 54%. The decision layer is highly sensitive to stability criteria, with mean depths increasing from 2.6--4.1 (k=1) to 9.0--10.0 (k=4). Linear probe accuracy exhibits sharp jumps in final residual stages, reaching 73--75% for ResNet-50/101 and 65% for ResNet-18, indicating that semantic consolidation occurs late. These findings expose computational heterogeneity in standard inference and clarify both the promise and the limits of stability-based early exiting.

Abstract:
Spatial transcriptomics (ST) has emerged as a revolutionary approach in the field of tissue analysis that can offer spatial resolved molecular insights for the identification of anomalous regions (AR) on human cancers. Current ST-based methods for detecting AR focus narrowly on the molecular features of local tissue spots, overlooking the matched bulk RNA-seq data that contains crucial diagnostic information. This oversight limits their effectiveness in identifying subtle or heterogeneous tumors, where accurate detection depends on broader genetic context. Besides the genomic signatures, the pathological images can also provide rich visual information to reflect the morphology of AR. To utilize the patient-level diagnostic knowledge and harness complementary information from both histology images and ST, we develop a Bulk RNA-seq Guided Multi-modal Anomalous Regions Detection method (BRGMAR) for the identification of AR from human tissues. Specifically, to effectively model the dependencies in ST, we introduce a Dynamic Multi-Relational Graph Learning (DMRGL) module to adaptively capture complex relationships in ST, including both spatial proximity and gene expression similarity. Then, we design an Optimal Transportation-based Gene Module Alignment (OTGMA) approach to align ST data with patient-level bulk RNA-seq data by matching the compositional and functional similarities of their corresponding gene modules. Finally, we combine the learned genomic features with pathological image representations for accurate AR detection. We evaluate our method on three public available ST datasets for the purpose of identifying cancerous regions from normal tissues, and the experimental results demonstrate the advantage of our method in comparison with the existing studies.

Abstract:
Multi-view image compression (MIC) aims to achieve high compression efficiency by exploiting inter-image correlations, playing a crucial role in 3D applications. As a subfield of MIC, distributed multi-view image compression (DMIC) offers performance comparable to MIC while eliminating the need for inter-view information at the encoder side.However, existing methods in DMIC typically treat all images equally, overlooking the varying degrees of correlation between different views during decoding, which leads to suboptimal coding performance. To address this limitation, we propose a novel OmniParallax Attention Mechanism (OPAM), which is a general mechanism for explicitly modeling correlations and aligned features between arbitrary pairs of information sources.Building upon OPAM, we propose a Parallax Multi Information Fusion Module (PMIFM) to adaptively integrate information from different sources. PMIFM is incorporated into both the joint decoder and the entropy model to construct our end-to-end DMIC framework, ParaHydra.Extensive experiments demonstrate that ParaHydra is the first DMIC method to significantly surpass state-of-the-art MIC codecs, while maintaining low computational overhead. Performance gains become more pronounced as the number of input views increases. Compared with LDMIC, ParaHydra achieves bitrate savings of 19.72% on WildTrack(3) and up to 24.18% on WildTrack(6), while significantly improving coding efficiency (as much as 65xin decoding and 34xin encoding).

Abstract:
Personalized Federated Learning (PFL) has gained significant attention for enabling participating clients to train customized personalized models on non-IID local data. However, current PFL methods mainly suffer from two limitations: 1) Only the personalized part supports heterogeneous design, while the shared part must remain homogeneous. 2) The semantic representations of models generated on different clients with non-IID data characteristics inevitably tend to be inconsistent, negatively impacting model performance. To conquer them, this paper proposes a novel resource-adaptive personalized Federated Learning via Anchor-driven Representation Alignment (FedARA). Concretely, we design a low-rank decomposition and reconstruction fusion scheme for shared feature extractors based on the matrix decomposition technology, where each client can autonomously set the rank value based on its locally available resources, controlling the complexity of extractors and naturally reducing communication and computational costs. Moreover, to address the inconsistency of feature spaces across clients, an anchor-driven representation consistency learning mechanism is developed, which can guide client models to learn unified feature representations and alleviate global knowledge forgetting, thereby improving personalized model performance. Extensive experimental results demonstrate that our method significantly outperforms seventeen state-of-the-art baselines in diverse heterogeneous scenarios with less communication and computational costs.

Abstract:
Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, these approaches often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in \mathrm DET _ \text l , +5.4 in \mathrm TOP _ \text ll on subset_A and +11.0 in \mathrm DET _ \text l , +7.9 in \mathrm TOP _ \text ll on subset_B, validating the effectiveness of the proposed components. The code will be shared publicly upon paper acceptance.

Abstract:
The rapid evolution of generative AI, including such models as Sora, has intensified the threat of video misinformation. A critical challenge in detecting these AI-generated video misinformation lies in a fundamental disconnect between existing datasets and practical deception tactics. Current datasets often disrupt cross-modal consistency through editing techniques, resulting in unrealistic and easily detectable artifacts. By contrast, generative video misinformation strives for semantic consistency across modalities to remain realism. To address this gap, we introduce RAVM: the first Realistic AI-Generated Video Misinformation Detection Dataset. Unlike existing Video Misinformation Detection (VMD) datasets that are limited to single-source manipulations, RAVM encompasses multiple manipulation sources--Claim, Video, Audio, and Cross-Modal Manipulation--each incorporating diverse manipulation techniques to generate realistic AI-generated video misinformation. To achieve this, we introduce an agent-driven framework for generating realistic video misinformation. Furthermore, we propose an IEEG model that represents multimodal evidence, fact-checking results, and their dependencies as an evidence graph for interpretable detection of AI-generated video misinformation. Extensive experiments on RAVM reveal the vulnerability of existing Multimodal Large Language Models (MLLMs) in detecting AI-generated video misinformation, while the proposed IEEG achieves state-of-the-art performance on RAVM. The dataset is publicly available at: https://gitee.com/VR_NAVE/ravm

Abstract:
Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy.

Abstract:
Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diverse hypotheses, facilitating robust learning under noisy pseudo-labels supervision with enhanced stability. These multi-hypothesis are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

Abstract:
Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at https://github. com/jiwooha20/FLB

Abstract:
Accurate uncertainty estimation is essential for reliable appearance-based gaze tracking. However, domain shifts between training and testing often lead to incorrect uncertainty estimates, which is a problem overlooked in existing uncertainty-aware gaze tracking models. To overcome this problem efficiently, we formulate uncertainty estimation as a conditional distribution problem and treat the correction process as an output-level conditional distribution matching task. We therefore introduce a data-efficient post-hoc calibration method to align the predicted, high-error conditional distribution with the empirically observed distribution extracted from a small set of calibration samples. To more faithfully assess the accuracy of the resulting uncertainty estimates, we further introduce a new metric, Coverage Probability Error (CPE), to quantify the distribution-level mismatch between prediction and observation. We validate the calibration procedure across four domain shift scenarios to demonstrate improved uncertainty accuracy and its practical benefits.

Abstract:
Exploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain-3D datasets, conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS -- an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain, combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectively learning the feature distribution of Brain-3D and providing initial control for 3DGS. The Multi-View Stabilizer (MVS) overcomes the challenge of capturing multi-view changes of 3D objects, creating more robust viewpoint representations. Comprehensive experiments and discussions on fMRI/EEG show the SOTA performance (2.936 FPD, 0.202 LPIPS) of BrainGS, providing reliable neural interpretations, offering new insights into brain stereovision understanding.

Abstract:
Deep learning-based lane detection (LD) plays a critical role in autonomous driving and advanced driver assistance systems. However, its vulnerability to backdoor attacks presents a significant security concern. Existing backdoor attack methods on LD often exhibit limited practical utility due to the artificial and conspicuous nature of their triggers. To address this limitation and investigate the impact of more ecologically valid backdoor attacks on lane detection models, we examine the common data poisoning attack and introduce DBALD, a novel diffusion-based data poisoning framework for generating naturalistic backdoor triggers. DBALD comprises two key components: optimal trigger position finding and stealthy trigger generation. Given the insight that attack performance varies depending on the trigger position, we propose a heatmap-based method to identify the optimal trigger location, with gradient analysis to generate attack-specific heatmaps. A region-based editing diffusion process is then applied to synthesize visually plausible triggers within the most susceptible regions identified previously. Furthermore, to ensure scene integrity and a stealthy attack, we introduce two loss strategies: one for preserving lane structure and another for maintaining the consistency of the driving scene. Consequently, compared to existing attack methods, DBALD achieves both a high attack success rate and superior stealthiness. Extensive experiments on 4 mainstream lane detection models show that DBALD exceeds state-of-the-art methods, with an average success rate improvement of +10.87% and significantly enhanced stealthiness. The experimental results highlight significant practical challenges in ensuring model robustness against real-world backdoor threats in lane detection. Our data and demos are available at https://sites.google.com/view/dbald.

Abstract:
In recent years, the gait parsing sequence has become increasingly popular due to its higher information entropy than the binary silhouette and the keypoint-based skeleton. However, existing parsing-based gait recognition methods have not fully explored the complex, non-linear relationships between features at different positions, semantic, and temporal dynamics levels, i.e., higher-order correlations. To unleash the power of parsing between human body parts and temporal dynamics, this paper proposes a novel hypergraph-based gait recognition framework, named HyperGait. The HyperGait contains a global head and two elaborately-designed modules. In particular, the Spatial Hypergraph Convolutional Module (SHCM) and the Temporal Hypergraph Convolutional Module (THCM) are designed to explore the high-order spatial-level and temporal-level features, respectively.The SHCM extracts fine-grained relationships between human body parts through the hypergraph.The THCM performs the high-order temporal information between temporally related human body parts.Comprehensive experiments on two large-scale gait datasets, i.e., Gait3D and SUSTech1K, show the superior performance of our proposed HyperGait.In highly challenging real-world scenarios, with only parsing as input, our HyperGait achieves the Rank-1 accuracy of 80.5% on the Gait3D dataset.

Abstract:
Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from H&E-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, HINGE outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.

Abstract:
Source-free domain adaptation (SFDA) adapts a pre-trained source model to an unlabeled target domain using only the model itself, typically relying on pseudo labeling augmented with auxiliary knowledge and consistency regularization (CR) mechanisms to alleviate noise in the generated pseudo labels. However, existing approaches overlook the geometric structure of the target embedding manifold when assigning pseudo labels, resulting in unreliable distance measurements and consequently severe mislabeling. Moreover, existing CR is applied solely to output logits, making it insensitive to feature-level reliability. To solve these issues, we propose a novel pseudo labeling scheme based on Feature universe, which is an expanded embedding space that models class-wise target distributions and Gravity consistency (GV) regularization, which modulates consistency strength according to feature-level similarity. Our pseudo labeling strategy first models the embedding space with virtual features to construct a feature universe. On this space, pseudo labels are generated through feature traversal, which propagates labels only from statistically reliable regions. In addition, GV jointly encourages logit- and feature-level consistency, aligning predictions for augmented images while preserving the geometric structure of the embedding space. It further modulates the strength of CR for each sample, preventing the confirmation of noisy pseudo labels through a gravity-based force defined between two input embeddings. Experiments on Office-Home, DomainNet-126, and VisDA-C demonstrate consistent improvements over prior SFDA methods, and incorporating GV into baselines yields additional gains.

Abstract:
Multimodal Large Reasoning Models (MLRMs) exhibit remarkable performance on complex tasks by incorporating explicit multi-step reasoning. However, this capability also introduces new security vulnerabilities. Existing jailbreak studies largely overlook Cognitive-level weaknesses embedded in the reasoning process itself. In this work, we uncover a critical cognitive bias in MLRMs: the anchoring effect, where safety judgments are disproportionately influenced by the first piece of information received--the anchor. Building on this finding, we propose the Reasoning-chain Anchoring Attack (RA-Attack), a novel jailbreak framework that fully exploits this vulnerability. RA-Attack employs a cross-modal safe anchor, whose core component is a structured visual mind map. This structured format provides the model with a pre-established, safety-biased reasoning chain that subtly induces it to rationalize and execute subsequent harmful intent. Extensive experiments across seven leading closed- and open-source MLRMs demonstrate the effectiveness of RA-Attack, achieving state-of-the-art jailbreak success rates--92% on Gemini-2.5-Pro and 82% on GPT-4o. Our findings reveal that cognitive biases can be systematically exploited to manipulate multimodal reasoning chains, establishing cognitive security as a critical and underexplored frontier in AI safety research. Warning: This paper contains unsafe examples.

Abstract:
Contrastive Language-Image Pre-training (CLIP) offers a new paradigm for Weakly Supervised Semantic Segmentation (WSSS) by generating Class Activation Maps (CAMs) from text-image alignment. Existing methods primarily rely on hand-crafted templates or general attribute descriptions generated by a large language model to construct text prototypes for querying visual features. However, these strategies faces two major limitations: the inherent modality gap in CLIP prevents text prototypes achieving tight alignment with visual features; and their static text prototypes cannot adaptively respond to target instances that exhibit diverse visual attributes. To address these challenges, our key insight is to directly construct instance-specific visual description prototype as query, thereby bypassing the suboptimal static text description optimization. To this end, we propose the Visual Description Assembly (VDA) framework. It employs a probabilistic model to map complex CLIP visual features into a structured latent space. This latent space allows us to explicitly disentangle and aggregate varied visual attributes, and then dynamically assemble them into instance-specific visual prototypes. Furthermore, to enhance the robustness of this prototype, we adaptively incorporate the semantically stable text prototype into it as the final query for generating superior CAMs. Experimental results show our method outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks.

Abstract:
Building reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edges proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments on the large-scale Building3D dataset demonstrate that our powerful and versatile GREO integrates seamlessly into existing edge-based methods as a plug-and-play training strategy, achieving state-of-the-art performance on both the Entry-level and Tallinn benchmarks while adding no inference overhead.

Abstract:
Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset over 50K trajectories across eight dexterous hands (6-24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories to enable human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.

Abstract:
Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

Abstract:
Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code will be released upon acceptance.

Abstract:
Crafting prompts via Prompt Engineering that steer a model's internal representations toward specific and pre-defined outcomes can be time-consuming, often requiring multiple iterations. Hard Prompt Inversion offers a complementary workflow: start from a reference image and generate a prompt that conditions a text-to-image (T2I) model to reconstruct the reference image. Existing inversion methods either yield incoherent text, or produce prompts that are overly sensitive to downstream token edits. We propose a dLLM-based prompt inversion framework that yield prompts that are (i) more interpretable to humans, (ii) better aligned with the reference image, and (iii) designed for downstream token swap and token append operations (aka edit-friendly prompts). The method is plug-and-play, requiring no finetuning of either the T2I model or the dLLM. Experiments across three datasets show a ~10xreduction in inversion time relative to existing prompt-inversion baselines, higher interpretability scores, and significantly higher prompt editability, as measured by TIFA, GPT-V preference scoring, and controlled user studies, all while preserving high-fidelity image generation. By coupling diffusion-time sampling with token-similarity control inside a dLLM decoder, our approach extends prompt inversion beyond reconstruction to downstream token-editing tasks, enabling faster, more transferable prompts that generalize across multiple T2I models.

Abstract:
Simulation of cellular morphology change has long been a fundamental task in quantitative biology and high-throughput screening, with the potential to accelerate therapeutic development and elucidate disease mechanisms beyond empirical clinical practice. However, the vast perturbation space poses challenges to the discriminative formulation, and existing generative approaches tend to concentrate on the same trajectory subspace, making their generated paths prone to drift. In this paper, we propose a novel Steered Diffusion Bridge approach, named SimuSDB, to define deterministic probabilistic trajectories between two distinct state domains for cell response generation. We first extend the diffusion bridge paradigm to maintain stochasticity and diversity in interpolation trajectories by introducing Brownian bridges. Then, SimuSDB generates cell morphologies that comply with phenotypic constraints, while allowing the latter to explicitly guide the generative process. For the inference stage, we formalize the rule-guided sample generation task as an optimal control problem within a stochastic dynamical system. This way, the generative model can achieve analytically tractable optimal control strategies and steered generation without collapsing toward the trajectory of the same data subspace. Comprehensive experiments demonstrate the superior performance of SimuSDB across various applications, including chemical perturbation and genetic perturbation.

Abstract:
Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain.Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations.However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features.Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Models (SSMs) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of State Space Models (SSMs) to capture both global and local domain-invariant features.Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM).IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment.OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment.Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

Abstract:
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primary targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. All of them often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under the attention guidance, which then optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip will be considered as the keyframe anchors to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performances across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity.

Abstract:
Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in denoising process, but overlooks the architectural redundancy within the DiT that many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model's output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reuse or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67 xlatency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.

Abstract:
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into many existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.

Abstract:
Table structure recognition in document AI faces significant challenges due to layout inconsistencies, merged cells, and complex nested structures, which is further exacerbated by the scarcity of large, diverse annotated datasets. In this paper, we present PIX-TAB (Efficient PIXel-Precise TABle Structure Recognition Approach) that provides exact, pixel-level structure using a small, lightweight model that can run on-device. The approach is language-agnostic, as it allows adding support for a new languages simply by replacing the Optical Character Recognition (OCR) model without modifying to the core structure recognition model. Key innovations include: position-aware pixel-precise tokens for deterministic cell reconstruction; speculative decoding for faster sequence generation, and training-only box supervision to stabilize spatial grounding; region-based image segmentation. To mitigate data scarcity we propose a pipeline for generating a large synthetic table dataset. Experimental results validate each component. To address the limitations of existing evaluation methods we introduce TEDS_ struct100 and TEDS_ 100 metrics. Speculative decoding approach significantly improves recognition speed while maintaining accuracy. Finally, the combined techniques enable a mobile-optimized model that is more than three times faster than the full-size version.

Abstract:
Precise fingertip contact detection is a fundamental challenge for natural and immersive virtual reality (VR) interaction. However, existing vision-based methods suffer from insufficient accuracy, with typical depth errors (12-25 mm) being too large to reliably distinguish between hovering and true contact (<3 mm). While commercial motion capture systems provide sub-millimeter accuracy, their prohibitive cost limits widespread adoption. This paper addresses this critical gap by developing a highly accurate and cost-effective system for fingertip contact detection. We introduce a novel, specialised dataset of 53,300 RGB-depth pairs capturing millimeter-scale, hand-table typing interactions. By systematically fine-tuning six state-of-the-art depth estimation architectures on this dataset, we reduce the mean absolute error (MAE) by 68%, from 12.3 mm to a state-of-the-art 3.8 mm. Our complete VR keyboard system, TapBoard-X, achieves 95.96% contact detection accuracy and enables typing speeds of 45.6 WPM with a low 3.1% character error rate, rivalling physical keyboards. This performance is achieved at over a 90% cost reduction compared to commercial systems, democratising high-precision hand tracking for the broader research community and paving the way for the next generation of tactile VR experiences.

Abstract:
Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query) (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods while achieving mAP@10 of 28.5%, still retrieves irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries performs 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.

Abstract:
Current query-driven 3D understanding methods are constrained to static point clouds, limiting their ability to reason about dynamic scenes. To bridge this gap, we propose MORE-STEM, a unified framework for Long-Short MemOry REcall and Spatio-TEmporal Consistency Model in Query-Driven 3D/4D Point Cloud Segmentation. The framework first introduces a Cross-Frame Text-Visual Alignment module that establishes fine-grained, time-aware correspondences between linguistic queries and dynamic 3D features. Building on this, a Spatio-Temporal Consistency Model module enforces motion-aware coherence across consecutive frames, ensuring stable and temporally consistent segmentation. A Long-Short Memory Recall module further enhances cross-scene reasoning through hierarchical memory that balances long-term semantic recall and short-term adaptation. We also construct a new outdoor benchmark for both 3D and 4D instruction segmentation with temporally aligned, motion-centric text annotations. Experiments demonstrate that MORE-STEM achieves state-of-the-art performance across multiple 3D and 4D understanding tasks.

Abstract:
Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzles datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.

Abstract:
Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules--the garment details enhancement module and the collaborative background optimization module--are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810 x 1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios. The dataset and code will be publicly released.

Abstract:
Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with \mathcal O (N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.

Abstract:
Mamba, originally introduced for language modeling, has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba's representational properties and make three primary contributions. First, we theoretically analyze Mamba's relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba's capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba's potential for interpretability. Notably, our model also achieves a 78.5% linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.

Abstract:
Both detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches.To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Module (ZRCA). HZRS employs reinforcement learning to resolve the non-differentiable issue of selecting high-value matching regions, while ZRCA improves region alignment through three stages: understand, coordinate, and accelerate.Extensive experiments and ablation studies on RGB-D Scenes v2 and 7-Scenes demonstrate the superiority of our network, establishing it as the state-of-the-art for image-to-point cloud registration.

Abstract:
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. However, the Global Alignment approach prevalent in egocentric-wearable settings fails in this new setting due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal features that distort negative relationships. To resolve these problems, we propose DETACH, a decomposed spatio-temporal alignment framework. By decomposing both modalities into spatial-temporal components, we preserve subtle temporal cues in videos from spatial features, and convert implicit sensor channel activations into explicit spatial prototypes via online clustering, thereby establishing cross-modal spatial grounding. To avoid over-reliance on temporal features, a spatial-temporal weighted contrastive loss leverages this grounding for fine-grained temporal alignment, prioritizing hard negatives and suppressing false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate improvements of up to 30% in F1 and 50% in mAP over adapted egocentric-wearable baselines.

Abstract:
Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.

Abstract:
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S^2-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S^2-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.

Abstract:
Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency.

Abstract:
Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.

Abstract:
Conventional whole slide image (WSI) analysis pipelines follow a two-stage process. First, an image encoder, such as a vision transformer (ViT), is used to perform batched offline feature extraction on a series of tiles cropped from the WSI. Second, a multiple instance learning (MIL) model is trained with slide-level labels to obtain task-specific slide embeddings. However, several limitations exist: strong reliance on pre-trained weights of the tile encoder, the absence of receptive fields from the original image, and a lack of task-independent WSI representations. An ideal improvement would be to develop an end-to-end pre-trained WSI model, but training it from scratch will face challenges such as high training costs and computational complexity. In this work, we deconstruct the key steps of ViT-based pathology image representation and propose a conversion strategy called E2E-ViT, which transforms a vanilla ViT into an end-to-end pre-trained WSI model without introducing additional parameters. E2E-ViT directly inputs the entire tissue region in WSIs to efficiently feed image sequences into the transformer backbone, achieving information interaction from the original receptive fields and generating slide features. Through multiple survival prediction tasks, we demonstrate that transformed pre-trained ViTs outperform two-stage MIL models and slide foundation models (SFM). Our work presents a new end-to-end learning paradigm that provides a promising direction for the next generation of computational pathology models.

Abstract:
Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.

Abstract:
In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis.In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position.Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability.We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence.All resources will be publicly released upon acceptance.

Abstract:
Spatio-Temporal Video Grounding (STVG) requires models to localize objects both spatially and temporally. Despite recent progress, existing methods struggle with complex and fine-grained spatial semantics in language descriptions, leading to error propagation from temporal to spatial grounding stages. We attribute this limitation to the lack of iterative refinement between temporal and spatial predictions. To address these challenges, we propose SARL-STG, the first RL-based framework for STVG. It progressively refines spatio-temporal grounding through multi-stage training with dynamic interaction between temporal and spatial modules. Specifically, SARL-STG contains: (1) a unified architecture that seamlessly integrates a pretrained MLLM for temporal reasoning with an open-vocabulary detector for spatial localization, (2) a hierarchical RL training strategy that progresses from coarse temporal to fine-grained spatio-temporal optimization, and (3) a spatial knowledge-injected reward mechanism that uses spatial grounding quality as discriminative signals for temporal refinement. To facilitate training at scale, we also construct STVG-Wild, a large-scale dataset with diverse spatio-temporal annotations. Experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmarks (HCSTVG, VidSTG, Charades-STA, etc.), significantly reducing error accumulation and enhancing both temporal and spatial grounding accuracy.

Abstract:
Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences. Our code will be made publicly available.

Abstract:
Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance-illumination representation within the 3DGS pipeline, our method reconstructs normal-light radiance fields with fine-grained details and accurate color. Extensive experiments on both synthetic and real datasets demonstrate that eRetinexGS achieves state-of-the-art performance in low-light scene enhancement while maintaining real-time rendering capability.

Abstract:
Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus.Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches.Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.

Abstract:
Recently, multimodal large language models (MLLMs) have achieved remarkable success in general multimodal tasks. Increasing attention has been given to leveraging MLLMs for fine-grained visual understanding, such as region-level captioning and pixel-level grounding. However, most existing approaches are task-specific. While some recent unified approaches attempt to handle both types simultaneously, they still fall short of deeply exploring the underlying associations across tasks. To bridge this gap, we propose FCLM, a large multimodal model designed to jointly support fine-grained visual understanding through consistency learning. The central premise is that pixel-level captioning and grounding are mutually beneficial and complementary, each enhancing the other in achieving a fine-grained understanding of visual content. Specifically, FCLM analyzes the representation features - visual prompt and segmentation tokens - required for the two types of visual tasks, and achieves advanced reasoning and perception through a newly-designed consistency learning loss and a two-stage training framework. Moreover, a hybrid region extractor is designed to enhance visual prompt embeddings, yielding semantically discriminative representations for detailed captioning. Furthermore, a novel detailed localized referring expression segmentation (DL-RES) task is introduced to evaluate the model's ability to localize targets from detailed textual descriptions. Extensive experiments on seven visual understanding tasks demonstrate the superior performance and strong generalization of FCLM.

Abstract:
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.

Abstract:
Multimodal Large Language Models (MLLMs) perform strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs lose irrevocable structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator -- a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion -- and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05x - 20x), plus an instruct-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.

Abstract:
Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adopting natural image foundation models (e.g., DINO) provides a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets like low-contrast objects. To this end, we propose ORSATR-X, a novel RSFM that effectively integrates the generalizable representations of DINOv3 with a dedicated mechanism for exciting local contrast information. ORSATR-X comprises two core components: (1) a DINOv3 encoder, which provides rich feature representation under limited RS pre-training data, and (2) a carefully designed side network incorporating a Weber Local Adapter (WLA) and a Multi-scale Aggregation Module (MSAM). The WLA enhances discriminability of low-contrast boundaries in complex scenes through center-surround contrast and directional gradient information enhancement, while the MSAM handles inherent object scale variations in RS imagery by adaptive aggregation of features across multiple scales. Furthermore, we pre-train the side network using an efficient self-supervised distillation strategy. Extensive experiments on scene classification, object detection, and semantic segmentation demonstrate that ORSATR-X achieves state-of-the-art performance among existing RSFMs, demonstrating the effectiveness of our design.

Abstract:
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.

Abstract:
Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks---including +18.2% on All-Angles Bench and +29.0% on VSI-Bench---all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.

Abstract:
Test-time adaptation (TTA) has emerged as a promising solution to address real world domain shifts in medical image segmentation. Current approaches adapt by updating or regularizing a pre-trained source model. However, they face two major issues: (i) the source models on which they rely are prone to overfitting under domain shifts; (ii) in dynamic continual testing scenarios, error accumulation and class forgetting are further exacerbated. To overcome these limitations, we propose TanGo, a novel framework that combines Training to adapt with Foundation Guidance and continual style calibration. During training, TanGo learns generalization priors from vision foundation models (VFMs) through distribution-wise consistency learning. We incorporate stable low-frequency representations from a frozen encoder of VFMs as priors to guide the source model, constraining its output feature distribution to yield a more generalizable feature space. At test time, we propose an instance-wise style calibration method that uses a learnable data decorator to shift dynamic test images back toward an enhanced source-like distribution. Subsequently, a series of constraints is applied to draw the decorated test samples closer to the enhanced source distribution while maintaining their semantic integrity, thus boosting the continual adaptability of the source model. Extensive experiments on multiple medical image segmentation tasks demonstrate that TanGo achieves state-of-the-art performance.

Abstract:
Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization.

Abstract:
Adapting Vision-Language Models (VLMs) to few-shot action recognition (FSAR) often trades accuracy for stability: task-specific gains can trigger catastrophic forgetting of domain-general knowledge and reduce inter-class margins. In few-shot episodes, each query is contrasted with only one positive class and a few negatives, so the text encoder sees limited prompt diversity and rarely observes hard counter-examples near decision boundaries. We propose Protect-to-Adapt (P2A), a parameter-efficient fine-tuning method with two complementary modules. Orthogonal Subspace Control (OSC) estimates a principal semantic subspace of the pre-trained backbone and constrains low-rank updates to its orthogonal complement, preserving domain-general semantics while allowing task-specific adaptation. Ranked Negative-prompt Curriculum (RNC) uses a large language model to generate verifier-filtered negative prompts with increasing difficulty. These class-specific hard counter-examples enlarge margins and sharpen decision boundaries under few-shot conditions. With only 2% of backbone parameters trainable, P2A achieves state-of-the-art performance on five FSAR benchmarks and substantially reduces catastrophic forgetting in a cross-dataset continual-learning setting where the model is adapted sequentially to multiple video datasets without replay.

Abstract:
Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

Abstract:
Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a VLM-guided search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5% in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness.

Abstract:
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \underline l ayer\underline i nteraction and MoE-based \underline f eature d\underline e coupling, termed (Life-IQA). Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks.

Abstract:
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find supervised fine-tuning using a diverse culture mixing dataset substantially improve model consistency and reduce background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.

Abstract:
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose Quant Experts (QE), a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage "review" phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance.Code will be released.

Abstract:
Real-world image restoration is challenged by complex, coupled degradations. Existing "all-in-one" models often lack generalization, while agentic systems suffer from inefficient sequential tool invocation. We propose a VLM-guided one-shot framework for universal photographic post-processing. Our system employs a Vision-Language Model (VLM) as an orchestrator to perform nuanced intent understanding and degradation analysis, dynamically allocating weights to a suite of specialized expert LoRA modules. To ensure superior composability, these experts adapt only Key (K) and Value (V) matrices and are simultaneously merged into a pretrained diffusion backbone for synergistic, single-pass restoration. Furthermore, we introduce a lightweight branch trained via Direct Preference Optimization (DPO) to ensure perceptually optimal weight allocation. Our method achieves state-of-the-art performance across diverse synthetic and real-world datasets. Crucially, it demonstrates remarkable zero-shot generalization on authentic real-world data without additional fine-tuning.

Abstract:
Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.

Abstract:
In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM's general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.

Abstract:
Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to its significant perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces a implicit visual CoT: autoregressively generated tokens is simultaneously decoded into future frames predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments on simulated and real-world environments demonstrate state-of-the-art performance. VITA improves 14.5%, 9.6% and 12.1% over existing baselines on CALVIN, LIBERO and SimplerEnv. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.

Abstract:
No-reference image quality assessment (NR IQA) has recently benefited from deep and multimodal models, yet many SOTA systems still violate at least one basic requirement: they either discard critical quality cues via aggressive resizing, fail to generalize across resolutions, cannot be jointly trained on heterogeneous IQA datasets with mismatched MOS scales, or require prohibitive computation. We present ReLIQS, a model for Resolution-agnostic Learning for Image Quality with Saliency, which is resolution-agnostic, preserves original-resolution quality cues, learns from multiple subjective studies, and remains computationally efficient and budget-adaptive. ReLIQS is a CLIP-based multiscale patch-driven architecture that learns both where to look and how to judge quality. Fixed-size patches are sampled across multiple resolutions, including the original resolution, and encoded with a CLIP vision backbone. A lightweight Perceptual Importance Estimator then predicts IQA-specific importance maps to select a small set of informative patches, and a Quality Aspect Module aggregates their embeddings into a single image-level score. Across authentic, synthetic, and AIGC benchmarks spanning diverse resolutions and distortions, ReLIQS generalizes better than strong CNN-, CLIP-, and MLLM-based baselines with matching or reduced computational cost.

Abstract:
End-to-end driving frameworks, which directly map raw sensor data to vehicle control commands, have shown remarkable potential. However, their performance often deteriorates in sparse-critical scenarios, where rare but safety-sensitive events occur. To address this problem, we propose Explorer-Expert Reinforcement Learning (EE-RL), a novel end-to-end framework that integrates an RL-based explorer, a fine-tuned vision-language model (VLM)-based expert, and a dual replay buffer. EE-RL adopts a collaborative learning strategy in which an explorer and two experts jointly generate experiences from regular driving scenarios to guide policy learning. As training progresses, one of the VLMs focuses on reasoning about sparse-critical scenarios, thereby enhancing learning efficiency and policy optimization in both scenarios. Additionally, the StateHash algorithm is designed to separately measure RGB image similarity and vehicle kinematic similarity, thereby skipping unnecessary VLM reasoning and enabling denser, more effective expert experience generation. Experiments on the CARLA Leaderboard demonstrate that EE-RL significantly outperforms state-of-the-art (SOTA) baselines, achieving +19.82% and +20.98% improvements in driving and infraction scores on Town03, respectively. The EE-RL further achieves 0% accident probability for red-light running and an average driving score of 80.09 in Town05-06, demonstrating robustness, generalizability, and the capability to handle sparse-critical scenarios.

Abstract:
Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.

Abstract:
Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and real data distribution coverage. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Abstract:
Vision-Language-Action (VLA) models have emerged as a promising paradigm where pretrained Vision-Language Models (VLMs) serve as System 2 for high-level reasoning, connected to action experts as System 1 for low-level motor control.However, current works fail to genuinely leverage VLM capabilities: VLMs produce latent embeddings that lack semantic interpretability, providing ambiguous and unstable guidance to downstream policies, while solely action supervision further causes VLMs to degenerate into parameter-heavy fusion encoders that memorize action patterns rather than perform generalized reasoning.To bridge this gap, we introduce SemanticVLA, which leverages VLM reasoning through synergistic dual-path design. Explicit trace reasoning generates interpretable spatial waypoints as textual coordinate sequences through the VLM's native language interface, directly reusing its pretrained spatial grounding to provide a thinking process for task planning. Latent action tokens complement trace reasoning by learning compact visuomotor primitives grounded in visual observations, providing more fine-grained action representations beyond pure coordinate prediction. This synergy enables trace reasoning to leverage VLM's multimodal understanding for refining latent token prediction, while latent tokens provide stable and grounded guidance that compensates for trace's numerical sensitivity.SemanticVLA achieves 97.0% average success rate on LIBERO and 65.1% on SimplerEnv WidowX, substantially outperforming strong baselines. More importantly, SemanticVLA maintains significantly more stable performance under instruction rephrasing in both simulation suites, and demonstrates strong advantages on real-world long-horizon and reasoning-intensive tasks.By bridging VLM reasoning and action expert through semantically explicit trace and visually grounded latent action tokens, our approach enables genuine reasoning rather than action memorization.

Abstract:
Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising ~60% of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is both sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection.

Abstract:
Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-ST) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.

Abstract:
Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility.

Abstract:
MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.

Abstract:
3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism , while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance prior. This allows for photorealistic refinements while ensuring the dynamics remain plausible. Our framework enables scene-wide dynamic weather effects, including snowfall, rainfall, fog, and sandstorms, with physically plausible motion. Experiments demonstrate our physics-guided approach significantly outperforms baselines, with ablations confirming this joint refinement is essential for generating coherent, high-fidelity dynamics.

Abstract:
We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

Abstract:
We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.

Abstract:
Federated learning (FL) enables a central server to collaboratively train a global model with multiple clients while preserving data privacy. However, the distributed nature of FL makes the paradigm vulnerable to backdoor attacks, as proved by numerous recent studies. Although existing studies improve the effectiveness of backdoor attacks through optimized triggers, they have two limitations: (1) they ignore the heterogeneous contribution of individual model layers to the success of a backdoor; (2) they induce conspicuous differences between backdoor and clean models in the early stages of poisoning. The limitations cause backdoor models to exhibit significant discrepancies from clean models, making them easily detectable. To fill these gaps, we propose LaySelFL, a novel layer-selective method to eliminate distance differences induced by the backdoor to conceal attacks in FL. Our central insight is that different layers contribute unequally to backdoor attacks, by localizing poisoning to layers that are most sensitive to backdoor objectives, an attacker can reduce the model differences substantially between the backdoor and clean models. Concretely, LaySelFL identifies sensitive layers via both dynamic and static evaluations of parameter differences between backdoor and benign models, and then applies a targeted training protocol and a regularized loss that constrains differences from the global model in each round. Finally, LaySelFL performs clipping on non-poisoning layers to further mask residual differences introduced by the attack. This strategy yields a more covert and resilient backdoor attack. Extensive experiments show that LaySelFL increases the effectiveness of attacks by 25% and reduces the effectiveness of defense methods to 4%.

Abstract:
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.

Abstract:
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.

Abstract:
The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined.Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only ~35% of the training FLOPs, using a model with 50% less parametets, trained with ~33% of the epochs and ~15% the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across ~3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.

Abstract:
Polarization-aware Neural Radiance Fields (NeRF) enables novel view synthesis of specular scenes but suffers from slow training, inefficient rendering, and material/viewpoint assumptions. 3D Gaussian Splatting (3DGS) supports real-time rendering but struggles with reflection reconstruction due to reflection-geometry entanglement. We propose PolarGuide-GSDR, a polarization-guided framework that leverages polarization information to separate specular reflections for guided reconstruction, and builds bidirectional iterative optimization between polarization and 3DGS: it first uses 3DGS geometric priors to disambiguate polarization, then uses refined polarization normal to guide 3DGS normals and Spherical Harmonics representations, achieving high-fidelity reflection separation and full-scene reconstruction without restrictive material assumptions. Experiments on public and self-collected datasets show that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, while maintaining real-time rendering. To our knowledge, this is the first framework embedding polarization priors into 3DGS optimization, offering superior interpretability and real-time performance for complex reflective scenes.

Abstract:
Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which results in the increased false alarm problem. In this paper, through analyzing the problem from the frequency domain, we pioneer in improving performance from noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the IRSTD-1k and NUAA-SIRST datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS task.

Abstract:
Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit fully from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception due to reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P^2), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P^2 consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P^2 raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42 to 81.45 on relative depth, and achieves a 19.66 overall average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 21-25 absolute gains from P^2 across tasks when comparing to image-based tool variants, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

Abstract:
Tracking-any-point (TAP) answers query-conditioned correspondence but leaves the dense, all-pairs structure of a video implicit. We formulate All-Pairs Tracking (APT): given a video, predict dense displacement and visibility for every source-target frame pair, from which per-pixel trajectories can be read out. To this end, we propose PairFormer, a feed-forward transformer that addresses APT in a single pass. A spatio-temporal patch encoder computes temporally conditioned features for all frames. \revaddtwo CorrBank builds a learnable correlation bank for each frame pair and produces pairwise motion tokens. A broadcast motion mixer aggregates trajectory-wise context and broadcasts it back to refine the pairwise motion tokens. A trajectory head first predicts coarse dense displacement, visibility, and confidence, and then refines them iteratively to form a coherent all-pairs trajectory field. To support APT at scale, we develop PAIRender, a data platform that synthesizes photo-realistic dynamic scenes with dense annotations. From PAIRender we derive a training set (\pi-R10K) and a benchmark (APT-Bench) with an all-to-all evaluation protocol. Experiments show that PairFormer achieves strong performance on APT-Bench and competitive results on standard TAP benchmarks. Code and dataset will be released upon publication.

Abstract:
Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.

Abstract:
This paper presents an innovative approach for non-destructive 3D modeling of plant root structures, which are essential for nutrient and water uptake. While Ground Penetrating Radar (GPR) has been used for detecting subsurface objects with well-defined shapes, such as pipes, accurately reconstructing complex root structures remains a significant challenge. To address this, we propose a novel framework that leverages GPR signal shape priors for target signal detection and curve parameter regression across multiple B-scans. By integrating these detection and regression results, we obtain precise hyperbolic curves representing root structures. To further assess complete and detailed 3D root systems, we design a root shape modeling network that processes sparse 3D slices using a specialized point graph network and an upsampling module. The method can be extended to many applications, including civil engineering, geology, and environmental monitoring.

Abstract:
Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.

Abstract:
In this paper, we tackle the Aerial Vision-and-Dialog Navigation (AVDN) task in the training-free setting for resource-efficient high-altitude UAV navigation.Naively applying MLLMs leads to unreliable navigation due to weak directional grounding and the lack of explicit spatial memory.To address these issues, we propose PSC-AVDN, a training-free framework that tightly couples a three-stage Parsing-Search-Confirmation reasoning pipeline with a Structured Spatial Memory (SSM).The parsing stage uses an LLM to convert ambiguous dialogue instructions into stable geometric directional and destination cues.A Search Chain-of-Thought (S-CoT) then performs stepwise target exploration under high-altitude observations, and a Confirmation Chain-of-Thought (C-CoT) conducts fine-grained verification around candidate regions to resolve visual ambiguity.Meanwhile, SSM integrates three complementary sources of spatial cues, including multi-scale visual observation, spatial visual memory, and structured geometric memory to provide global spatial context and long-horizon consistency.Extensive experiments on ANDH and ANDH-Full show that PSC-AVDN establishes new state-of-the-art performance in the training-free setting, matching or surpassing several finetuned methods.

Abstract:
Self-attention with separate pre- and post-projections can be a universal approximator (on compact domains) under mild conditions. Yet we observe a striking gap: an attention-only Transformer (w/o FFN layers) exhibits a marked accuracy drop relative to its standard interleaved attention--FFN baseline. We term this the weak-independence challenge of attention. We study this through a new conceptual lens, Selection-as-Nonlinearity (SaN), which interprets effective nonlinearity as directed, cost-constrained selection, offering a coherent account of attention as context-gated activation. In this joint game-decision view, attention performs a resource-constrained cooperative allocation over values: each query distributes a unit-mass weight budget over shared values to optimize representational utility, under a normalizer (e.g., \mathrm softmax ), and guided by context-derived scores (e.g., q-k similarities). SaN interprets weak-independence as a structural tension: the value weights almost cannot simultaneously attain the decoupled per-query (row-wise) and the per-value (column-wise) optimums under shared budgets, thereby limiting attention's stand-alone capacity. Guided by SaN, we introduce CSaN, an interpretable, efficient attention compensation paradigm with two key insights: 1) hierarchical budget calibration, re-allocate row budgets via inter-query correction signals; and 2) public-private cooperation, enhancing the public attention pathway with a per-token private value pathway to decouple conflicting demands. CSaN is evaluated on various vision benchmarks and demonstrates remarkable gains across popular Transformer families (Swin, ViT, Hiera), enabling models to rival much heavier same-family counterparts ~2xas large.

Abstract:
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing.To address this issue, we introduce SynPS, a method that Synergistically leveragesPositional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation.By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach. Our code will be publicly released.

Abstract:
The widespread misuse of image generation technologies has raised security concerns, driving the development of AI-generated image detection methods. However, generalization has become a key challenge and open problem: existing approaches struggle to adapt to emerging generative methods and content types in real-world scenarios. To address this issue, we propose a Scene-Aware and Importance-Guided Dynamic Optimization detection framework with continual learning (SAIDO). Specifically, we design Scene-Awareness-Based Expert Module (SAEM) that dynamically identifies and incorporates new scenes using VLLMs. For each scene, independent expert modules are dynamically allocated, enabling the framework to capture scene-specific forgery features better and enhance cross-scene generalization. To achieve an effective balance between model plasticity and stability when learning from multiple image generative methods, we introduce Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes each neuron through an importance-guided gradient projection strategy, thereby achieving an effective balance between model plasticity and stability. Extensive experiments on continual learning tasks demonstrate that our method outperforms the current SOTA method in both stability and plasticity, achieving 44.22% and 40.57% relative reductions in average detection error rate and forgetting rate, respectively. On open-world datasets, it improves the average detection accuracy by 9.47% compared to the current SOTA method.

Abstract:
Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance.To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce "MeanFlow Identity", which models the mean velocity field between GMN and data distribution instead of the instantaneous velocity field used in naive flow-matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to consider all sampled proposals and adaptively decide whether to reconstruct a trajectory when none of the proposals is satisfactory. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving.

Abstract:
Capturing full-body human motion with object interactions is crucial for AR/VR and robotics applications, yet it remains challenging for conventional vision-based methods due to occlusions and constrained capture volumes. Inertial measurement units (IMUs) offer a compelling alternative without line-of-sight requirements, but existing IMU-based motion capture assumes an isolated human and ignores object contacts and dynamics. To bridge this gap, we present IMU-HOI, a novel framework that jointly recovers full-body human pose and 6-DoF object trajectory from sparse IMUs on the body and object, explicitly modeling human-object Interaction.Our approach first infers probabilistic hand-object contacts directly from IMU streams and uses them as a high-level signal to route between kinematic and inertial reasoning. These contact cues drive a three-stage fusion pipeline that refines human pose and root translation, and fuses hand-based forward kinematics with object-IMU integration for object motion, yielding coherent, drift-resilient trajectories for both human and object. Experiments on challenging human-object interaction scenarios demonstrate substantial accuracy gains over prior inertial motion capture methods. Moreover, IMU-HOI can be plugged into existing sparse-IMU mocap backbones with minimal changes, effectively extending the scope of purely inertial motion capture from isolated humans to full human-object interaction and joint motion estimation.

Abstract:
The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former dedicated to disentangle inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD. Code Link.

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions with learned primitives (attribute and object) knowledge from seen compositions. While previous approaches gain their notable performance through the powerful cross-modal alignment of CLIP, they often overlook the modality gap, an inherent constraint stemming from information-imbalanced training data. In this work, we propose SAM, a novel \underline \text S parse \underline \text A lignment and Unimoal \underline \text M emory Bank to effectively bridging modality gap for CZSL. Specifically, we conduct sparse alignment that links textual representations directly to their semantically pertinent visual patches. This direct linking serves to prune redundant visual data and counter the information imbalance in image-text pairs. Subsequently, with the sparsely aligned visual information as its guidance, the visual adaptive condensation module adaptively fuses these critical cues into a unified representation. Finally, we introduce a dynamically updated memory bank that stores samples from both seen and unseen compositions. This bank serves a dual purpose: it bypasses the modality gap through visual-only classification and concurrently strengthens generalization to unseen compositions. Experiments on three benchmarks demonstrate that our method gains significant improvements over CLIP-based methods under closed-world and open-world settings.

Abstract:
Whole slide image (WSI) region retrieval remains an open challenge in computational pathology, as existing methods struggle to represent and preserve information of all possible regions. Current approaches that rely on fixed-size patches or slide-level retrieval are misaligned with real clinical workflows, where pathologists often examine WSI regions of arbitrary orientations and sizes rather than predefined patches or slides. In this work, we redefine WSI retrieval as a semantically optimal matching problem between arbitrary regions under spatial transformations, which necessitates a region-level representation that maintains semantic consistency. To fulfill this requirement, we introduce semantic tessellation, which organizes patch units into flexible, geometry-aware region descriptors. Building on this representation, we develop the affine identifier, a semantic signature that enables rotation- and scale-consistent region matching. We further derive theoretical bounds between the tessellation-derived descriptors and the ideal pixel-level semantic mask objective, showing that they reliably approximate mask-based region similarity. Together, these components form URICA, a theoretically grounded algorithm for robust WSI region retrieval. Experiments on large public datasets demonstrate that URICA achieves strong and consistent performance across diverse WSI retrieval tasks.

Abstract:
RGB-Event object detection is able to capture clear and detailed features of the target while maintaining high-speed information collection. It is suitable for high dynamic or harsh environments and has become a research hotspot in recent years. The existing RGB-Event object detectors all struggle to fully utilize the fusion features of two modalities, but do not explicitly disentangle shared vs. private features to dedicated branches. To fully tap into the potential of single features, we propose a frequency-domain coherence-based Shared and Private Features Decoupling method for RGB-Event object detection, SPFD network. First, we design a FCFS module to separate shared and private features by exploring the spectral energy distribution differences between dual modalities. Then, we design a TriAdapt Encoder to process the shared and private features, selectively emphasizing texture-rich RGB features in static regions and motion-sensitive event features in dynamic regions, thereby achieving a robust balance between spatial detail and temporal awareness. Finally, a TriInject Decoder is proposed to emphasize the most discriminative modality features dynamically. Experimental results on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that our model achieves competitive performance with state-of-the-art methods.

Abstract:
Tagged MRI enables tracking internal tissue motion non-invasively. It encodes motion by modulating anatomy with periodic tags, which deforms along with tissue. However, the entanglement between anatomy, tags and motion poses significant challenges on post processing. Existence of tags and imaging blur hinders downstream tasks such as segmenting anatomy. Tag fading, due to T1-relaxation, disrupts brightness constancy assumption for motion tracking. For decades, these challenges are handled in isolation and sub-optimally. In contrast, we introduce a blind and nonlinear inverse framework for tagged MRI that, for the first time, unifies these tasks: anatomical image recovery, high-resolution cine image synthesis, and motion estimation. At its core, the synergy of MR physics and generative priors enables us to blindly estimate the unknown forward imaging models, high-resolution underlying anatomy, while simultaneously tracking 3D diffeomorphic Lagrangian motion over time. Experiments on tagged brain MRI demonstrate that our approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.

Abstract:
Diffusion models have achieved outstanding success in image generation, yet their objectives are often limited to reconstruction, making it difficult to align with human preferences directly. Reinforcement learning (RL) offers a promising approach to address this by optimizing models using explicit reward signals. However, most studies apply RL across the entire denoising process, which is both computationally expensive and tends to weaken preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose \ourmethod , an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, \ourmethod adaptively identifies the optimal timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare `dual benefit': a reduction in computational costs alongside a significant performance improvement. Theoretical analysis from an entropy perspective and extensive experiments verify our claims: compared with state-of-the-art methods, \ourmethod improves performance by xx% while cutting computational cost by xx%.

Abstract:
Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: Can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.

Abstract:
2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure.

Abstract:
We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process--stemming from its substantially larger capacity--poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

Abstract:
Arbitrary-Scale SR (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SSAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while SARM module restores high-frequency textures by enforcing correlation-guided consistency and preserving self-similarity structure through correlation alignment. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.

Abstract:
Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps-even if logically valid-can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

Abstract:
Visual Autoregressive Modeling (VAR) has recently emerged as a powerful paradigm for image generation that surpasses diffusion models in efficiency and quality. However, accelerating attention computation in VAR is still challenging because attention patterns across scales exhibit strong and complex semantic biases that early coarse-scale tokens dominate global structure, while fine-scale tokens mainly refine local details. Existing acceleration methods rely on heuristic token pruning or fixed attention masks, lacking a principled way to balance acceleration and semantic fidelity. In this work, we propose a two-stage acceleration framework for VAR. First, we introduce a semantic-cost-aware masking strategy (SCA-Mask) that quantifies the importance of each attention tile and formulates mask shape design as a cost-constrained optimization problem. This enables adaptive pruning under a given compute budget while preserving essential semantic context. Second, we present Post-Acceleration Adaptation (PAA), a decoder-side fine-tuning scheme that employs internal knowledge distillation to restore image quality from pruned latents. PAA does not require external data and uses a lightweight LoRA-based adaptation, providing a highly efficient alternative to retraining the autoregressive transformer. Comprehensive experiments across multiple VAR tasks demonstrate that our method achieves decent speedup with negligible loss of visual fidelity, yielding a principled and effective pathway toward fast and high-quality visual autoregressive generation.

Abstract:
The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory; with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using said aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: parthpk.github.io/vidtag webpage.

Abstract:
Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic layouts, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and motion prediction tasks. These geometry- and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

Abstract:
Multimodal Sentiment Analysis (MSA) aims to integrate textual, acoustic, and visual information to predict sentiment polarity. With the emergence of Large Language Models (LLMs), existing studies commonly employ learnable queries to compress audio-visual representations and feed them as soft prompts into LLMs for MSA. However, due to the implicit learning mechanism of the learnable queries, these learnable queries lack explicit guidance regarding how each query encodes sentiment semantics. To address this issue, we propose a prototype-as-prompt framework that maps audio-visual representations into a fixed set of multimodal sentiment prototypes. These prototypes are then used as soft prompts to guide the LLM in performing MSA. Concretely, we first compress both textual and non-textual features into multimodal prototypes using a resampling-based strategy. We further introduce a sentiment-aware prototype learning that explicitly binds multimodal prototypes with sentiment semantics. To ensure both cross-modal consistency and intra-modal diversity of multimodal sentiment prototypes, we design a cross-modal prototype alignment constraint and a distance-weighted prototype diversity constraint. Extensive experiments across three LLMs and four benchmark datasets show that PaP achieves superior performance with only 0.09%-0.26% of trainable parameters, highlighting its effectiveness and parameter efficiency.

Abstract:
In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHANDS), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHANDS captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHANDS is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.

Abstract:
Text-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose VDFE, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE) modules. Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target regions and generating results with multi-view consistency and smooth transitions; DFD innovatively achieves more precise localization of editing regions compared to cross-attention mechanisms by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE leverages precise localization to selectively update 3D Gaussians for efficient and precise refinement. Extensive experiments demonstrate that our method significantly outperforms existing methods in both qualitative and quantitative evaluations, achieving state-of-the-art (SOTA) performance.

Abstract:
Existing set-based gait recognition methods achieve remarkable performance by capturing global semantic context.However, their order-invariant nature prevents them from modeling the fine-grained kinematic patterns that unfold over time.To unify the global and process-level representations, we propose GaitMax, a framework that captures both semantic context and kinematic motion.GaitMax leverages attention-based spatiotemporal modeling to dynamically represent detailed part-level trajectories.While this detailed representation is more powerful, it also captures more nuisance factors (e.g., clothing, viewpoint), leading to potential shortcuts.To mitigate this, we introduce CDLoss, a Conditional Decorrelation Loss that explicitly disentangles the gait embeddings from nuisance factors using vision-language supervision.This loss requires high-quality nuisance descriptions. We therefore construct GCaption, a new resource that provides natural language annotations for multiple gait datasets, moving beyond simple categorical labels. GCaption not only enables CDLoss but also serves as a foundation for future context-aware gait analysis.The superiority of GaitMax is validated through extensive experiments on multiple large-scale gait benchmarks. Models, code, and resources will be released upon publication.

Abstract:
Diffusion-based text-to-image (T2I) models have enabled remarkable generative capabilities, yet precise text-based image editing that preserves the original's structural and perceptual fidelity remains non-trivial. Existing approaches either rely on retraining with large bespoke datasets, incurring significant computational and curation costs, or adopt lightweight fine-tuning strategies that still require optimization and often fail in fine-grained or semantically complex edits. We propose NEAF (Natural image Editing with Attention Fusion), a novel zero-shot, universal tuning-free framework for arbitrary T2I models, obviating the need for dataset curation or retraining. NEAF introduces a lightweight, learnable XA-Conductor module that dynamically identifies salient cross-attention contributions pertinent to the edit. This module optimizes a weight vector to orchestrate an adaptive fusion of cross-attention maps derived from the source, edited, and reconstruction branches. This triadic-feedback optimization strategy ensures the precise instantiation of user directives while rigorously preserving the fidelity of quiescent regions. Extensive experiments validate NEAF as a flexible and general framework that consistently surpasses existing methods across diverse editing tasks, demonstrating particular dominance in complex, non-rigid editing scenarios where other approaches falter.

Abstract:
Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, given the breadth of tasks, task-relevant information may lie beyond the shared space, so retaining modality-specific (unique) features can improve task performance. Prior methods retain unique information through extra training objectives or external models, increasing training complexity. Motivated by the conventional wisdom that earlier layers capture general input features while later layers become task-specific, we hypothesize that intermediate layers in CL models retain more modality-specific structure than the alignment-optimized final layer. Through a trifecta layerwise analysis of modality gap, representation similarity, and mutual information, we validate this trend and find that fusing intermediate (more unique) and final (more shared) representations yields consistent gains on diverse geospatial tasks. Our findings reveal underutilized information diversity in CL models and show that simple layerwise fusion is an efficient path to richer geo-embeddings.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission.Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are 1) sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limited scalability and generalization.To address these issues, we propose RAP -- a fast feedforward Rendering-free Attribute-guided method for efficient importance score Prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding any rendering-based or visibility-dependent computations. A compact MLP is trained to predict per-primitive importance scores using a combination of rendering loss, pruning-aware loss, and significance distribution regularization loss. After being trained on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines, providing a unified and efficient pruning solution.

Abstract:
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating long-horizon planning with precise manipulation.Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, visual prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.

Abstract:
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Abstract:
Inference-time optimization of diffusion latents enables powerful control but often degrades the statistical structure of true Gaussian noise, causing artifacts and reward hacking. To address this, we propose a Gaussianity regularizer that aligns a sample's local statistics with a typical Gaussian realization, rather than relying on pointwise likelihood. We formalize this by computing the KL divergence between the sample distribution and the Gaussian prior. To make the divergence computation tractable from a single sample, we lift each candidate latent into an empirical distribution induced by its statistics and model it as a pairwise Markov Random Field. This yields a Bethe-Kikuchi-style regularizer with 1D marginal, 2D spatial, and multi-scale terms. Our results show improved latent optimization stability and generation quality over prior approaches.

Abstract:
Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs. We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity.Extensive experiments on multiple LVLM safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Moreover, our approach demonstrates robust transferability, effectively defending against unseen and adaptive attacks.

Abstract:
Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts.To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining.Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.

Abstract:
Random frame-level data missing is a critical challenge in multimodal sentiment analysis. Existing methods are largely limited to passive completion via single-pass feedforward connections and static cross-modal fusion, which struggle to generate high-quality completed features. However, the brain is not a passive recipient of external information but a dynamic system for active perceptual inference. Its core lies in the dynamic nested recurrents formed by intra-cortical recurrent completion mechanisms and corticothalamic circuits, which iteratively perform perceptual inference. Inspired by this, we propose the Dynamic Nested Recurrent Network (DNRNet). It is the first to introduce recurrent inference into the data completion task, achieving a paradigm shift from passive completion to active perceptual inference. Its local recurrent loop simulates intra-cortical recurrent pattern completion to perform perceptual inference and generate local correction features. The global recurrent loop simulates the modulatory function of the thalamus, calculating modality confidence to dynamically weight and integrate cross-modal information, generating global correction features. The local and global correction features are fused to obtain the completion signal, which is then combined with the input features of the current iteration to serve as the input for the next iteration. Experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that DNRNet achieves an average accuracy improvement of 1.5%-2.0% over baseline models across all missing rates, validating the superiority of the brain-inspired approach in complex missing data scenarios.

Abstract:
Nuclei instance segmentation in histopathology images is essential for diagnostic accuracy and downstream computational tasks, yet this task relies heavily on expensive pixel level annotations. Although point level annotations substantially reduce the annotation burden for pathologists, many existing methods utilize only a single type of image and overlook the complementary information contained in alternative representations. To address this limitation, we propose DFGNet, a weakly supervised framework that utilizes dual-representation complementary fusion and interleaved guidance learning by jointly modeling RGB images and their corresponding Hematoxylin components. From the complementary fusion perspective, we propose a Reciprocal Cross-scale Dynamic Fusion Module (RCDF) and an Entropy Confidence Aggregation Unit (ECAU) to integrate multi-scale complementary cues and adaptively combine the outputs of the dual branches. In terms of interleaved guidance, we also propose an Interleaved point-Guided Attention (IGA) that enables bidirectional refinement between the segmentation task and the kernel prediction task. Experiments on three benchmark datasets show that DFGNet achieves state-of-the-art performance across multiple metrics and significantly outperforms existing methods. It also demonstrates strong generalization across tissue types and remains robust to annotation shifts in real-world scenarios.

Abstract:
Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a federated geometry-aware correction method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced data distributions. Extensive experiments on three benchmark datasets demonstrate that FEAT substantially achieves a 4%-8% improvement in Top-1 accuracy compared to nine state-of-the-art methods.

Abstract:
Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance for various downstream tasks. However, their fine-tuning for remote sensing (RS) tasks faces dual barriers: (1) Data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) Task-level barrier stemming from the requirement for multi-source interaction modeling capabilities. This paper proposes a Cross-modal Fusion Interactive Prompt Tuning (CF-IPT) method to fine-tune CLIP for multi-source RS image classification tasks. It aims to leverage the prompt learning framework to transfer the alignment target of the text branch shifts from natural images to multi-source RS images. Specifically, we design a Multi-source Interactive Fusion-guided Spectral-Spatial Prompt Generation (MFPG) module, which enables cross-modal feature interaction to generate a prompt matrix that preserves the original spectral and spatial information while performing adaptive multi-scale fusion to address the multi-source image adaptation problem. Subsequently, a Spectral-Spatial Prompt-guided Visual-Text Prompt Interaction (V-TPI) Strategy is proposed, which leverages spectral-spatial prompt matrices to guide visual-textual prompt interaction and inject RS-specific information into both branches of CLIP, ultimately enabling multi-source RS image-text representation alignment. The proposed approach performs the downstream task of multi-source RS image classification with merely 0.76% of CLIP's parameters. It is evaluated on several widely used datasets, demonstrating the effectiveness of the proposed approach.

Abstract:
Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between model's overall attention on image tokens and the spatial dispersiveness of model's attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy, while offering interpretable insights into how MLLMs process visual information.

Abstract:
Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging simulations of advanced perception sensor spoofing. Experiments on MOT17 and MOT20 benchmarks demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.

Abstract:
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating objects that are absent from the image. Prior work largely attributes this to insufficient visual attention. However, in this work, we are surprised to find that both real and hallucinated objects receive equally strong visual attention in the model's mid-to-late layers. This indicates that the key issue may not be how much the model attends, but what it attends to and why. To this end, we decode the visual features of high-attention regions using Logit Lens, and observe that high-attention regions corresponding to real objects can be correctly decoded to the target object token, whereas those for hallucinated objects cannot. Building on this, we identify two distinct hallucination mechanisms: (i) visual uncertainty, triggered by semantically similar or confusable regions, masking these regions eliminates the hallucination. (ii) contextual prior, triggered by strong co-occurrence priors, even when the initially attended region is masked, the hallucination persists and attention drifts to other regions. Based on these findings, we propose a simple yet effective training-free Detect-Mitigate framework comprising a Logit-Lens Consistency Check to detect hallucination and targeted remedies: High-Attention Regions Masking (HARM) for visual uncertainty hallucination, and Visual Evidence Enhanced Decoding (VEED) for contextual prior hallucination, which leverages genuine visual evidence to suppress erroneous priors. Our approach achieves state-of-the-art results on multiple hallucination benchmarks.

Abstract:
We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.

Abstract:
Understanding how the brain constructs coherent 3D visual percepts from multifaceted experiences remains a pivotal yet underexplored challenge. To investigate this, we introduce BrainSSD, a novel framework for decoding 3D representations from electroencephalography (EEG) signals. The core of BrainSSD is a neuro-inspired fusion architecture, Hierarchical Phase-Amplitude Coupling guided Fusion (HPACF), which synergistically integrates EEG from two distinct viewing paradigms: brief presentations of a static 3D object view, and sustained observation of the object undergoing full rotation. HPACF embodies two key principles of neural computation, namely hierarchical processing realized through multi-level cross-attention, and neural synchrony actualized by using a differentiable estimator of Phase-Amplitude Coupling (PAC) to dynamically guide the integration. The resulting fused representations are subsequently mapped to the visual domain via a multi-level alignment loss. Our framework establishes a new state-of-the-art across a range of EEG decoding tasks, achieving superior discriminative power and exceptional generative fidelity. Furthermore, our static-dynamic dominance analysis provides the first direct visual evidence for a functional specialization in the brain's 3D perception, revealing that neural responses to static object views primarily underpin the object's holistic structure and form, while responses to rotational observation are indispensable for resolving its fine-grained geometric details. Our work presents an advanced framework for probing EEG-based visual decoding and offers computational insights into the brain's synergistic strategies for 3D perception.

Abstract:
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.

Abstract:
Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation, that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models, that specialize in evaluating reasoning rationales without the need for explicit ground truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming \texttt 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% percent pairwise agreement on the human annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study indicates that humans prefer reasonings generated by our framework 70% of the time, in faithfullness, groundedness and usefulness compared to other models and datasets. All of our datasets, models, and codebase will be open-sourced.

Abstract:
Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, a first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis has proven the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.

Abstract:
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose ModularAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. ModularAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WMs latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLMs semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.

Abstract:
Change detection and semantic segmentation are key techniques for satellite image analysis in remote sensing. However, acquiring high-quality labeled data is costly and time-consuming. Although recent studies have explored generative models to ease data scarcity, a unified framework supporting both tasks is still lacking, and most methods overlook noise accumulation and cannot generate multispectral images. To address this, we propose the robust diffusion framework for masked image generation (RDF-MIG). RDF-MIG generates bi-temporal change-labeled and single-temporal segmentation-labeled images to enhance downstream change detection and semantic segmentation tasks. Furthermore, to address noise accumulation and improve the quality of generated image-mask pairs, we reformulate the diffusion model training objective by proposing the Maximum Correntropy Robust Diffusion (MCRD) loss, and further design an MSE-consistency calibration that analytically aligns small-error gradients with the MSE objective while preserving robustness to outliers. Experiments indicate that the proposed RDF-MIG framework can generate multispectral image-mask pairs to improve downstream performance, while MCRD loss further enhances the quality of the synthesized data.

Abstract:
Large multimodal models (LMMs) have achieved impressive performance on multimodal reasoning, becoming crucial technology for the advancement of intelligent question-answering systems. In real-world educational scenarios, effective teaching extends far beyond providing answers. Experienced teachers analyze students' incorrect answers to trace underlying errors and provide corrective feedback, termed educational diagnostic reasoning, a capability that remains under-explored in existing LMMs. To bridge this research gap, we introduce Edudiag benchmark, requiring LMMs to reconstruct erroneous reasoning chains from incorrect answers and generate corrective feedback. Through an AI-assisted annotation pipeline with rigorous human verification, we create 8K erroneous reasoning chains and corresponding feedback, spanning three representative educational domains: commonsense, science, and mathematics. Extensive evaluation across 28 leading LMMs highlights Edudiag as a challenging testbed, where even leading proprietary LMMs struggle on it and supervised fine-tuning (SFT) on open-source LMMs achieves marginal performance gains. Moreover, we conduct analysis experiments and identify three critical insights for educational diagnostic reasoning: (i) Effective error tracing remains the primary bottleneck, while SFT models still fail to reversely identify errors that commonly occur. (ii) Group relative policy optimization (GRPO) mitigates this bottleneck and boosts performance. (iii) LMMs optimized with GRPO can generate plausible yet challenging distractors for multiple-choice questions based on their self-constructed erroneous reasoning chains. We believe Edudiag provides a new direction for evaluating the advanced LMMs.

Abstract:
Recently, contrastive learning (CL) plays an important role in exploring complementary information for multi-view clustering (MVC) and has attracted increasing attention. Nevertheless, real-world multi-view data suffer from data incompleteness or noise, resulting in rare-paired samples or mis-paired samples which significantly challenge the effectiveness of CL-based MVC. That is, rare-paired issue prevents MVC from extracting sufficient multi-view complementary information, and mis-paired issue causes contrastive learning to optimize the model in the wrong direction. To address these issues, we propose a unified CL-based MVC framework for enhancing clustering effectiveness on incomplete and noise multi-view data. First, to mitigate the rare-paired issue, we design a global-graph guided contrastive learning, where all view samples construct a global-view affinity graph to form new sample pairs for fully exploring complementary information. Second, to overcome the mis-paired issue, we propose a local-graph weighted contrastive learning, which leverages local neighbors to generate pair-wise weights to adaptively strength or weaken the pair-wise contrastive learning. Our method is imputation-free and can be integrated into a unified global-local graph-guided contrastive learning framework. Extensive experiments on both incomplete and noise settings of multi-view data demonstrate that our method achieves superior performance compared with state-of-the-art approaches.

Abstract:
Scientific images often require accurate numerical representations and correct object attributes. However, current faithfulness metrics are primarily tailored toward photorealistic, real-life imagery, rendering them ill-suited for scientific image evaluation. To address this gap, we introduce a novel evaluation model, SCIEval (Scientific Image Evaluation), which aims to capture faithfulness through three key dimensions: (i) Relevance, measuring overall text-image correspondence; (ii) Accuracy, examining the technical details of scientific objects; and (iii) Explainability, which isolates unfaithful elements within the generated content. To address these dimensions, we curate a specialized dataset of scientific text-image pairs to train three evaluation modules. For the Relevance and Accuracy modules, we propose a CLIP-based strategy that enhances scientific image perception through intra- and cross-modal contrastive learning. Concurrently, the Explainability module is developed by fine-tuning a high-performance Large Multimodal Model (LMM) using supervised rationale signals. Finally, we present SCIEval-Bench, a human-annotated evaluation benchmark consisting of 3,000 samples for scientific text-to-image and 3,000 samples for scientific image captioning. Extensive experiments on SCIEval-Bench demonstrate that our SCIEvalmodel is significantly more reliable than 24 competing models--including GPT-4o--exhibiting a superior correlation with human judgments.

Abstract:
We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS).Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine.Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties.In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregate through pixel-wise regression on holdout data. We analyze combinations of uncertainty feature maps and regression models to understand how their interactions affect prediction accuracy and generalization.PRIMU also enables an effective active view selection strategy by directly leveraging these uncertainty feature maps.Additionally, we study the effect of separating splatting into foreground and background regions.Our estimates show strong correlations with true errors, outperforming state-of-the-art methods, especially for depth UE and foreground objects.Finally, our regression models show generalization capabilities to unseen scenes, enabling UE without additional holdout data.

Abstract:
In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.

Abstract:
Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical limitations of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose Subspace Projection Debiasing (SPD), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.

Abstract:
In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because it fails to account for the cross-modal semantic overlaps and gaps. To address these challenges, we introduce TriSim, a novel image-text retrieval framework that constructs a tri-dimensional negative similarity space to mitigate the influence of FNS issue. Specifically, considering that FNS appear as anomalies in this space, Extreme Value Theory (EVT) is applied to model the statistical behavior of the tail distribution for FNS selection. Two complementary tail selection strategies are developed: one identifies samples distant from the dense ellipsoidal center, and the other targets upper-right high-similarity extremes. The selected tail samples are regarded as FNS and modeled using a generalized Pareto distribution, with probabilistic weights assigned in the triplet loss. To further refine the selected FNS, intra-modal saliency differences are computed to generate masks that guide the learning of a gain matrix, which amplifies highly discriminative regions and suppresses ambiguous ones. Extensive experiments on two benchmarks demonstrate the superiority of the proposed TriSim framework in mitigating the influence of false negatives in RS image-text retrieval.

Abstract:
With the rapid development of Vision-Language Models (VLMs), there is a growing demand for automatic analysis of structured visual data. Charts and tables carry quantitative information through regular layouts, explicit numbers, and chart-specific reading patterns, yet current VLMs still underuse these properties, often causing value errors and unreliable analysis. To overcome these limitations, we propose Twin-T, a two-stage expert VLM for comprehensive chart-table tasks across Image, LaTeX, and Python. In stage 1, we propose a dual-head image encoder that can separate structural cues and fine details from input images. In stage 2, we propose MINT, a preference learning method that emphasizes numerical and keyword fidelity, as well as vision-text matching. Furthermore, we introduce a comprehensive TwintVQA benchmark with 17 chart types, 11 task types, 3 data formats, and short / medium / long QA settings. Our model narrows the gap between open-source and closed-source models on mainstream chart-table benchmarks, outperforming open-source models while even remaining competitive with GPT-4o and Gemini-2.5-Pro.

Abstract:
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification instead of a genuine progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph , the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., the continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors/segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, e.g., an average improvement of about 3 points over the SOTA USG-Par.

Abstract:
Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face-context attribution with equal-area perturbations; and (ii) propose a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection ISP. Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.

Abstract:
Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5xwhile retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling.

Abstract:
Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.

Abstract:
Weakly-supervised Video Anomaly Detection (wVAD) aims to detect abnormal events using only binary labels, making it challenging to capture both the diversity of anomalies and their shared semantic cues. Existing methods either focus on a generic anomaly pattern, achieving strong generalization but weak discrimination, or rely on class-level diversity modeling, which ignores shared semantics and suffers from limited generalization. To overcome these limitations, we propose the Mixture of Memory Experts (MoME), a unified framework that jointly learns general and diverse patterns. Each expert in MoME possesses an internal memory for fine-grained specialization and shares an external memory for general knowledge aggregation. To enhance semantic diversity and improve generalization beyond coarse class-level supervision, we introduce an Anomaly Prototype Router that leverages large language models to construct generalized anomaly prototypes for semantically guided expert routing. Moreover, balanced routing, expert diversity, and enhanced pattern discriminability are ensured by the regularization loss for APR, the distinctiveness loss for experts, and reconstruction together with memory tasks, respectively. Extensive experiments on UCF-Crime and XD-Violence demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of jointly modeling generality and diversity for robust anomaly detection.

Abstract:
Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land use and map, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce GROC, the first large-scale dataset Geo-guided Reasoning in Object Counting under adverse earth observation conditions. GROC contains 1.2 million point annotations over 14K images, each aligned with 3 modalities that preserve original geospatial information. We also provide a data engine to collect a large-scale object counting dataset with multiple geo-modalities, realistic degradations, and reliable annotations. We further present an counting agent that adaptively leverages geo-modalities to produce reliable estimates. Extensive experiments show that existing models struggle to "see" through adverse conditions, whereas geo-modalities improve robustness. GROC establishes the first benchmark that explicitly challenges models to see what they cannot see, charting a new direction for geo-guided amodal reasoning in earth observation.

Abstract:
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is zhaoc5.github.io/DyMoE.

Abstract:
Automatic X-ray prohibited items detection is vital for security inspection and has been widely studied. Traditional methods rely on visual modal, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a dual-view caption corpus consisting of 45,613 dual-view images across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a "language-like modality". To enale this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: top , side, conclusion. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new sight for real-world X-ray inspection.

Abstract:
Humans inhabit a physical 4D world, where spatial geometry and semantic content evolve over time, forming a dynamic reality. While current Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in understanding static visual inputs, it remains unclear whether they can effectively "think in dynamics," i.e., perceive, track, and reason about spatio-temporal evolution in complex scenes.To systematically evaluate these abilities, we introduce \texttt Dyn-Bench , a large-scale benchmark designed to assess spatio-temporal reasoning and localized dynamics perception. Constructed through multi-stage filtering over massive 2D and 4D data sources, \texttt Dyn-Bench provides a high-quality collection of diverse dynamic scenes, consisting of 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding samples.We comprehensively study general-purpose, spatial-aware, and region-level MLLMs to understand how they "think in dynamics" from both linguistic and visual perspectives. Our results reveal that existing models struggle to jointly excel in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Conventional prompting strategies i.e., chain-of-thought or caption-based hints) provide only limited improvements.In contrast, structured integration approaches, including Mask-Guided Fusion and the Spatio-Temporal Textual Cognitive Map (ST-TCM), substantially enhance MLLMs' dynamic perception and spatio-temporal reasoning in an evolving 4D world. These findings underscore the importance of explicit spatio-temporal structural cues to bridge the gap between static perception and dynamic reasoning in MLLMs.

Abstract:
Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, and adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.

Abstract:
Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T)--collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm.All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models are available at Github.

Abstract:
Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.

Abstract:
To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.

Abstract:
To ensure the robustness of 3D point cloud Deep Neural Network(3D DNN), 3D adversarial attack targeting the inference stage and backdoor attack targeting the training stage are well studied. The success of both attacks usually requires a specified permissions that attacker must have. However, the obtainable permissions are uncertain due to the deployment environment changes in practical scenarios. This renders existing separately designed adversarial attack or backdoor attack ineffective. To solve this issue, this paper proposes a unified attack that can adapt to both 3D point cloud backdoor attack and adversarial attack, named UAtt3D. Furthermore, by observing existing attacks, their way to promise attack stealthiness is to limit the undesirable perturbation. This strategy requires moving the point position as little as possible, which restricts the attack intensity and is not suitable for our unified attack. Meanwhile, this strategy will inevitably cause a quality decrease on 3D point cloud due to the remaining malicious perturbation. Therefore, our UAtt3D explores a new avenue to guarantee attack stealthiness which improves the quality of attacked 3D point cloud rather than decreasing it. In detail, to simultaneously consider feature movement of adversarial attack and backdoor feature learning of backdoor attack, a flexible isotropic resampling is designed. It realigns the position of most points based on surface approximation and rays sampling. By fine tuning the resampled point cloud, adversarial point cloud and backdoored point cloud are obtained. Several experiments suggest that the proposed UAtt3D achieves outstanding stealthiness comparing with existing adversarial attacks and backdoor attacks from the subjective and objective perspective. Meanwhile, its attack efficiency is competitive.

Abstract:
Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy--making fast predictions from a single baseline sMRI--and accuracy--leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven "linguistic compass" is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5-12%.

Abstract:
While 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object--a 5-20x speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchmark, culminating in 153,000 aligned meshes--a >40x increase in instances per category over previous aligned real-world datasets. Extensive evaluation demonstrates competitive zero-shot sim2real transfer on the NOCS 6D pose benchmark and superior robotic grasping performance in both simulation and real-world zero-shot transfer, where aligned meshes prove essential for success. We release the largest publicly available aligned 3D mesh dataset, largest category-level 6D pose dataset, grasping simulation environments, and open-source pipeline, providing a critical step toward foundation models for 3D understanding and enabling efficient, unlimited generation of task-specific 3D data from scratch. The code and datasets can be found at https://genomni3d.github.io/