TCSVT2026

Abstract:
Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Guided by these insights, we propose the Encoder-Only Image Registration (EOIR) framework, which comprises five modifications to existing approaches, to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on six datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is available on GitHub.
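
For reference, the classical Horn-Schunck formulation can be written as below. This is the standard textbook form, not necessarily the exact variant used in the paper's analysis; the symbols (image intensity $I$, flow components $u, v$, smoothness weight $\alpha$) are the conventional ones and do not appear in the abstract.

```latex
% Brightness-constancy (optical flow) constraint, linearized in the flow (u, v):
I_x u + I_y v + I_t = 0

% Horn-Schunck energy: data term plus a smoothness penalty weighted by \alpha^2:
E(u, v) = \iint \Big[ (I_x u + I_y v + I_t)^2
        + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big] \, dx \, dy
```

The constraint is linear in $(u, v)$ only to the extent that intensities vary linearly over the displacement, which is one way to read the "linearizing local intensities" role mentioned above.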

Abstract:
Knowledge distillation (KD) has emerged as a powerful technique for transferring knowledge from large, complex teacher models to smaller, more efficient student models. However, current KD methods primarily concentrate on mimicking instance-level predictions or feature representations, often overlooking the crucial role of class-level semantic structure in guiding effective knowledge transfer. This paper introduces Prototypical Decoupled Knowledge Distillation (PDKD), a novel framework designed to bridge the gap between instance-specific and class-discriminative knowledge by leveraging class prototypes. PDKD incorporates a prototype-aware supervision module that distills global class characteristics by aligning student predictions with both instance-level and prototype-based outputs from the teacher. This module dynamically harmonizes the logit scales of these two targets, effectively addressing the model size mismatches between teacher and student. Furthermore, a feature discrepancy alignment module is proposed to enforce consistency between the teacher and student in how they modulate features between the learned prototypes and individual samples. This alignment preserves the structural relationships between classes. By effectively unifying hierarchical class semantics with instance-level learning, PDKD establishes a new paradigm for training compact yet highly discriminative models. Extensive experiments on CIFAR-100 and ImageNet showcase the superior performance of PDKD compared to existing state-of-the-art methods.

Abstract:
Although great progress has been made in Camouflaged Object Detection (COD), it still faces challenges in complex real-world scenes. Existing methods are primarily designed for visible images but face limitations when detecting highly camouflaged or partially occluded objects. Integrating multiple complementary information sources, such as visible images and infrared images, is an effective way to improve the performance of COD. However, research in this field is limited by the lack of comprehensive and high-quality benchmark datasets. To solve this problem, we construct a Visible-Infrared Artificial Camouflage (VIAC) dataset. Building on this dataset, we propose a novel Visible-Infrared Camouflaged Object Detection (VICOD) framework, termed the Confidence-Guided Fusion and Inpainting Network (CGFINet). The network utilizes a cross-modal collaborative fusion module (CMCF) to achieve adaptive integration of visible and infrared information. Simultaneously, the segmentation boundaries of low-confidence regions are refined by leveraging high-confidence pixel information within the confidence-driven inpainting module (CDIM). In addition, pixel-level uncertainty is incorporated into the loss function as a dynamic weight factor, which prompts the model to focus on high-uncertainty areas. Extensive experiments on VIAC demonstrate that our method achieves state-of-the-art performance, surpassing existing COD and visible-infrared SOD approaches.

Abstract:
Visual odometry (VO) is a critical component of autonomous robot systems, enabling precise pose estimation from visual inputs. Learning-based VO methods are increasingly recognized for their robustness in challenging scenarios, including dynamic environments, motion blur, and low-light conditions. However, their performance is constrained by both the diversity of the data and its utilization rate. To overcome these limitations, we propose an end-to-end monocular VO system incorporating a novel learning-based end-to-end VO framework and multiple analogy augmentation strategies. We introduce the Context Attention Uncertainty-aware VO Network (CUVO), which prioritizes semantically rich regions and mitigates interference from high-uncertainty areas to enhance attentional focus and pose estimation accuracy. Furthermore, our analogy augmentation methods (temporal reversal, random rotation, and geometric mirroring) enhance image pairs and compute the corresponding true pose transformations, significantly increasing training data quantity and diversity. Simultaneously, an analogous loss is applied to ensure consistency between the original and augmented data. Extensive experiments demonstrate that CUVO significantly enhances VO performance, outperforming previous end-to-end VO methods on the TartanAir and KITTI datasets. By leveraging the analogy augmentation strategy to expand training data under limited data conditions (27k), the zero-shot capability of CUVO improves by up to 29.5% on TartanAir and 23.3% on KITTI. Our work introduces the first image-to-pose data augmentation method tailored for VO and establishes CUVO as a robust system for advancing learning-based visual odometry.
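
The idea that an augmented image pair needs a correspondingly transformed ground-truth pose can be illustrated with plain rigid-body geometry. The sketch below is not the paper's implementation: the function names, the `(R, t)` representation of the relative pose, and the assumption that mirroring is a horizontal flip about the camera x-axis are all illustrative choices made here.

```python
import math

def invert_pose(R, t):
    """Temporal reversal: swapping the two frames inverts the relative pose.

    For x -> R x + t, the inverse is x -> R^T x - R^T t.
    """
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return Rt, t_inv

def mirror_pose(R, t):
    """Horizontal mirroring (assumed): conjugate the pose by S = diag(-1, 1, 1).

    R' = S R S stays a proper rotation; t' = S t negates the x translation.
    """
    s = [-1.0, 1.0, 1.0]
    R_m = [[s[i] * R[i][j] * s[j] for j in range(3)] for i in range(3)]
    t_m = [s[i] * t[i] for i in range(3)]
    return R_m, t_m

def yaw(theta):
    """Rotation by theta radians about the y-axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

Under these assumptions, mirroring a pair whose relative motion is `yaw(0.1)` yields `yaw(-0.1)` with the x translation negated, so a left turn becomes a right turn, while temporal reversal yields the exact inverse motion.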

Abstract:
Traditional multi-modal hashing methods map instances into hash codes for multi-modal retrieval tasks, achieving low storage costs and fast retrieval speed. In reality, it is common to encounter missing modality scenarios, i.e., instances that should originally contain all modalities may lack certain modal data points. Faced with this issue, existing methods typically employ an instance-level completion strategy. This strategy selects similar integrated instances based on coarse-grained label similarity and then fuses the corresponding modal data points of the selected instances to complete the missing data. However, this strategy typically involves noisy feature representation due to irrelevant label information. For example, completing a “sunset” labeled instance with an integrated instance labeled by “sunset” and “structure” will include information about “structure”. To address this issue, we propose a novel Fine-Grained Feature-Driven Incomplete Multi-modal Hashing (FDIMH) to directly model the relationship between labels and features rather than adopting instances as a bridge to complete the missing data without injecting noisy information. Specifically, FDIMH initializes a high-dimensional memory bank where each unit denotes a fine-grained feature. During the training process, all the fine-grained features in this memory bank are first pre-optimized to fit the complete instances through an adaptive weight allocation mechanism. Subsequently, the learned memory bank and the adaptive weight allocation mechanism are optimized together with the hashing network through our proposed intra-modality and inter-modality loss functions to bridge the gap between the completed data and real data. Extensive experiments on three widely used datasets demonstrate the superiority of the proposed method compared to the state-of-the-art baselines in multi-modal retrieval tasks. The code is available at https://github.com/LiuJinyu1229/FDIMH

Abstract:
Group activity recognition (GAR) plays a crucial role in computer vision, enabling the exploration and comprehension of human behavior patterns. Existing methods mainly focus on dyad-level interactions within a group, but sociological studies have highlighted the importance of individual features, subgroup-level interactions, and overall group structure for understanding group activities. Therefore, we propose a new framework, the progressive group activity reasoning model (PGAR), which models these four aspects for GAR. Initially, we construct a person-person graph (PPG) using individual features to capture dyadic interactions. Subsequently, the PPG is fed into a novel ingredient graph model (Ingredient-GNN) for capturing subgroup-level interactions. Finally, we fuse the dyad-level and subgroup-level interactions with global information of group structure, obtained through an F-Formation modeling module, to form comprehensive representations for GAR. The F-Formation modeling module decouples the group structure into position, orientation, and skeleton graphs, and subsequently performs attribute recoupling at the individual level using the designed Tri-Coupling Transformer to form a global representation of the group structure. Extensive experiments on four public datasets demonstrate that our final model effectively integrates multi-level representations for group activity understanding, with our F-Formation modeling module outperforming comparable methods that rely solely on non-visual data.

Abstract:
The decomposition-based text-guided video editing paradigm uses a layered neural atlas model to decompose the input video into foreground and background parts and edits the video in a divide-and-conquer manner, which improves the controllability of editing. However, such methods suffer from several limitations: 1) the high computational cost of per-video training (i.e., 7–8 hours for training a single atlas model), 2) foreground object deformation that is restricted by the foreground opacity value, and 3) restricted flexibility in manipulating multiple objects. In this paper, we propose TraFrCo, a Training-Free Controllable Text-guided Video Editing framework to mitigate these challenges. Instead of training complex atlas models, our method leverages pre-trained segmentation to rapidly decompose videos into foreground and background parts. This allows users to perform independent edits on foreground objects using existing video diffusion editing models without affecting the environment. To ensure visual consistency, we introduce a training-free mechanism that effectively propagates information across frames to fill missing background regions caused by the segmentation-derived foreground masks and reconstructs the scene behind moving objects. Finally, the edited components are seamlessly composited by re-predicting the new foreground masks. In contrast to prior works, TraFrCo enables efficient, fine-grained manipulation of video content without the burden of training. Experimental results verify that TraFrCo consistently reduces the cost of decomposing video and achieves superior text-guided video editing performance. Codes and video demos will be released at https://github.com/mdswyz/TraFrCo

Abstract:
Gradient sparsification (GS) is an effective method for reducing communication overhead in distributed training. For the first time, we introduce the concept of multi-dimensional information into GS and propose a new gradient sparsification method named Multi-dimensional information-aware Gradient sparsification (MultiGS), which achieves a high compression ratio with negligible accuracy loss and is applicable to mainstream network architectures. MultiGS reconstructs the layer-wise gradient by combining the high-frequency components of the local gradient and the low-frequency components of the sparsified global gradient, which effectively addresses the issue of stale gradients and alleviates model bifurcation. Through a convergence proof of MultiGS for smooth non-convex problems and a comparison with momentum SGD in convergence speed, we show that this new perspective is theoretically reasonable and practically effective. As validated with several mainstream model families (i.e., ResNets, VGGNet, LSTM, Vision Transformer, and Large Language Models), MultiGS achieves better accuracy than previous GS methods. Moreover, empirical results show that when a sufficient number of training nodes are available, MultiGS accelerates distributed training by more than 3×, which is better than existing sparsification methods.

Abstract:
The significant effectiveness of prompt tuning for computer vision tasks has been extensively demonstrated in numerous studies. As a widely feasible solution, the spatial modeling paradigm aims to overcome the limitations of sequence modeling paradigm in capturing spatial relationships within images by learning a prompt token map and aligning it spatially with the image token map. However, such spatial modeling paradigms of visual prompt tuning still face two potential challenges: 1) Most existing methods fail to design individual prompts for different images, and the learned prompts have the same static effect on all images. 2) The strategy of existing methods overlooks the selection of key spatial information and indiscriminately prompts all information within the image. In this work, we propose a novel Dynamically-Selected and Spatial Visual Prompting, termed as DS2VP, which aims to effectively utilize the key spatial information of the input image and enable dynamic visual prompt selection. Specifically, our DS2VP approach is meticulously designed to leverage the key index generator to filter key regions of the image for determining the spatial target of prompts, thus enabling dynamic selection of prompts for different images. By adding prompt tokens at selected key locations, an image prompt fusion module is deployed by adapting the learnable prompt tokens into the input image tokens, further achieving a fine-grained spatial alignment. Moreover, we propose a multi-level prompt interaction module that facilitates interactions between visual prompts at different levels to enhance feature representations across various semantic levels. Extensive experiments on two challenging benchmarks for image classification have demonstrated the superiority of DS2VP over other state-of-the-art methods for visual prompt tuning.

Abstract:
Existing transferable attack methods commonly assume that the attacker knows the training set (e.g., the label set, the input size) of the opaque-box victim models, which is usually unrealistic because in some cases the attacker cannot know this information. In this paper, we define a Generalized Transferable Attack (GTA) problem where the attacker operates without prior knowledge of these specifics and must attack randomly encountered images, potentially from unknown datasets. To solve the challenging GTA problem, we propose a novel Image Classification Disruptor (ICD), designed to train a particular attack to disrupt the classification information of any images from arbitrary datasets. Experiments across several datasets demonstrate that ICD clearly outperforms existing transferable attacks on GTA, and show that ICD uses similar texture-like noises to perturb different images from different datasets. Moreover, we observe that ICD noise across images mainly consists of three specific-frequency sine waves for the R, G, and B channels. Inspired by this finding, we also design another novel Sine Attack (SA) method that directly optimizes the three sine waves. Experiments show that SA performs comparably to ICD, revealing a notable vulnerability in CNNs under the GTA setting.

Abstract:
The Video Object Segmentation (VOS) task aims to segment objects in videos. However, previous settings either require time-consuming manual masks of target objects at the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video according to a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOS-P with point annotations to support this task. ClickVOS has significant practical and research value because indicating an object takes only 1-2 seconds of interaction, whereas annotating the mask of an object takes several minutes. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach called Attention Before Segmentation (ABS), motivated by the attention process of humans. ABS utilizes the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS it can self-heal as the video goes on instead of deteriorating due to error accumulation, which is attributed to our designed improvement memory that continuously records stable global object memory and updates detailed dense memory. In addition, we conduct various baseline explorations utilizing off-the-shelf algorithms from related fields, which could provide insights for the further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and codes will be available at https://github.com/PinxueGuo/ClickVOS

Abstract:
Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic text. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI. Additionally, even when using a uniform mask as input, our method still outperforms the anchor methods in image reconstruction and machine vision tasks (such as object detection and instance segmentation).
Our source code will be available at: https://github.com/hccavgcyv/Customizable-ROI-Based-Deep-Image-Compression

Abstract:
Hand avatars play a pivotal role in a wide array of digital interfaces. Fine-detailed and realistic hand representations enhance user immersion and facilitate natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstructing the hand geometry with fine details, which is essential to rendering quality. In the realms of extended reality and gaming, on-the-fly rendering becomes imperative. To this end, we introduce an expressive hand avatar, named XHand, that is designed to comprehensively generate hand shape, appearance, and deformations in real-time. To obtain fine-grained hand meshes, we make use of three feature embedding modules to predict hand deformation displacements, albedo, and linear blend skinning weights, respectively. To achieve photo-realistic hand rendering on fine-grained meshes, our method employs a mesh-based neural renderer by leveraging mesh topological consistency and latent codes from the embedding modules. During training, a part-aware Laplace smoothing strategy is proposed that incorporates distinct levels of regularization to effectively maintain the necessary details and eliminate undesired artifacts. The experimental evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which is able to recover high-fidelity geometry and texture for hand animations across diverse poses in real-time. To reproduce our results, we will make the full implementation publicly available at https://github.com/agnJason/XHand

Abstract:
Multi-view clustering (MVC) has garnered significant attention in recent years due to its ability to leverage shared information across heterogeneous data sources. However, most existing methods focus on improving clustering performance while often neglecting potential conflicts between views and the uncertainty in view distributions. To address this limitation, we propose a novel framework named dual-reliable contrastive fusion multi-view clustering (DRCFMVC). This framework organically integrates uncertainty and conflict within a multi-view contrastive clustering paradigm for the first time. Through an innovative dual reliability weighting mechanism, uncertainty and conflict are systematically incorporated into contrastive learning. Specifically, high-dimensional features are first mapped to cluster distributions via a clustering network. Subsequently, Dempster–Shafer Theory (DST) estimates prediction uncertainty within each view, while Jensen–Shannon (JS) divergence quantifies conflict levels between views. These two complementary types of information are further integrated through the proposed dual-reliable weighting (DRW) fusion strategy, effectively guiding the contrastive optimization process to learn more stable and discriminative representations. Extensive experiments on eleven public benchmark datasets demonstrate that the proposed method achieves superior performance across multiple clustering evaluation metrics compared to existing state-of-the-art approaches. The source code can be made accessible at https://github.com/li-zi-qi/DRCFMVC.
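
The two quantities named above can be sketched in a few lines. This is only an illustration under stated assumptions: the Jensen–Shannon divergence is the standard definition (base-2 logarithm, so values lie in [0, 1]), while the uncertainty uses one common evidential formulation, u = K / S with S = sum(e_k + 1). The abstract does not say how evidence is derived from the cluster distributions or how the two terms are fused, so neither is shown, and the function names are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ds_uncertainty(evidence):
    """Uncertainty mass u = K / S, where S = sum(e_k + 1) over K classes.

    Assumes non-negative evidence values, one per cluster.
    """
    k = len(evidence)
    s = sum(evidence) + k
    return k / s
```

With these definitions, two views that assign a sample to different clusters with full confidence have divergence 1, and a view with no evidence for any cluster has uncertainty 1.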

Affiliations: School of Artificial Intelligence, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Anhui University, Hefei, China; School of Computer Science and Technology, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Anhui University, Hefei, China; School of Artificial Intelligence, the State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Anhui University, Hefei, China

Abstract:
Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches typically employ general pre-trained weights to initialize backbone networks, followed by task-specific fine-tuning. However, these models lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model’s capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2. 
The source code, dataset, and pre-trained large models are available on https://github.com/Vehicle-AHU/VehicleMAE

Abstract:
In steganographic systems, the quality of cover image selection directly affects the difficulty of steganalysis detection. High-quality covers can effectively conceal embedded traces and reduce the statistical differences between stego images and natural images. As one of the most widely used image formats, JPEG offers efficient compression and possesses unique characteristics derived from the Discrete Cosine Transform (DCT) and quantization processes, making it an ideal choice for steganographic covers. However, most existing cover selection methods are not specifically designed for the characteristics of JPEG images. To address this issue, this paper proposes a multi-dimensional cover selection method for JPEG steganography, which quantitatively evaluates the steganographic suitability of covers through the collaborative assessment of three core dimensions: blocking artifacts, embedding distortion, and content complexity. Specifically, blocking artifacts (B_i) are quantified by the calculation of boundary pixel differences and intra-block smoothness; embedding distortion (D_i) enables low-distortion embedding based on the protection of wavelet-domain statistical features and the optimization of block energy consistency; content complexity (G_i) assesses texture richness through the normalized count of non-zero DCT coefficients. Finally, the optimal covers are selected using the comprehensive scoring formula S_i = B_i \cdot D_i \cdot (1 - G_i). Experimental results demonstrate that within the embedding rate range of 0.1 to 0.5 bpnzac, its undetectability outperforms that of four existing steganographic algorithms. In particular, it achieves near-random detection performance (P_E \approx 0.5) at 0.1 bpnzac, surpassing the second-best method by 0.38%–1.46%, while exhibiting a more significant advantage of 0.46%–14.33% within the 0.2–0.5 bpnzac range.
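
The scoring formula above is simple enough to sketch directly. The snippet assumes the three metrics have already been computed and normalized to [0, 1]; how B_i, D_i, and G_i are obtained is not reproduced here. The abstract does not state whether larger or smaller S_i is preferred, so the ranking direction is left as an explicit, hypothetical parameter, and the function names are illustrative only.

```python
def suitability_score(b, d, g):
    """S_i = B_i * D_i * (1 - G_i), the formula stated in the abstract."""
    return b * d * (1.0 - g)

def rank_covers(metrics, higher_is_better=True):
    """Order cover ids by score.

    metrics: dict mapping cover id -> (B_i, D_i, G_i).
    higher_is_better: assumption made for this sketch; flip it if the
    method selects the lowest-scoring covers instead.
    """
    scores = {cid: suitability_score(*m) for cid, m in metrics.items()}
    return sorted(scores, key=scores.get, reverse=higher_is_better)
```

Because the score is a product, a cover that does poorly on any one factor is pushed toward one end of the ranking regardless of the other two.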

Abstract:
To address the growing demand for digital image security, this paper introduces AMG, a U-Net-based robust watermarking framework designed for screen-shooting resilience and watermark imperceptibility. The framework fuses channel and spatial attention to adaptively enhance key region features during encoding and decoding, achieving precise watermark embedding with visual fidelity. Guided by a Triple Mask of depth, edge, and gradient components, a specialized loss function strategically embeds the watermark into structurally-rich yet visually non-salient regions. Experiments show AMG surpasses state-of-the-art methods in extraction accuracy, robustness, and imperceptibility, offering a systematic solution to balance watermark security and visual quality.

Abstract:
Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook 1) the impact of reconstruction, a widely used method to calibrate quantization parameters, and 2) the varying quantization sensitivities across different layers, which hinders their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, 1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem, and 2) we discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware static transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs’ weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods. Codes are available at https://github.com/6xy-liu/Taq-DiT.git

Abstract:
Different from natural videos, screen content videos (SCVs) often exhibit homogeneous regions, abrupt content changes, and a high prevalence of repetitive patterns. Existing deep learning (DL)-based video compression methods inadequately address the unique characteristics of SCVs, resulting in suboptimal compression performance. Therefore, in this paper, a dedicated deep screen content video compression (DSCVC) framework is proposed based on the motion and content characteristics of SCVs, which includes a superpixel-constrained motion estimation (SCME) module and an inter and intra context aggregation (I2CA) module. The SCME is designed to construct a superpixel-based representation of homogeneous regions, leveraging the global correlations among superpixels to effectively capture large-scale motions, which improves the compression performance. I2CA is developed to jointly utilize inter and intra contexts, employing a gating mechanism for content-aware context fusion that dynamically aggregates more similar contexts within SCVs. This allows for flexible adaptation to both contiguous and abrupt content changes within SCVs. Furthermore, by leveraging both learnable window and pixel displacements, a displacement-guided window attention mechanism is implemented in I2CA for precise long-range repetitive feature localization, thereby reducing redundancy caused by repetitive patterns. To the best of our knowledge, this is the first DL-based video compression framework specifically designed for SCVs. Extensive experimental results demonstrate that the proposed DSCVC significantly outperforms existing methods in terms of compression performance, achieving a bitrate saving of 26.82% compared to VVC and a bitrate saving of 12.30% compared to SOTA DL-based methods.

Abstract:
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model’s visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model’s object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model’s performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy.

Abstract:
Underwater applications such as exploration and salvage operations require capturing underwater images (UWIs) to evaluate attributes such as the shape and structural integrity of submerged targets. However, underwater image transmission faces significant challenges due to the limited wireless acoustic channel available in underwater communication systems. Existing image compression algorithms struggle with limited compression ratios, which leads to a loss of crucial structural information and poor reconstruction quality, making them unsuitable for practical underwater applications. To overcome these limitations, we propose a sparse Sketch-based Extreme Underwater Compression framework (SEUCN), which mainly includes two sub-networks: the Sparse Sketch Generation Network (SSGN) and the Underwater Prior-guided Reconstruction Network (UPRN). To reduce redundancy and ensure effective compression at extremely low bitrates, the SSGN is designed to generate a compression-friendly sparse structural sketch in two ways. First, it focuses on extracting important structural information to support analysis tasks within the constraints of limited bitrates. Second, it incorporates an underwater imaging model to focus on learning critical texture information for visual reconstruction. To restore the information lost during compression and achieve high-quality reconstruction, the UPRN is designed to enhance structural details, restore underwater style, and enrich texture information during the reconstruction of UWIs from the decoded sketches, by effectively integrating multiple sources of prior knowledge. Specifically, considering the high similarity of semantics and texture across different UWIs with common targets, the Dictionary-guided Texture Recovery Module (DTRM) leverages a universal underwater multi-scale feature dictionary as texture prior knowledge to supplement missing texture details. Extensive experiments show that our SEUCN demonstrates outstanding performance in retaining significant structural information to assist practical underwater tasks, and achieves superior visual quality compared to existing methods.

Abstract:
The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., Bird's Eye View) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which creates a hard barrier for deployment, especially for on-device applications. In this paper, in order to tackle the challenge of efficient 3D object detection from an industry perspective, we devise a deployment-friendly pillar-based 3D detector, termed FastPillars. Specifically, to compensate for the geometric information loss of pillar encoding, we first design a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module, specially for enhancing small objects. Second, we propose a simple yet effective backbone design for pillar-based 3D detection, enhancing pillar representations. We construct FastPillars based on these designs, achieving high performance and low latency without SPConv. Extensive experiments on two large-scale datasets demonstrate the effectiveness and efficiency of FastPillars for on-device 3D detection regarding both performance and speed. Specifically, FastPillars delivers real-time state-of-the-art accuracy on the Waymo Open Dataset with a 1.8× speed-up and a 3.8 mAPH/L2 improvement over CenterPoint (SPConv-based). Code will be released soon at: https://github.com/StiphyJay/FastPillars

Abstract:
Emphasis on modeling visual context underpins the current high-performance dense prediction models, including those for change detection. However, excessive context modeling tends to cause ambiguous feature representations around object boundaries and possibly overwhelms small or thin objects. This encourages maintaining large feature maps to provide sufficient information for an objective rescue. With the goal of efficiently harvesting global context from large feature maps while getting good sensitivity to change boundaries as well as small or thin changes, we propose a change detection model that features (i) an encoder-decoder architecture with state space model-based feature refinement, and (ii) boundary-specific supervision. Our encoder-decoder is equipped with novel modulated Mamba blocks capable of preserving local correlation and then achieving local-global context mixing. To bridge the gap between local and global information during modulation, we employ a spectral transform on the local features to holistically enhance the information encoded in key frequency components. Moreover, our custom-designed boundary-specific supervision explicitly induces the revision of change boundaries. With these improvements, our Mamba-grounded change detection model can efficiently garner boundary-sensitive large feature maps applicable to various change shapes and scales. Extensive experimental results on four public change detection datasets demonstrate that our method consistently outperforms state-of-the-art competitors in terms of key evaluation metrics. Our source code is available at https://github.com/xingronaldo/BSSMamba

Abstract:
Co-clustering has become a research hotspot, which focuses on analyzing block structure and decoupling the duality between samples and features, thereby providing a concise approach toward graph-free clustering. Despite this, label extraction still relies on post-processing, and the synergy between the independent processes is left unevaluated. In addition, when extending it to multiview learning, the redundancy (in high-dimensional features) and heterogeneity (under different views) of features can make it difficult to mine distinct and consensus block structure. In view of these, a novel multiview co-clustering method named Fast Multiview Co-Clustering in Unified Subspace (FOCUS) is put forward, which achieves discrete label decoupling directly within the same latent space. Given that feature embedding is completed in an unsupervised manner, the principle of information loss minimization is considered to ensure the sparsity and validity of common representations. On this basis, dynamic decoupling is introduced to extract labels for both samples and features, where a discrete constraint enables integrated clustering without any post-processing. Besides, extreme feature loss can mislead optimization, so least-absolute criteria are adopted in function design, while the coupling matrix is further relaxed to be unconstrained for flexible approximation in an enhanced version. In this way, the view weights can be self-updated according to the re-weighted strategy, and the comparison results with eleven state-of-the-art methods on six real-world data sets verify the superiority of our method.

Abstract:
Video captioning remains a challenging task due to the diverse video content and the complex relationships between visual and textual elements. Recent efforts predominantly focus on multimodal architecture designs trained with paired video-caption data. Nonetheless, this learning paradigm suffers from the “one-to-many” correspondence problem, since one source video is mapped to multiple caption annotations. The difficulty of video captioning is further exacerbated by poorly written captions, which mislead the captioner with irrelevant information. Essentially, the problem stems from the inadequate alignment between video and caption. In this work, we propose a Text-Conditional Alignment Transformer, which fully exploits the rich information provided by diverse labeled captions, and avoids the impacts of label ambiguity and noise. To alleviate the challenge of the “one-to-many” correspondence, we introduce Text-conditioned Video Encoding, which diversifies the video representation by emphasizing the spatial-temporal visual areas relevant to the given descriptions while filtering out redundant visual information. The refined video representation is well-aligned to match the corresponding text description, and naturally converts the “one-to-many” mapping to a “one-to-one” mapping. To deal with the noisy annotations, we propose Quality-aware Caption Decoding. We first dynamically measure the qualities of different captions corresponding to the same video in a reference-free manner. Then the estimated qualities are further utilized as auxiliary signals, guiding the model to perform quality-aligned learning from noisy captions. We conduct extensive experiments on the MSR-VTT, MSVD, VATEX and ActivityNet-Entities datasets, and demonstrate consistent performance improvements compared to state-of-the-art methods.

Abstract:
CLIP has greatly advanced zero-shot segmentation by leveraging its strong visual-language association and generalization capability. However, directly adapting CLIP for segmentation often yields suboptimal results due to inconsistencies between image and pixel-level prediction objectives. Additionally, merely combining segmentation and CLIP models often leads to disjoint optimization, introducing significant computational overhead and additional parameters. To address these issues, we propose a novel CLIP-to-Seg Distillation approach, incorporating global and local distillation to flexibly transfer CLIP’s powerful zero-shot generalization capability to existing closed-set segmentation models. Global distillation leverages CLS tokens to condense segmentation features and distills high-level concepts to the segmentation model via image-level features. Local distillation adapts CLIP’s local semantic transferability to dense prediction tasks using object-level features, aided by pseudo-mask generation for latent class mining. To further generalize the CLIP-distilled segmentation model, we generate latent text embeddings for the mined latent classes by coordinating their text embeddings and dense features. Our method equips existing closed-set segmentation models with strong generalization capabilities for open concepts through effective and flexible CLIP-to-Seg distillation. Without relying on the CLIP model or introducing extra inference overhead, our method seamlessly integrates into existing closed-set segmentation models and enables zero-shot capability, achieving state-of-the-art performance on multiple benchmarks.

Abstract:
Estimating homography from an image pair is crucial for image alignment, and unsupervised methods that optimize feature reprojection error between target and warped source images have gained attention for their promising performance. In real-world scenes with multiple planes, such as moving objects, outlier rejection strategies are essential to mitigate the influence of non-dominant planes. Existing methods address this by learning a mask based on reprojection error, where high errors indicate non-dominant planes misaligned by homography. However, this error-fitting mask often overextends to the dominant plane, limiting the use of valid image regions for accurate estimation. This paper proposes a novel unsupervised method to compactly exclude non-dominant planes by introducing an uncertainty-adaptive cost volume for homography estimation. We first model uncertainty by assuming image features follow a Gaussian distribution derived from a prior Normal Inverse-Gamma distribution. The network-learned distribution parameters disentangle aleatoric uncertainty, distinguishing data-dependent errors within the total reprojection error. This uncertainty reflects inherent observation noise in image data, effectively indicating non-dominant planes. We then integrate this aleatoric uncertainty into the concatenation volume across image feature maps, creating an adaptive volume that filters out unreliable matching costs associated with non-dominant planes. This adaptive volume simplifies learning homography from the rich, redundant content in the concatenation volume, enabling more efficient and accurate estimation. Experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance both qualitatively and quantitatively.
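
For readers unfamiliar with the Normal Inverse-Gamma (NIG) prior mentioned above, the standard evidential-regression decomposition is sketched below. This is the common textbook form, given under the assumption that each feature value $y$ is Gaussian with unknown mean $\mu$ and variance $\sigma^2$; the symbols $\gamma, \nu, \alpha, \beta$ are the usual NIG parameters and are not taken from the paper, whose exact parameterization may differ.

```latex
% Hierarchical model: Gaussian likelihood with a Normal Inverse-Gamma prior
y \sim \mathcal{N}(\mu, \sigma^2), \qquad
\mu \sim \mathcal{N}\!\left(\gamma, \frac{\sigma^2}{\nu}\right), \qquad
\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)

% Moments used as prediction and uncertainties (valid for \alpha > 1)
\mathbb{E}[\mu] = \gamma, \qquad
\underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad
\underbrace{\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}
```

Under this form, the aleatoric term depends only on $\alpha$ and $\beta$, so it can be read off the network outputs per pixel and used to down-weight matching costs where observation noise is high.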

Abstract:
Multi-view clustering, which uses information from multiple views to partition data into distinct clusters, has garnered significant attention. Existing MLP-based and GCN-based algorithms primarily focus on enhancing performance by extracting node attribute and graph structure features, and then integrating them for clustering. However, features extracted by MLP and GCN may differ in quality due to the high sparsity and noise in multi-view data. Directly integrating these features can lead to feature contamination. To address this issue, we propose a novel mutual calibration network for multi-view clustering (McMVC). Specifically, the attribute and structural features from multi-view data are integrated separately using the attention fusion module. A classifier is then employed to obtain high-confidence pseudo-labels for these fused features. We design a Cluster-level Mutual Calibration (CLMC) module that uses the pseudo-labels to mutually calibrate view features while maintaining compact cluster structures. During training, the attribute and structural features are not directly integrated to avoid feature contamination but are jointly optimized by reliable class information. Concurrently, we construct a Centroid-level Contrastive Calibration (CLCC) module to map view features into their inner centroid space and learn more discriminative centroid representations. Our approach outperforms state-of-the-art methods in multi-view data clustering, as demonstrated by extensive experiments on six real-world benchmark datasets. The source code is available at https://github.com/YuangXiao/McMVC

Abstract:
Video steganography in the intra prediction mode (IPM) domain embeds secret messages by modifying IPM values. However, such modifications are highly susceptible to detection by video steganalysis techniques, particularly those leveraging recompression-based calibration features. In this paper, the signal restoration phenomenon that occurs during video recompression is first modeled, which reveals the underlying reason for the effectiveness of recompression calibration-based detection features. Based on this insight, an IPM priority-preserving strategy is proposed. This strategy integrates the steganographic modification state with the optimal IPM selection mechanism during recompression, employing dynamic cost revision and joint cost decomposition to guide steganographic modifications toward optimal selection. By aligning modifications with the recompression tendency, the proposed method mitigates signal restoration effects, reduces distribution discrepancies in calibration-based detection features, and enhances overall steganographic security. Extensive experimental evaluations demonstrate that the proposed scheme significantly improves resistance against both intra-frame and inter-frame steganalysis features while maintaining superior visual quality and bitrate control.

Abstract:
Background subtraction is a core problem in computer vision, widely used in video surveillance to segment moving foreground objects from video sequences. While deep learning approaches have shown strong performance, especially under dynamic backgrounds and sudden illumination changes, they typically rely on large-scale, high-quality labeled video datasets. Acquiring such data is time-consuming and expensive, making existing supervised or weakly supervised methods less suitable for real-time applications. Moreover, many of these methods suffer from performance degradation when applied to unseen video sequences. To address these challenges, we present the UTGMP-BS algorithm: an Unsupervised Transformer-based pseudo-label Generator with a Message-Passing network for the Background Subtraction task. UTGMP-BS is a fully unsupervised framework designed to learn directly from unlabeled video sequences. It comprises two key components: a transformer-based pseudo-label generator, which produces initial pixel-level foreground and background labels using an encoder-decoder architecture and an $\mathcal{L}_1$ loss, and a message-passing network, which acts as a label-cleaner discriminator to refine the pseudo labels and enforce spatial consistency. These two branches engage in mutual learning through consecutive iterations, enhancing one another’s performance without any ground-truth supervision. The framework is trained using an alternating iterative learning strategy with binary cross-entropy loss, achieving robust background subtraction across varied scenes. Extensive experiments on six publicly available benchmark datasets demonstrate that UTGMP-BS achieves competitive results compared to existing State-of-The-Art (SOTA) methods.

Abstract:
It has been demonstrated that deep learning generates high-quality reference frames and achieves considerable gains in inter coding. Existing neural network-based reference frame generation (NN-RFG) methods are mainly built on convolutional neural networks (CNNs) with limited receptive fields and are applied to Versatile Video Coding (VVC). In this paper, we propose Transformer-based reference frame synthesis for VVC inter coding, named TRFS. We introduce Transformers into NN-RFG to capture global contextual information and construct multi-scale feature pyramids for accurate optical flow estimation. TRFS operates between the decoded picture buffer (DPB) and reference picture lists (RPL) in VVC, generating new reference frames through spatiotemporal compensation between previously reconstructed frames. First, we present a hierarchical feature extractor based on a parameter-efficient Transformer to capture global contextual information at different resolutions and construct multi-scale feature pyramids. Second, we design a weight-sharing optical flow estimator consisting of residual blocks and Transformers to progressively refine the bidirectional or unidirectional optical flows in a coarse-to-fine manner. Third, we employ a U-net frame enhancer equipped with a ConvNeXt variant to learn residuals from the input and warped frames while removing blurring distortion caused by backward warping. To train the TRFS network, we utilize a two-stage incremental learning strategy based on quantization parameter (QP)-distance to address the imbalanced QP gap between the compressed input and its label, enhancing its learning capability. Various experiments demonstrate that TRFS achieves average Bjøntegaard Delta rate (BD-rate) gains of RA: 6.61%, 13.86%, 13.30% and LB: 6.10%, 13.92%, 13.14% for Y, U, V components over VTM-11.0_NNVC-10.0 with Neural Network-Based Video Coding (NNVC)-tools disabled. Furthermore, when evaluated against VTM-11.0_NNVC-10.0 with NNVC-tools enabled, TRFS delivers average BD-rate gains of RA: 4.43% and LB: 3.40% for the luma component.

Abstract:
Text-based person retrieval is a cross-modal task that seeks to match pedestrian images with their corresponding textual descriptions. A key challenge in this task arises from the inherent one-to-many relationships: a single image can correspond to multiple descriptions, and a single description may relate to several images. Conventional deterministic embedding methods, which map images and texts to fixed feature vectors, struggle to capture such complex relationships effectively. To overcome this limitation, we introduce Probabilistic Distribution Alignment (PDA), a framework that represents both pedestrian images and text as probabilistic distributions and models the interactions between visual and linguistic modalities. PDA comprises three main components. First, Distributional Representation Modeling (DRM) encodes images and text into Gaussian distributions using a specially designed distance metric, allowing the model to capture uncertainty in the representations. Second, Cross-Modal Containment (CMC) aligns the distributions of text and masked text with their associated image distributions to strengthen semantic correspondence. Third, Intra-Modal Containment (IMC) enforces structured learning within each modality by embedding distributions alongside their masked variants, improving robustness to incomplete observations. Experiments on standard benchmarks demonstrate that PDA achieves superior performance compared with state-of-the-art methods, effectively handling ambiguity and cross-modal variability. These results highlight probabilistic distribution modeling as a powerful paradigm for vision-language alignment in pedestrian retrieval.

Affiliations: State Key Laboratory of Integrated Chips and Systems, School of Microelectronics, Fudan University, Shanghai, China; State Key Discipline Laboratory of Wide Band-Gap Semiconductor Technology, School of Microelectronics, Xidian University, Xi’an, China; National Key Laboratory of Infrared Detection Technologies, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China; Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Conventional image signal processing (ISP) control algorithms based on human visual perception are insufficient for the demands of modern machine vision systems. Although recent learning-based methods have attempted to adapt ISP hyperparameters for specific vision tasks, their high latency and hardware cost of iterative optimization hinder deployment on edge devices such as those used in autonomous driving. To address these limitations, this paper proposes a real-time machine vision system through algorithm–hardware co-design. First, we introduce a one-iteration learning framework to avoid iterative optimization, significantly reducing latency for real-time use. Second, we propose a hardware-friendly controller, RasterNet, specifically tailored for raster-scanning sensor dataflow, eliminating redundant computation. Third, we present a pipelined ISP controller architecture incorporating branch and chroma time division multiplexing techniques to minimize the number of processing elements, achieving a compact and efficient design. Experiments demonstrate that the proposed system achieves superior object detection accuracy on resource-constrained platforms, with real-time performance reaching 70 FPS on FPGA and 224 FPS on ASIC implementations.

Abstract:
High Dynamic Range (HDR) imaging aims to reconstruct scenes with a wide range of luminance by fusing multi-exposure Low Dynamic Range (LDR) images. In dynamic scenes with pronounced foreground motion or camera jitter, especially under challenging conditions including extremely low or high luminance, widespread saturation, and substantial motion, existing approaches often encounter ghosting artifacts, spatial misalignment, and degradation of fine structural details. Traditional techniques based on handcrafted priors struggle to generalize to complex motion patterns, while most deep learning-based methods operate exclusively in the spatial domain, limiting their ability to capture global contextual cues and restore high-frequency structures that are better represented in the frequency domain. To address these challenges, we introduce a Dual-Domain Parallel Fusion Network with Prompt Refinement (DDPF-PR), which jointly leverages spatial and frequency-domain features for enhanced HDR reconstruction. Specifically, the proposed framework consists of a Bi-Domain Interaction Module (BDIM), which integrates spatial features for local detail and frequency features for global structure to suppress ghosting artifacts caused by motion. In addition, a Prompt Refinement Module (PRM) is designed to recover fine details in degraded regions such as saturated or misaligned areas by adaptively generating structural cues. Extensive experiments demonstrate that DDPF-PR consistently outperforms state-of-the-art methods across multiple benchmarks in both qualitative and quantitative evaluations. The code will be made publicly available.

Abstract:
Conventional frame-based cameras inevitably produce blurry images due to motion occurring during the exposure time. Event cameras, bio-inspired sensors that offer continuous visual information, can enhance deblurring performance. Effectively utilizing the high-temporal-resolution event data is crucial for extracting precise motion information and enhancing deblurring performance. However, existing event-based image deblurring methods usually utilize voxel-based event representations, losing the fine-grained temporal details that are mathematically essential for fast motion deblurring. In this paper, we are the first to introduce a point cloud-based event representation into the image deblurring task and propose a Multi-Temporal Granularity Network (MTGNet). It combines the spatially dense but temporally coarse-grained voxel-based event representation and the temporally fine-grained but spatially sparse point cloud-based event representation. To seamlessly integrate such complementary representations, we design a Fine-grained Point Branch. An Aggregation and Mapping Module (AMM) is proposed to align the low-level point-based features with frame-based features, and an Adaptive Feature Diffusion Module (AFDM) is designed to manage the resolution discrepancies between event data and image data by enriching the sparse point features. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art approaches on both synthetic and real-world datasets. Our code is available at: https://github.com/xplin13/MTGNet

Abstract:
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model’s parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis. Here is our project page: https://github.com/FanQi-AI/MIGLoRA

Abstract:
The computational and memory demands of deep neural networks for vision tasks remain a critical barrier to their deployment on resource-constrained edge devices. Although knowledge distillation (KD) effectively transfers the knowledge of over-parameterized models into compact students, its efficacy diminishes substantially when a significant capacity gap exists between them. Current approaches often impose linear mapping constraints between output distributions, an assumption that becomes prohibitively restrictive under such capacity gaps. This paper proposes a fundamental relaxation of alignment requirements. Specifically, rather than enforcing strict parametric relationships, we experimentally validate that preserving monotonic rank correlation between teacher and student outputs suffices for effective knowledge transfer. To operationalize this insight, we introduce Monotonic Rank Knowledge Distillation, a novel framework that leverages differentiable approximations of Kendall’s rank correlation coefficient to measure and optimize rank-order consistency. Our methodology further decomposes rank correlation into inter-class and intra-class components, ensuring the student network retains both global discriminative patterns and fine-grained categorical distinctions inherent to the teacher’s outputs. Extensive experiments across CIFAR-100 and ImageNet-1K benchmarks validate the effectiveness of our approach, demonstrating consistent performance gains over state-of-the-art distillation methods. The proposed framework achieves superior generalization across diverse architectures, including CNN-based, MLP-based, and ViT-based models, with particular efficacy in various compression scenarios.
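
To make the idea of a differentiable approximation of Kendall's rank correlation concrete, here is a minimal sketch. It is one common smooth surrogate, not necessarily the one used in the paper: the function name `soft_kendall_tau` and the `temperature` parameter are assumptions. It replaces the non-differentiable sign of each pairwise difference with `tanh`, and it omits the inter-class and intra-class decomposition described above.

```python
import numpy as np


def soft_kendall_tau(teacher_logits, student_logits, temperature=1.0):
    """Smooth surrogate of Kendall's tau between two score vectors.

    Returns a value in (-1, 1): close to 1 when both vectors rank the
    classes identically, close to -1 when the rankings are reversed.
    """
    t = np.asarray(teacher_logits, dtype=float)
    s = np.asarray(student_logits, dtype=float)
    # Pairwise differences: d[i, j] = x[i] - x[j]
    dt = t[:, None] - t[None, :]
    ds = s[:, None] - s[None, :]
    # tanh(d / temperature) is a differentiable stand-in for sign(d)
    agreement = np.tanh(dt / temperature) * np.tanh(ds / temperature)
    n = t.shape[0]
    # Sum over ordered pairs; the diagonal contributes 0 because tanh(0) = 0
    return agreement.sum() / (n * (n - 1))
```

A distillation loss could then be taken as `1 - soft_kendall_tau(teacher, student)`, which only penalizes the student for disagreeing with the teacher's ordering and not for differences in logit scale. A smaller `temperature` makes the surrogate closer to the exact coefficient at the cost of sharper gradients.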

Abstract:
Adapter-based fine-tuning methods for Visual-Language Models (VLMs) have shown promising performance for feature adaptation in limited data scenarios. However, existing adapters generally either employ parameterized transformation for multi-modality feature refining or exploit pairwise relationships between classes (i.e., GraphAdapter) for text enhancement, which ignore the inherent high-order correlations among data samples in the adaptation process. In this paper, for the first time, we propose to exploit the high-order relationships of visual samples within each mini-batch for fine-tuning VLMs and develop a novel Batch HyperGraph Adapter (BHGraphAdapter) to fine-tune VLMs. The core idea of BHGraphAdapter is to conduct feature adapter learning by capturing the inherent high-order semantic information of different samples within each mini-batch, which thus can fully exploit the complex context information in adaptation. Specifically, we first construct a Batch HyperGraph (BHGraph) to model the high-order correlation of samples within each mini-batch. Then, we introduce a message propagation module on BHGraph to update the node embeddings by aggregating information from their high-order neighbors, thereby capturing semantic relationships to enrich feature representation. Finally, we incorporate the proposed BHGraph learning into the pre-trained CLIP framework to achieve the feature adaptation for the downstream tasks. Extensive experiments on 11 benchmark datasets show that our proposed BHGraphAdapter outperforms the SOTA adapter tuning methods. The source code and data will be released at https://github.com/LiuMeilin7195/BHGraphAdapter

Abstract:
The core challenge in video understanding lies in perceiving dynamic content changes over time. However, multimodal large language models (MLLMs) struggle with temporal-sensitive video tasks, such as video temporal grounding, which requires generating timestamps to mark the occurrence of specific events. Existing strategies require MLLMs to generate absolute or relative timestamps directly. We have observed that those MLLMs tend to rely more on language patterns than visual cues when generating timestamps, affecting their performance. To address this problem, we propose VideoExpert, a general-purpose MLLM suitable for several temporal-sensitive video tasks. Inspired by the expert concept, VideoExpert integrates two parallel modules: the Temporal Expert and the Spatial Expert. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. It processes high-frame-rate yet compressed tokens to capture dynamic variations in videos and includes a lightweight prediction head for precise event localization. The Spatial Expert focuses on content detail analysis and instruction following. It handles specially designed spatial tokens and language input, aiming to generate content-related responses. These two experts collaborate seamlessly via a special token, ensuring coordinated temporal grounding and content generation. Notably, the Temporal and Spatial Experts maintain independent parameter sets. This parameter decoupling design enables specialized learning within each part without mutual interference. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions. Moreover, we introduce a Spatial Compress module to obtain spatial tokens. This module filters and compresses patch tokens while preserving key information, delivering compact yet detail-rich input for the Spatial Expert. Extensive experiments conducted on four widely used benchmarks (i.e., Charades-STA, QVHighlight, YouCookII and NextGQA) across four tasks (temporal grounding, highlight detection, dense video captioning and grounding question answering) demonstrate the effectiveness and versatility of VideoExpert.

Abstract:
Camouflaged Object Detection (COD) is a formidable computer vision challenge due to the striking resemblance between camouflaged objects and their surroundings. Despite progress in existing methods, they still face significant limitations, particularly in addressing the issues of fuzzy boundaries and the inadequate fusion of local and global features. To address these challenges, we present a multi-scale COD network named Multi-Scale Local-Global Fusion (MSLGF). MSLGF incorporates a Multi-Scale Fusion Module (MSFM), which skillfully integrates feature maps at multiple scales to produce high-fidelity edge features. Additionally, to refine the detection process, a Local-Global Feature Fusion Module (LGFFM) combines the local edge details with global semantic information of camouflaged targets, significantly enhancing the accuracy of COD. Experimental results show that MSLGF achieves remarkable performance across 3 benchmark datasets, i.e., Camouflaged Object Dataset (CAMO), Camouflaged Object Dataset with 10,000 Images (COD10K), and NC4K. Specifically, MSLGF attains a structure-measure from 0.879 to 0.894 and a weighted F-measure between 0.817 and 0.856. The source code is publicly available at https://github.com/tc-fro/MSLGF

Abstract:
Image deblurring is a challenging imaging task that is regarded as a classical inverse problem. The recently proposed deep primal-dual proximal network (DeepPDNet) unrolls the Condat-Vũ primal-dual splitting algorithm as a feed-forward network and has demonstrated excellent restoration performance. However, the feature patterns in DeepPDNet are manually designed, and thus the network is not implemented in an efficient convolutional fashion. In this work, we revisit DeepPDNet and extend it in three respects: i) the convolution and pooling operators, as well as their associated adjoint operations, are studied in the primal-dual algorithm, and a deep convolutional primal-dual network (DeepConvPDNet) and its full variant with skips are proposed to preserve the optimization consistency of the primal-dual Condat-Vũ algorithm; ii) two variants of the networks (cascade vs. parallel) are designed according to the structure of the convolutional kernels; iii) rather than being given as prior knowledge, the blur kernels can be encoded by a set of convolutional layers, with deconvolutional layers for their conjugates, resulting in a fully learnable deep convolutional primal-dual neural network. We investigate the proposed networks on the MNIST dataset, the grayscale and color versions of the BSD dataset, and the GoPro dataset for image deblurring. Extensive experiments validate the performance of the proposed networks, and promising results in terms of PSNR and SSIM are obtained in comparison with twelve methods, including state-of-the-art methods (e.g., Restormer, DRUNet, and DeblurGAN).

Abstract:
Despite significant progress in real-world image dehazing, efficiently generating high-fidelity, haze-free videos (especially in driving scenarios) remains challenging. Existing methods generally extend image dehazing techniques to videos by employing pre-trained single image dehazing models for preprocessing followed by refinement stages. However, this disjointed two-stage process often leads to unrealistic textures and loss of detail, as it fails to leverage large amounts of high-quality images for prior learning and the subsequent refinement struggles to correct temporal inconsistencies across frames introduced in the first stage. To address these issues, we propose DVDPEC: a Driving Video Dehazing framework utilizing a Position Embedding-based (PE-based) Codebook and a novel Flow Selective Block (FSB). The PE-based codebook stores fine-grained, spatially aware textural information specific to driving videos and leverages implicit positional embeddings for precise, position-aware codebook matching. This enables accurate prior retrieval and improves dehazing results. The FSB aggregates information from adjacent frames by dynamically combining both image flow and prior flow, effectively mitigating flow estimation ambiguities caused by haze. It enhances information fusion across frames, leading to more coherent and visually appealing dehazed videos. Extensive experiments demonstrate that DVDPEC achieves state-of-the-art performance on real-world driving video dehazing tasks, significantly enhancing texture preservation and visual fidelity.

Abstract:
Recent advancements in deep learning have led to significant achievements in hashing for image retrieval. However, existing methods primarily operate under the assumption that training and testing data share the same distribution, meaning that the categories in the training and test sets are identical. This assumption may not hold in real-world scenarios, potentially limiting the effectiveness of these methods. In this work, we investigate the performance of existing deep hashing methods on unseen category data during retrieval tests and find a considerable performance decline. To address this issue, we propose a Hierarchical Text-guided Hashing (HTH) framework to mitigate the performance degradation in open-world image retrieval. Specifically, our method is trained in a self-supervised learning (SSL) framework using automatically synthesized coarse-to-fine textual descriptions. By combining the strengths of SSL in learning discriminative low- and mid-level features with the semantic richness of hierarchically structured text, our approach aims to enhance the model’s ability to generalize across unseen categories and complex open-world settings. Technically, we elaborately design a local attention pooling module to fuse the local patch information. Furthermore, we propose both hierarchical and fine-grained alignment modules, respectively applied to the global and local vision-language representations at different semantic levels, guiding the hash encoding to fully understand the visual primitives and extract discriminative and generalizable semantic information from images. Under the newly established large-scale ImageNet-CoG open evaluation protocol, our method demonstrates significant improvements in generalization compared to state-of-the-art and also possesses enhanced performance across various other open-world retrieval datasets and scenarios.

Abstract:
Subject-driven text-to-image generation aims to generate customized high-fidelity images based on text descriptions for specific subjects, which has gained increasing attention. Despite recent advancements in single-subject customization, existing methods often struggle with multi-subject scenarios, leading to distortions in subject identity. This challenge arises because entangled identity-relevant and irrelevant information can obscure subject identities, and inter-subject interference can cause confusion or loss of individual identities. To address these issues, we propose CausalT2I, a customized multi-subject text-to-image generation framework with causal tuning. First, we propose a subject-aware causal disentanglement method, which can self-adaptively distinguish causally relevant and irrelevant information for subjects through causal intervention and a causal disentangled objective. Then, we design a soft cross-attention guidance strategy to mitigate interference among different subjects by aligning the textual attributes of each subject with its identity-relevant visual attributes. Last, we introduce a causal denoising objective to optimize the denoising process using identity-preserved textual embeddings and identity-irrelevant visual embeddings. Extensive experiments show that CausalT2I has superior generation ability in subject-driven text-to-image generation over existing baseline methods and brings more flexibility and controllability for generating customized multi-subject images.

Abstract:
Recently, artificial intelligence (AI) algorithms have been extensively utilized as optimization tools in digital watermarking. However, existing works seldom consider the reversibility of the embedding framework, thereby neglecting the protection of sensitive carriers. In this paper, we propose an AI-assisted reversible data hiding (RDH) method based on reinforcement learning. In our method, the Q-Learning algorithm is introduced to address the optimization problem of RDH, i.e., adaptive two-dimensional (2D) mapping generation. It is designed to simulate pixel modifications in 2D space, employing a Markov decision process formulation within the reinforcement learning paradigm. A new reward function is designed to evaluate the effectiveness of a 2D mapping based on the estimated embedding capacity and the distortion-capacity ratio. To ensure reversibility, a mapping adjustment strategy is implemented to update the environmental states. Experimental results show that the proposed method outperforms conventional 2D RDH methods and demonstrates competitive performance compared to other state-of-the-art RDH methods.
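For readers unfamiliar with Q-Learning, the following is a minimal sketch of the standard tabular update on a toy problem. It does not reproduce the paper's states, actions, or reward: the 1-D chain environment, the function name `q_learning`, and all hyperparameter values are assumptions chosen for illustration only. In the RDH setting the reward would instead come from the estimated embedding capacity and the distortion-capacity ratio, as the abstract states.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.
    Actions: 0 = step left, 1 = step right. Reaching the last state
    ends the episode with reward 1; every other step gives reward 0."""
    rng = random.Random(seed)
    moves = (-1, +1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection (ties go to "right").
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = min(max(s + moves[a], 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q

q_table = q_learning()
greedy_policy = [0 if row[0] > row[1] else 1 for row in q_table[:-1]]
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest route to the reward in this toy chain.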

Abstract:
Generalized Category Discovery (GCD) is a challenging task that aims to identify both seen and novel categories in unlabeled data. We argue that a clear margin between seen and novel class representations is essential for accurate recognition. However, existing methods often ignore this margin, mapping representations to prototypes without enforcing separation between seen and novel classes. This leads to a bias where seen samples are misclassified as novel. To address this issue, we propose DebiasGCD, a debiasing framework that enhances prototype separation through margin-aware learning. Unlike prior work that relies on static prototype learning and overlooks fine-grained representations, our method introduces Dynamic Prototype Debiasing (DPD) and Spatial-Aware Representation Distillation (SARD) to mitigate this bias. First, DPD dynamically enforces inter-prototype margins, improving class-specific feature learning and prototype discrimination. Meanwhile, SARD promotes local representation of spatial learning, supporting DPD to capture subtle details that further refine class-specific features. By synergizing these components, DebiasGCD significantly improves prototype discriminability, generating more reliable predictions for seen classes. Extensive experiments demonstrate that our approach effectively mitigates pseudo-labeling bias across datasets, especially on fine-grained ones, achieving +8.3% and +9.6% improvements on the ‘All’ classes in CUB and Stanford Cars, respectively.

Abstract:
The Contrastive Language-Image Pretraining (CLIP) model has been widely used in various downstream vision tasks. The few-shot learning paradigm has been widely adopted to augment its capacity for these tasks. However, current paradigms may struggle with fine-grained classification, such as satellite image recognition, due to widening domain gaps. To address this limitation, we propose retrieval-enhanced visual prompt learning (RePrompt), which introduces retrieval mechanisms to cache and reuse the knowledge of downstream tasks. RePrompt constructs a retrieval database from either training examples or external data if available, and uses a retrieval mechanism to enhance multiple stages of a simple prompt learning baseline, thus narrowing the domain gap. During inference, our enhanced model can reference similar samples brought by retrieval to make more accurate predictions. A detailed analysis reveals that retrieval helps to improve the distribution of late features, thereby improving generalization for downstream tasks. RePrompt attains state-of-the-art performance on a wide range of vision datasets, including 11 image datasets, 3 video datasets, 1 multi-view dataset, and 4 domain generalization benchmarks.

Abstract:
The advent of text-driven 360-degree panorama generation, enabling the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advancement in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing such content. Recent progress in text-to-image diffusion models has accelerated the rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms. We extend our analysis to two closely related domains: text-driven 360-degree 3D scene generation and text-driven 360-degree panoramic video generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/

Abstract:
Multi-view clustering (MVC) has emerged as a powerful approach for integrating diverse sources of information from complex datasets. Nevertheless, existing methods struggle to accurately capture the global correlations and high-order structures in the data, and employ anchor-based techniques within a single dimension, limiting their representation. To address these issues, we propose an Anchor-induced Serial Tensor Representation (ASTR) framework, which effectively harnesses serial tensor representation to capture comprehensive multi-view information while reducing approximation errors and enhancing clustering performance. Specifically, ASTR begins with projection learning to explore low-dimensional latent spaces in multi-view data. Then, we introduce multi-anchor learning, where multiple anchor configurations are generated within the latent spaces, yielding a set of corresponding bipartite graphs. Besides, we organize these bipartite graphs into a sequence of global tensors, forming the serial tensor representation that encapsulates high-order inter- and intra-view relationships. Furthermore, we introduce the Laplace function to achieve a more accurate tensor rank approximation, complemented by a thorough theoretical analysis. Finally, a one-step clustering process, guided by adaptive weights, directly fuses the learned graphs to produce the final clustering indicator matrix. Experimental results demonstrate that ASTR possesses superior clustering accuracy and comparable efficiency.
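The abstract does not give the exact form of the Laplace function, so the sketch below assumes the commonly used Laplace-type rank surrogate, sum_i (1 - exp(-sigma_i / eps)), applied to a list of singular values. The function names and the value of `eps` are hypothetical. It only illustrates why such a surrogate can track the rank more closely than the nuclear norm, which grows with the magnitude of the singular values.

```python
import math

def true_rank(singular_values, tol=1e-9):
    """Number of singular values that are numerically nonzero."""
    return sum(1 for s in singular_values if s > tol)

def nuclear_norm(singular_values):
    """Convex rank surrogate: the sum of the singular values."""
    return sum(singular_values)

def laplace_surrogate(singular_values, eps=0.1):
    """Assumed Laplace-type surrogate: each nonzero singular value
    contributes close to 1 regardless of its size, while zero
    singular values contribute exactly 0."""
    return sum(1.0 - math.exp(-s / eps) for s in singular_values)

sigmas = [10.0, 5.0, 0.0]  # singular values of a hypothetical rank-2 matrix
```

For `sigmas`, the rank is 2, the nuclear norm is 15.0, and the Laplace surrogate is very close to 2, since both nonzero singular values saturate the exponential term.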

Abstract:
Transformer-based architectures exhibit substantial promise in the realm of ultra-high-definition (UHD) image restoration (IR). Nevertheless, they encounter significant challenges in maintaining high-frequency (HF) details, which are crucial for the reconstruction of texture. Conventional methods tackle computational complexity by significantly reducing the resolution (by a factor of 4 to 8). Moreover, the majority of high-frequency components are eliminated due to the inherent characteristics of self-attention mechanisms, as these mechanisms tend to naturally suppress high-frequency elements during non-local feature integration. This paper proposes a dual-branch transformer architecture that synergistically combines native-resolution HF preservation with efficient contextual modeling, named HiFormer. The high-resolution branch utilizes a directionally-sensitive large-kernel decomposition to effectively address anisotropic degradations with fewer parameters and applies depthwise separable convolutions for localized high-frequency (HF) information extraction. Concurrently, the low-resolution branch assimilates these localized HF elements using adaptive channel modulation to offset spectral losses induced by the inherent smoothing effect of self-attention. Comprehensive experiments across numerous UHD image restoration tasks reveal that our approach surpasses current leading methods in both quantitative metrics and qualitative analysis. The code is available at https://github.com/5chen/HiFormer

Abstract:
Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. These two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points without requiring extra annotations. Compared with the existing published monocular and binocular 3D detection methods, StereoDETR breaks the trade-off between speed and accuracy. Through a concise framework, it achieves binocular-level accuracy while maintaining monocular-level inference speed. The code is available at https://github.com/shiyi-mu/StereoDETR-OPEN

Abstract:
Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models based on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occluded scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. It selects real-world complex images from COCO and converts semantic segmentation masks to matting labels. The resulting COCO-Matting comprises an extensive collection of 36,980 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder by end-to-end matting losses, which does not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the network architecture and training objectives. For network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features. The proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. Code is available at https://github.com/XiaRho/SEMat

Abstract:
Accurate stereo matching under limited computational resources remains a central challenge in 3D perception tasks such as autonomous driving and robot navigation. Existing high-accuracy methods often rely on heavy architectures with significant memory and processing demands, while lightweight models typically compromise on feature expressiveness, leading to limited global understanding and detail loss. To bridge this gap, we propose DFD-Stereo, a lightweight and efficient stereo matching framework that delivers high-quality disparity estimation with reduced computational cost. The framework incorporates two key components: 1) a Decoupled Frequency-Spatial Learning (DFSL) module, which enables complementary spatial-frequency representation for enhanced global context modeling, and 2) a Stepwise Coupling Disparity Refinement (SCDR) module, which leverages multi-scale RGB-disparity fusion with Shuffle Attention to refine disparity predictions effectively. Experimental results across multiple benchmarks demonstrate that DFD-Stereo achieves superior accuracy with significantly improved efficiency, offering a promising solution for deployment in resource-constrained 3D vision systems.

Abstract:
3D facial avatar reconstruction is a fundamental problem in computer vision and graphics with applications in digital humans, virtual reality, and telepresence. Recent neural radiance field (NeRF)-based methods have greatly improved fidelity, yet most remain subject-specific, requiring multi-view images with diverse expressions and extensive test-time finetuning, which limits their generalization to unseen identities. Achieving high-quality reconstruction from a single image is particularly challenging due to missing multi-view supervision and the need to balance efficiency with fidelity. To address these issues, we present NOFA++, a generalizable one-shot framework that reconstructs photorealistic 3D facial avatars from a single input. Our method leverages the generative prior of a pretrained 3D GAN in an encoder–generator pipeline to recover a canonical neural volume, and introduces a coarse-to-fine residual generation strategy to synthesize identity-specific details without per-subject optimization. We further design a deformation field conditioned on identity and expression parameters to model facial dynamics, enabling controllable reenactment from video or audio. Extensive experiments show that NOFA++ surpasses state-of-the-art baselines in both reconstruction fidelity and reenactment quality, while eliminating test-time finetuning and generalizing robustly across unseen subjects.

Abstract:
Recent advancements in generative artificial intelligence have significantly promoted content creation and editing, and prevailing studies further extend this progress to video editing. These studies mainly transfer the inherent motion patterns from the source videos to the edited ones. Because the delivered motions are not explicitly aligned with the edited content, they often produce inferior results that are inconsistent with user intentions, especially when the shape of the edited object differs from that of the original. To address this limitation, we present a shape-consistent video editing method, namely StableV2V. Our method decomposes the entire editing pipeline into several sequential procedures: we first edit the initial video frame, then simulate the shape-aware alignment between the delivered motions and the edited sequence, and finally propagate the edited content to all other frames based on this alignment. Furthermore, we curate a testing benchmark, namely DAVIS-Edit, to offer a comprehensive evaluation of video editing, considering various types of prompts and difficulties. Experimental results and analyses illustrate the superior performance, visual consistency, and inference efficiency of our proposed method compared to existing state-of-the-art video editing studies.

Abstract:
Equirectangular projection (ERP) is a convenient form to store omnidirectional images, but it is neither equal-area nor conformal, creating challenges for subsequent visual communication. When used for image compression, ERP amplifies sampling density and deforms objects near the poles, hindering perceptually optimal bit allocation. Here, we present one of the earliest endeavors to apply deep neural networks to omnidirectional image compression. We first propose parametric pseudocylindrical representations that generalize common pseudocylindrical map projections. A tractable greedy algorithm is introduced to identify (sub-)optimal representation configurations, guided by a proxy objective for rate-distortion performance. We then develop pseudocylindrical convolutions, which can be efficiently implemented by standard convolutions with “pseudocylindrical padding.” To demonstrate the utility of the proposed pseudocylindrical representations and convolutions, we implement an end-to-end omnidirectional image compression method, consisting of an analysis transform, a uniform quantizer, a synthesis transform, and an entropy model. Experiments show that our optimized method achieves consistently better rate-distortion performance compared to the state-of-the-art.
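The sampling issue can be made concrete with a small sketch. ERP stores every row at full width, so a row at latitude phi is oversampled by a factor of 1 / cos(phi) relative to the equator. The function below computes per-row widths for a simple one-parameter family that interpolates between ERP and the sinusoidal projection. This family, the function name `row_widths`, and the `shape` parameter are assumptions for illustration, not the paper's parametric representation or its greedy search.

```python
import math

def row_widths(height, width, shape=1.0):
    """Samples kept per row in a hypothetical pseudocylindrical layout.
    shape = 0.0 keeps every row at full width (ERP);
    shape = 1.0 scales each row by cos(latitude) (sinusoidal projection)."""
    widths = []
    for i in range(height):
        # Latitude of the row centre, from -pi/2 (top) to +pi/2 (bottom).
        lat = math.pi * ((i + 0.5) / height - 0.5)
        scale = (1.0 - shape) + shape * math.cos(lat)
        widths.append(max(1, round(width * scale)))
    return widths

erp = row_widths(4, 8, shape=0.0)         # full-width rows
sinusoidal = row_widths(4, 8, shape=1.0)  # rows shrink toward the poles
```

In this 4 x 8 toy grid the sinusoidal layout keeps 20 samples where ERP keeps 32. The rows near the poles are the ones that shrink, which is where ERP oversamples most.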

Abstract:
The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection methods do not make effective use of crucial geometric information such as surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model’s ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.

Abstract:
Effective 3D object detection requires large-scale annotated datasets, which are expensive and time-consuming to produce - especially in indoor environments containing dense object arrangements. To address this, we propose a Large Vision-Language Model (LVLM)-driven technique for automatically generating high-quality pseudo-labels for 3D object detection in single- and multi-view scenarios. We propose Auto3DLabeler, which introduces the first-ever text-to-3D bounding box transformation. Its pipeline employs a text-based detector, a segmenter and an LVLM to generate annotation estimates, which are further refined by our IoU-guided iterative Box Aggregator and layout-aware prompt Class Refiner modules. We also introduce a semantic-enhanced multi-modal fusion module that integrates image-level semantic information into point cloud representations for precise detections. Collectively, our contributions provide a remarkable boost to the 3D object detection state-of-the-art. Extensive experiments on the SUN RGB-D and ScanNet datasets show our unsupervised detector variant outperforming existing semi-supervised detectors, and our semi-supervised variant achieving up to 28.2% absolute gain in challenging scenarios - all this while maintaining a considerable compute advantage over existing label-efficient methods. Our code and models will be made public after acceptance.

Abstract:
Saliency prediction is crucial for improving sports video processing efficiency, thereby providing an enriched viewing experience for a wide-ranging audience. However, there has long been an absence of well-established eye-tracking datasets and learning-based approaches tailored for sports videos. In this paper, we establish a large-scale eye-tracking dataset dubbed audio-visual sports (AVS). AVS consists of 1,000 high-quality sports videos with eye fixations from 60 participants. Through data analysis on AVS, we observe that human attention patterns exhibit significant variations based on the specific scene context of the sports. Motivated by our observations, we propose a sports-aware saliency prediction approach, named SportSal, which can adaptively predict saliency maps in a hyper manner. Specifically, a hypernetwork is introduced to learn sports-aware priors. Meanwhile, an audio-visual fusion (AVF) block is developed to effectively fuse features from the visual and audio backbones. Given the learned priors and fused audio-visual features, we propose the hyper deformable convolutional (HDC) block and the hyper upsampling (HU) block for dynamic feature extraction and upsampling, respectively. The two blocks are alternately connected to adaptively predict saliency maps. Experimental results show that our approach outperforms 21 state-of-the-art saliency prediction approaches over three sports video eye-tracking datasets. Finally, we demonstrate the application of our SportSal approach in perceptual video compression. The dataset and code will be available at https://github.com/WeNsHiJIe-19950103/SportSal

Abstract:
Images serve as a crucial information source for machine intelligence to understand the world, and how images are represented significantly impacts the generalizability and interpretability of intelligent systems. Disentangled representation learning offers a promising approach to improve both aspects. However, most existing methods rely predominantly on statistical independence assumptions. This poses two key limitations: first, it fails to capture the reality that many concepts are disentangled yet interrelated; second, it conflicts with human cognitive patterns, where concepts naturally exhibit complex dependencies. These limitations further hinder collaboration between machines and humans. To overcome these limitations, we propose Compositional Invariant Disentanglement (CID), a novel self-supervised learning method that enables models to learn composable representations aligned with human cognitive habits. Inspired by humans’ ability to flexibly recombine concepts, we reframe the definition of disentanglement through the lens of compositional invariance rather than statistical independence. This paradigm shift allows effective disentanglement even with correlated factors, achieving state-of-the-art disentanglement performance across multiple standard benchmarks (improved by 4.0% on Shapes3D, 4.5% on dSprites, and 28.6% on MPI3D). Furthermore, by building upon and extending the successful self-supervised learning framework BYOL, CID demonstrates potential for large-scale disentanglement pre-training on unlabeled data. This work contributes to extracting more robust and interpretable representations from images for machine intelligence.

Abstract:
Category-level 6D object pose estimation has gained increasing attention in applications of robotic manipulation, augmented reality, and scene understanding, due to its ability to generalize to unseen instances within the same category. However, existing methods struggle to handle intra-class shape variations, as they either adopt the mean shape as a prior or build associations among different instances without explicit category-shared information. To address this problem, a novel category-level object pose estimation method based on a diffusion model is proposed, which utilizes the generative ability of the diffusion model to refine a sparse categorical representation. In contrast to existing dense correspondence-based methods, our method employs a set of keypoints provided by learnable queries to represent object shape, enabling better categorical representation of different instances by focusing on the representative object components. The keypoints are then refined through a forward diffusion process and a reverse denoising process conditioned on category information. This allows flexible adaptation to various instances, especially those that deviate from the mean shape within the same category. On this basis, a geometric-semantic feature fusion module is presented to enhance keypoint feature representation. By integrating the geometric information from the point cloud with the high-level semantics from the RGB image using a two-branch attention mechanism, the keypoint feature is enriched and deeply combined, which facilitates the subsequent pose estimation. Extensive experiments on the REAL275 dataset, the CAMERA25 dataset, and real-world complex scenarios demonstrate the effectiveness of the proposed method.

Abstract:
Rate control (RC) is a critical component in learned image compression (LIC), particularly in the emerging JPEG-AI standard, as it enables the bitrate to be adapted to meet diverse bandwidth constraints. The default JPEG-AI RC employs an iterative optimization process, wherein a pre-trained RC model is selected and the (generated) latent representations are adjusted based on the mismatch between actual and target bitrates. Despite satisfactory results, such a trial-and-error paradigm necessitates multiple processing cycles, resulting in inevitable computational overhead. We propose an efficient neural rate control framework for JPEG-AI to address this limitation. Our idea is to train a ResNet-based neural rate control (NRC) model to learn the mapping from the input images and target bitrates to the optimal coding parameters. The trained NRC can then be applied to directly predict the coding parameters for new input images and target bitrates. Experimental results on the DIV2K and MSCOCO datasets show that our NRC achieves comparable rate-distortion performance while reducing encoding time by about 5× compared to the default JPEG-AI RC.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) requires models to be updated incrementally with limited labeled samples given in each session, differing from the traditional training paradigm and easily resulting in severe spurious relations between categories. Thus, in this paper, we propose to address FSCIL from a new perspective: enhancing FSCIL via disentangling spurious relations between categories. Accordingly, we propose a simple yet effective approach, dubbed ConTrollable Relation-disentangLed Few-Shot Class-Incremental Learning (CTRL-FSCIL). Specifically, during the base session, we propose to anchor base class embeddings in feature space and build disentangled proxies to bridge gaps between the learning processes of categories encountered in different sessions, making category relations controllable. Furthermore, during incremental learning, the parameters of the backbone network are frozen in order to relieve the negative impact of data scarcity. Meanwhile, a relation disentanglement loss is employed to guide a relation control module to disentangle spurious relations between learned categories. In this way, spurious relation issues in FSCIL can be alleviated. Extensive experiments on CIFAR-100, mini-ImageNet, and CUB-200 demonstrate the effectiveness of CTRL-FSCIL. Our code has been publicly released on GitHub.

Abstract:
Video question answering (VideoQA) necessitates simultaneous understanding of visual and linguistic information, requiring both in-depth analysis of individual modality features and the establishment of cross-modal correlations to achieve precise reasoning. However, VideoQA models often struggle with irrelevant temporal and spatial noise due to the dense events and concepts in complex real-world video content. Previous works reduce noise by only sampling a fixed number of visual tokens at the patch level, overlooking the variation in the required granularities of features and quantities of visual cues across different question conditions. To address these issues, we propose an Aggregating-then-Pruning Sampler (APSam), which diversifies feature granularities and adaptively denoises on a per-question basis. Specifically, we propose a conditional token aggregator to obtain multi-granularity visual semantics by merging similar question-relevant tokens. Then, we propose a conditional token pruner, which restricts noise tokens through a variable-capacity receptive field determined by the inputs. Experimental results show that APSam achieves significant performance on three challenging complex VideoQA datasets, i.e., AGQAv2, NExT-QA, and STAR. Further analyses reveal that APSam also exhibits high reasoning capability and interpretability.

Abstract:
Visible-Infrared Person Re-identification (VI-ReID) is critical for round-the-clock surveillance systems yet is hindered by significant modality discrepancies. Existing methods often fail to fully exploit frequency domain information, focusing predominantly on spatial domain feature learning or limited frequency decompositions. To address this, we propose the Multi-Frequency Embedding Network (MFENet), a feature-level method that operates in the frequency domain through multi-frequency decomposition to learn discriminative and modality-invariant features. Specifically, the HiLo-Frequency Modulation (HiLo-FM) module efficiently extracts low-frequency features via frequency-domain filtering and high-frequency details through lightweight multiscale convolutions, followed by attention-based fusion. The Frequency-Aware Diversity Enhancer (FADE) module further enriches feature discriminability by weighting multi-frequency components and learning diverse features through multi-branch architectures. To further enhance the performance of our method, we introduce two innovative loss functions. The Cross-Modality Soft Retrieval (CMSR) loss prioritizes cross-modality consistency over intra-modality similarity, while the Cross-Modality Ranking Regularization (CMRR) loss enhances feature diversity through differentiable rank correlation optimization. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving 61.06% Rank-1 and 67.75% mAP in the challenging IR to VIS mode on the largest VI-ReID benchmark LLCM, surpassing existing methods by significant margins without resorting to reranking or additional labeled data. Code is available at https://github.com/GuHY777/MFENet-VIReID.
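
The low-frequency branch mentioned above relies on frequency-domain filtering. As a rough illustration only, the sketch below applies a generic FFT low-pass filter to a single 2D feature map with NumPy. It is not the HiLo-FM module itself: the function name `low_pass`, the box-shaped mask, and the `keep_ratio` parameter are assumptions made for this example.

```python
import numpy as np

def low_pass(feature_map, keep_ratio=0.25):
    """Keep only the lowest spatial frequencies of a 2D map.

    keep_ratio is a hypothetical parameter: the fraction of the
    (centered) spectrum retained along each axis.
    """
    h, w = feature_map.shape
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    # Symmetric box mask around the DC term, so the output stays real.
    mask = (np.abs(yy - cy) <= keep_ratio * h / 2) & (
        np.abs(xx - cx) <= keep_ratio * w / 2
    )
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

# A constant map is pure low frequency; a checkerboard is pure high frequency.
constant = np.ones((8, 8))
checkerboard = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
smooth_constant = low_pass(constant)
smooth_checkerboard = low_pass(checkerboard)
```

In this toy case the constant map passes through unchanged while the checkerboard is suppressed, which is the behavior a low-frequency branch is meant to isolate.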

Abstract:
Temporal Sentence Grounding (TSG) aims at localizing a temporal interval in an untrimmed video that contains the most relevant semantics to a given query sentence. Most existing methods either address the problem in a fully-supervised manner, where temporal boundary annotations are provided, or are dedicated to weakly-supervised TSG without any boundary annotations. However, the former suffer from expensive annotation cost, while the latter yield inferior grounding performance. In this paper, we propose an Annotation-efficient Hybrid Learning (AHL) framework that aims to achieve good TSG performance with less annotation cost by leveraging weakly semi-supervised learning, contrastive learning and active learning: (1) AHL includes a progressive pseudo-label self-learning module, which generates pseudo labels and progressively selects reliable ones to re-train the model; (2) AHL includes a novel self-guided contrastive learning method that performs proposal-level contrastive learning based on weakly-labeled data to align the visual and language features; (3) AHL explores the construction of the fully-labeled set by gradually expanding it via active search over informative weakly-labeled samples, considering both difficulty and diversity. We conduct extensive experiments on the ActivityNet and Charades-STA datasets, and the results verify the effectiveness of the proposed AHL in exploiting weakly-labeled data and achieving the same performance as fully-supervised methods with much less annotation cost. Our code is available at https://github.com/DJX1995/AHL

Abstract:
Today, the popularity of 3D videos is increasing significantly. This trend can be attributed to their immersive appeal and lifelike experience. In an era dominated by the widespread distribution of digital content, data integrity and ownership are of crucial importance. In this context, the practice of traitor tracing, closely related to Digital Rights Management (DRM), facilitates the identification and tracking of unauthorized users who have violated copyright by illegally sharing copyright-protected content. In this paper, we introduce an innovative traitor tracing approach for 3D video, with a particular focus on the DIBR (Depth Image-Based Rendering) format, which can be vulnerable to an interleaving attack strategy. For this purpose, we develop a new phylogeny tree construction method designed to combat collusion attacks. Our experimental evaluations demonstrate the effectiveness of the proposed approach, particularly when applied to long fingerprinting codes. Compared to Tardos’ approach, our method delivers very good results, even for a large number of colluders.

Affiliations: Department of Computer Science, City University of Hong Kong, Hong Kong, China; School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China; School of Cybersecurity, Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China; School of Electronics and Information Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China; Institute of Artificial Intelligence (TeleAI) of China Telecom, Beijing, China

Abstract:
Benefiting from powerful tensor techniques, tensor low-rank representation has been proposed to construct sophisticated subspace clustering models. Existing tensor low-rank representation methods predominantly rely on a single low-rank prior to reconstruct the row space, which is instrumental in determining the subspace membership of samples. However, this strategy neglects the column space and leads to a loss of subspace information. To address this issue, we propose a Dual Tensor Low-Rank Representation (DTLRR) method, the first subspace clustering framework to theoretically recover both row and column subspaces simultaneously. In particular, rather than simply formulating a dual self-representation model, we prove the recovery of both row and column spaces via a unified theoretical framework. Then, we impose low-rank constraints on the two corresponding affinity tensors to effectively capture high-order correlations. Meanwhile, we theoretically demonstrate the existence of compact dictionary tensors within the dual self-representation framework, which effectively eliminates the null spaces of the affinity tensors and significantly reduces computational complexity. Furthermore, an efficient Alternating Direction Method of Multipliers (ADMM) algorithm is designed to solve the proposed DTLRR model with guaranteed convergence. Extensive experiments validate the superior performance of the proposed DTLRR in data clustering, hyperspectral image denoising, and hyperspectral anomaly detection.

Affiliations: College of Electronic and Information Engineering, Tongji University, Shanghai, China; Department of Computation, Information and Technology, Technical University of Munich, Munich, Germany; School of Computer Science and Technology, Tongji University, Shanghai, China; School of Mathematical Sciences, Peking University, Beijing, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China; Huawei Technologies Company Ltd., Shenzhen, China

Abstract:
Recently, general-purpose features for event camera data have become increasingly important in advancing event-based vision applications. Current methods typically adopt pre-training paradigms, yielding promising performance. However, the limited data and sparse spatial information of events hinder effective use of pre-training for rich semantic learning. In this paper, we tackle semantic scarcity by transferring knowledge from large pre-trained image models, without increasing event training data. Concretely, we propose a novel image-to-event knowledge distillation method named I2EKD. Acknowledging that different backbones suit different applications, we fix the teacher and keep the student architecture flexible. To improve versatility, we equip I2EKD with two model-agnostic objectives at the logit and feature levels. Additionally, without task-specific objectives or labels, I2EKD avoids re-distillation and transfers well to downstream applications. Furthermore, leveraging DINOv2 as the teacher, whose feature distribution is built from billions of data, the student can swiftly mimic the superior distribution in a data-efficient manner. Compared with the SOTA pre-training method, I2EKD yields superior or comparable features at 1/15 of the training cost (1/10 data × 2/3 epochs). Extensive experiments on different vision tasks (object recognition, semantic segmentation, and monocular depth) verify the effectiveness of our method. Notably, I2EKD achieves top-1 object recognition accuracy of 70.72%, leading the pre-training SOTA by 5.89%.
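
The 1/15 training-cost figure quoted above follows from the two stated factors, under the assumption (made here for illustration) that training cost scales with the product of dataset size and number of epochs:

```latex
\[
\text{relative training cost}
  = \underbrace{\frac{1}{10}}_{\text{data}} \times \underbrace{\frac{2}{3}}_{\text{epochs}}
  = \frac{2}{30}
  = \frac{1}{15}
\]
```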

Abstract:
Tensor decompositions are powerful tools for capturing the low-rank structure of dynamic videos. However, existing tensor decompositions primarily consider pixel-wise interactions, thus capturing solely global spatio-temporal correlations and struggling to handle the complex patterns that are inherent to dynamic videos in real-world applications. To overcome this limitation, we propose a dynamic Bhattacharya-Mesner (DyBM) decomposition, which represents the dynamic video as a sum of terms, with each term being the convolution of a BM-rank 1 tensor and a learnable three-dimensional filter. The newly constructed filters enable DyBM decomposition to establish patch-wise interactions in BM-rank 1 tensors, effectively capturing both global and local spatio-temporal correlations in dynamic videos. We further provide a physical interpretation of the factors in DyBM decomposition and offer an in-depth discussion of its relationship to the original BM decomposition. To evaluate the effectiveness of DyBM decomposition, we build a dynamic video recovery model. To solve the model, we develop a corresponding optimization algorithm with a theoretical convergence guarantee. Extensive experiments verify that the DyBM decomposition-based method performs more favorably than state-of-the-art tensor decomposition-based methods, especially for dynamic videos.

Abstract:
Textural details are useful for image super-resolution, but many CNN-based methods ignore the high-frequency components and generate over-smoothed outputs. A knowledge-based contourlet inference network is proposed in this paper. Different from other CNN-based methods that directly infer high-resolution (HR) images, our model learns to reconstruct the HR image through a series of corresponding contourlet coefficients. Specifically, we first consider the low-pass subbands of the contourlet as the corresponding low-resolution (LR) image. Then, we feed it to the embedding net with residual blocks to provide adequate information for the contourlet coefficient prediction. Finally, we convert the estimation of contourlet coefficients into the estimation of generalized Gaussian distribution (GGD) parameters and design a corresponding loss function to ensure training stability, which explores the smoothness of the contour effectively and preserves the general structure and details of images. Experiments on four remote sensing datasets, four natural scenes and human-made content datasets, and an outdoor dataset demonstrate the superiority of the proposed model quantitatively and qualitatively.

Abstract:
Recent weakly supervised image dehazing (WSID) works have succeeded in improving models’ generalization ability to real scene dehazing by using generative adversarial networks (GANs) for unpaired image training. However, it is still difficult for current WSID methods to train one effective dehazing model for various scenes, since 1) they often leave residual haze due to insufficient generalization to the feature distribution of real scenes, and 2) they are prone to distortions such as color shifts, artifacts, or halos, owing to embedded manual priors or threshold hypotheses for image reconstruction. To solve the above problems, we propose a novel WSID model via physics-based decomposition (PBD), which estimates the atmospheric light, scattering coefficient, and scene depth of a real hazy input to effectively capture the illumination information and haze distribution, and recovers a preliminary dehazed image by minimizing a reconstruction loss. With this constraint, we design a discrete wavelet discriminator (DWD) to effectively improve the generalization to real scenes from both spatial and frequency aspects under the supervision of unpaired real clear images. Our PBD is a purely data-driven model free from any manual setting or partially correct prior, thus simultaneously ensuring the realness and visibility of dehazed images. Experiments on seven benchmarks verify the strong generalization ability of our PBD, which achieves SOTA dehazing performance with realistic details. Code will be published at https://github.com/NianWang-HJJGCDX/PBD
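
The three quantities named above (atmospheric light, scattering coefficient, scene depth) are the ingredients of the widely used atmospheric scattering model, I(x) = J(x) t(x) + A (1 - t(x)) with t(x) = exp(-β d(x)). The sketch below assumes that this standard model is the physical basis meant here; it only shows how a hazy image can be synthesized and inverted once those quantities are known, not how PBD estimates them. All function names and the toy data are hypothetical.

```python
import numpy as np

def transmission(depth, beta):
    """Transmission map t(x) = exp(-beta * d(x))."""
    return np.exp(-beta * depth)

def add_haze(clear, depth, beta, airlight):
    """Forward model: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)[..., None]  # broadcast over color channels
    return clear * t + airlight * (1.0 - t)

def remove_haze(hazy, depth, beta, airlight, t_min=1e-3):
    """Inverse model: J = (I - A * (1 - t)) / t, with t clamped for stability."""
    t = np.maximum(transmission(depth, beta), t_min)[..., None]
    return (hazy - airlight * (1.0 - t)) / t

# Toy data: a random "clear" RGB image and a random depth map.
rng = np.random.default_rng(0)
clear = rng.uniform(0.0, 1.0, size=(4, 4, 3))
depth = rng.uniform(0.5, 3.0, size=(4, 4))

hazy = add_haze(clear, depth, beta=0.8, airlight=0.9)
restored = remove_haze(hazy, depth, beta=0.8, airlight=0.9)
reconstruction_error = float(np.abs(restored - clear).max())
```

With the true parameters the round trip is exact up to floating-point error; in a learned setting, a reconstruction loss would compare the re-synthesized hazy image against the input instead.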

Abstract:
Capturing discriminative cues with attention mechanisms is crucial for solving the high inter-class similarity problem of person re-identification (Re-ID). Self-attention (SA) learns its own contextual information within a single sample using self-affinity between elements, and some works have demonstrated its superiority in person Re-ID. However, SA weakens some subtle semantic cues and additional visual cues such as backpacks, which makes it difficult to distinguish similar-looking persons. In this paper, we propose an internal-external context interaction (IEI) attention mechanism, which aims to exploit the interaction of inter-sample latent context information and intra-sample local context information to enhance the feature representation of each element. The mechanism is able to capture subtle differences between persons and additional visual cues using inter-sample difference information and rich detail information within the element neighborhood, improving the ability to distinguish similar persons. Based on this mechanism, we propose an internal-external context interaction network (IEINet) for extracting discriminative features from multiple dimensions. In addition, to capture more discriminative information, we propose a region-diverse loss to constrain the network. Extensive experiments validate the effectiveness of our IEINet and demonstrate that our approach attains state-of-the-art performance on several large-scale person Re-ID datasets.

Abstract:
Unsupervised action segmentation (UAS) aims to identify action boundaries in long, untrimmed videos without the use of annotations. This involves learning discriminative frame features and applying a segmentation mechanism to organize frames into coherent action segments. However, most of the common approaches ignore the importance of multi-scale temporal interactions within the video sequence, resulting in a limited frame representation capability and inaccurate action boundary detection. In this paper, we propose MulSclTE, a novel UAS framework that incorporates multi-scale temporal interactions across global, clip, and frame levels to enhance the overall performance. To address the limited representation capability, we first present global-level interaction enhancement by implementing a bi-directional temporal encoding mechanism, designed to capture comprehensive information across the entire sequence. Then, we devise a hierarchical self-supervised loss function equipped with a clip-level interaction constraint that aims to bring temporally adjacent clips closer while separating non-adjacent ones. To precisely identify action boundaries, we provide comprehensive information by integrating frame-level prediction errors and similarity scores to alleviate the under-segmentation issue, and present a refinement mechanism to mitigate the over-segmentation issue. Extensive experiments on Breakfast, YouTube Instructions, 50Salads, and EPIC-KITCHENS show that MulSclTE attains leading or second-best performance across all datasets, and even exceeds some supervised methods in MoF and F1 metrics, underscoring its robustness and effectiveness.

Abstract:
Inferring a globally consistent normal orientation for point clouds remains challenging, especially for noisy and non-watertight point clouds. To improve accuracy and robustness in orientation inference, a Boundary-Aware Consistent Normal Orientation (BACNO) method is proposed. Its main idea is to transform the normal orientation problem into a boundary-aware narrow band grid partitioning problem. The procedure is as follows. First, an unsigned distance field for the input point cloud is computed on a regular grid. The field is then trimmed to a boundary-aware narrow band grid around the point cloud. Next, the narrow band grid is segmented into two parts, with each part located on one side of the input point cloud. Finally, a coarse-to-fine normal orientation strategy is presented to achieve the globally consistent orientation. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art methods, particularly for noisy and non-watertight point clouds.
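
The first two steps above (an unsigned distance field on a regular grid, then trimming to a narrow band) can be sketched as follows. This is only a toy 2D illustration with brute-force distances, not the BACNO implementation: the function names, the fixed band `width`, and the circle test data are assumptions for this example, and the boundary-aware trimming, segmentation, and orientation steps are not shown.

```python
import numpy as np

def unsigned_distance_field(points, grid_size, lo=-1.0, hi=1.0):
    """Distance from every node of a regular grid to the nearest input point."""
    axis = np.linspace(lo, hi, grid_size)
    gx, gy = np.meshgrid(axis, axis, indexing="ij")
    nodes = np.stack([gx.ravel(), gy.ravel()], axis=1)      # (G*G, 2)
    diffs = nodes[:, None, :] - points[None, :, :]          # (G*G, N, 2)
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    return nearest.reshape(grid_size, grid_size)

def narrow_band(udf, width):
    """Boolean mask of grid nodes within `width` of the point cloud."""
    return udf <= width

# Toy point cloud: 200 samples on a circle of radius 0.5.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = 0.5 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

udf = unsigned_distance_field(circle, grid_size=65)
band = narrow_band(udf, width=0.1)
```

Here the grid node at (0.5, 0) lies on the sampled circle and falls inside the band, while the grid center and corners lie outside it, on opposite sides of the curve.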

Abstract:
The popularity of template-generated videos has recently experienced a significant increase on social media platforms. In general, videos from the same template share similar temporal characteristics, which are unfortunately ignored by current compression schemes. In view of this, we aim to examine how such temporal priors from templates can be effectively utilized during the compression process for template-generated videos. First, a comprehensive statistical analysis is conducted, revealing that the coding decisions, including the merge, non-affine, and motion information, across template-generated videos are strongly correlated. Subsequently, leveraging such correlations as prior knowledge, a simple yet effective prior-driven compression scheme for template-generated videos is proposed. In particular, a mode decision pruning algorithm is devised to dynamically skip unnecessary advanced motion vector prediction (AMVP) or affine AMVP decisions. Moreover, an improved AMVP motion estimation algorithm is applied to further accelerate reference frame selection and the motion estimation process. Experimental results on the versatile video coding (VVC) platform VTM-23.0 demonstrate that the proposed scheme achieves moderate time reductions of 14.31% and 14.99% under the Low-Delay P (LDP) and Low-Delay B (LDB) configurations, respectively, while maintaining negligible increases in Bjøntegaard Delta Rate (BD-Rate) of 0.15% and 0.18%, respectively.

Abstract:
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., “ride” can be depicted as “race” and “sit on”, from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.

Abstract:
Image-text matching, a fundamental cross-modal understanding task, presents unique challenges in weakly-aligned scenarios. Such data typically feature highly abstract textual captions with sparse entity references, creating a significant semantic gap with visual content. Current mainstream methods, primarily designed for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to achieve feature space mapping. However, they struggle with information asymmetry and modal heterogeneity in weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods based on strong alignment assumptions, this framework mines latent image semantics through vision-language models and generates auxiliary captions, overcoming the information bottleneck of traditional text modalities. Its core innovation lies in an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction. This mechanism effectively filters noise when introducing external knowledge while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust fusion weights among original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and significantly improving model flexibility. Additionally, multi-scale feature extraction further enhances cross-modal representation capabilities. Experimental results show that the proposed method performs excellently in image-text retrieval tasks on the MSCOCO and Flickr30K datasets, validating its effectiveness.

Abstract:
Infrared small targets are typically tiny and locally salient, and thus belong to the high-frequency components (HFCs) of images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutter. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. In particular, DHiF is sensitive to HFCs, because the dynamic parameters of its generated filters are symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combined with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without a significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement. Codes are available at https://github.com/TinaLRJ/DHiF

Abstract:
Light field (LF) spatial super-resolution (SR) aims at reconstructing high-resolution LF images from low-resolution observations. Recently, subspace disentangling has been widely adopted in numerous methods. By decomposing high-dimensional LF data into spatial, angular and epipolar subspaces, the learning difficulties of deep networks can be significantly reduced. Although achieving continuously improved SR performance, several fundamental issues (e.g., the relative importance of each subspace) remain underexplored, leading to redundant network parameters and high model complexity. In this paper, we revisit this classical mechanism and conduct an empirical study to investigate these issues. Specifically, we first develop a simple, modular, and scalable LF spatial SR network, based on subspace disentangling. We then conduct extensive experiments to quantitatively evaluate the contributions of each subspace branch, the model scaling property, and the depth-width trade-off. Through comprehensive analyses, the inherent patterns are identified, based on which we derive optimal network designs under varying parameter budgets. Without bells and whistles, our method achieves state-of-the-art performance with reduced model size. Code and pre-trained models are available at https://github.com/fyzhang2024/SimSSR/

Abstract:
Dataset distillation has recently attracted significant attention for its ability to distill massive datasets into fractions of their original size while preserving their essential properties. To realize lossless performance, recent efforts like trajectory matching have attempted to mitigate effectiveness decay in high images-per-class (IPC) settings. However, we observe that such trajectory matching-based methods either fail to capture hard patterns within expert trajectory generation, or neglect the differentiation of samples during trajectory matching, leading to suboptimal results. To remedy these issues, we propose a novel framework Nested Difficulty Matching (NDM) featuring two key innovations: 1) Difficulty-Weighted Sampling (DWS), which enhances the capture of hard patterns in expert trajectories by assigning higher weights to misclassified samples, and 2) Nested Sliding Matching (NSM), which prevents distortion of existing patterns and enables precise alignment by incrementally introducing blocks and dynamically adjusting the trajectory matching range. Experimental results demonstrate that NDM achieves a 12.4% average accuracy improvement over the baseline method on three widely used datasets. Notably, our proposed NDM outperforms SOTA methods in high IPC settings, attaining near-lossless distillation on Tiny-ImageNet with a 20% compression ratio. The source code of this work is publicly available at: https://github.com/aether-hang/NDM
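
The core idea of Difficulty-Weighted Sampling as described above (higher sampling weight for misclassified samples) can be illustrated with a minimal sketch. This is not the NDM implementation: the function name, the `hard_weight` value of 3.0, and the toy labels are assumptions for this example, and the abstract does not specify how the weights are actually computed.

```python
import numpy as np

def difficulty_weights(predictions, labels, hard_weight=3.0):
    """Sampling probabilities that favor misclassified (hard) samples.

    hard_weight is a hypothetical factor: how many times more likely a
    misclassified sample is to be drawn than a correctly classified one.
    """
    misclassified = predictions != labels
    weights = np.where(misclassified, hard_weight, 1.0)
    return weights / weights.sum()

# Toy example: samples 1 and 4 are misclassified by the current model.
labels = np.array([0, 1, 2, 1, 0, 2])
predictions = np.array([0, 2, 2, 1, 1, 2])

probs = difficulty_weights(predictions, labels)
rng = np.random.default_rng(0)
# Draw a mini-batch of indices according to the difficulty-aware probabilities.
batch = rng.choice(len(labels), size=4, replace=True, p=probs)
```

With these toy values each hard sample is drawn with probability 0.3 and each easy sample with probability 0.1, so hard patterns appear more often during expert trajectory generation.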

Abstract:
Simultaneous Localization and Mapping (SLAM) is a crucial technique in computer vision and has been widely used in robot navigation, virtual reality, and augmented reality. Although SLAM algorithms are relatively mature, their performance is severely degraded under low-light conditions. As a core module of SLAM algorithms, feature matching under low-light conditions faces huge challenges. Existing methods train a general feature matching model for all low-light conditions, ignoring the noise modeling needed for better accuracy. In this work, we introduce LLFeat, a noise-aware method for feature matching under various low-light conditions. To compress the models of various low-light conditions, we further introduce a Noise-aware Feature Modulation (NaFM) layer to the model structure. As a result, the models can share most of their parameters and keep only private NaFM layers for each low-light condition, boosting the accuracy of feature matching with negligible additional parameters. LLFeat achieves remarkable results on the highly challenging MID benchmark in both indoor and outdoor scenes, demonstrating the effectiveness of our method. The code will be released.

Abstract:
Light field (LF) super-resolution (SR) is a fundamental yet challenging task requiring effective processing of high-dimensional 4D spatial-angular information. While Data Augmentation (DA) has proven powerful for 2D image restoration, existing static DA methods struggle to accommodate the complex geometric correspondences inherent in 4D light fields. They often disrupt angular consistency, leading to suboptimal performance. In this paper, we conduct a comprehensive analysis revealing that high-frequency components serve as critical indicators of informational value across different perspectives and spatial contexts. Building upon this insight, we propose CutDEM4D, a novel dynamic DA approach tailored to harness 4D LF information efficiently. Unlike conventional static methods, CutDEM4D dynamically extracts patches by maximizing high-frequency information and synthesizes augmented samples using adaptive frequency-derived weights. This dynamic mechanism effectively integrates multi-view consistency with contextual awareness, ensuring that the augmented data preserves the structural integrity required for LFSR. Our method acts as a model-agnostic plug-and-play module compatible with various network architectures. Extensive experimental results demonstrate that CutDEM4D significantly enhances the utilization of 4D information, consistently improving performance across state-of-the-art LFSR networks. Furthermore, we validate its robustness and generalization capability in downstream tasks, including LF denoising, depth estimation, and real-world super-resolution.

Abstract:
Traditional Test-Time Adaptation (TTA) methods primarily focus on updating the parameters of a pre-trained source model to better fit the target domain. In contrast, recent diffusion-driven TTA approaches leverage an unconditional diffusion model trained on the source domain to map target samples towards the source distribution, without modifying the model parameters. In this paper, we propose to combine the strengths of model adaptation and data adaptation to achieve more effective alignment between the source model and target data. Unlike existing two-stage methods that perform model and data adaptation independently, we introduce a unified Collaborative Model and Data Adaptation (CMDA) framework that integrates the two processes in a mutually beneficial manner. Specifically, model predictions on synthetic target samples serve as category-discriminative signals to guide the reverse diffusion process during data adaptation. Conversely, the synthetic data generated through data adaptation are used to progressively update and refine the source model. This bidirectional collaboration between model and data adaptation occurs iteratively, progressively aligning the source model with the target data. To further enhance prediction accuracy, we designed a lightweight and learnable aggregation network that ensembles predictions from the source and adapted models on both the original and synthetic target samples. This network dynamically integrates complementary predictions, improving the robustness and confidence of the final outputs. Extensive experiments on four benchmark datasets demonstrate that CMDA achieves state-of-the-art performance under the TTA setting.

Abstract:
Model-heterogeneous federated learning (MHFL), supporting FL collaboration across clients with heterogeneous models, has become a more practical FL paradigm. Existing MHFL methods enable knowledge fusion over heterogeneous client models by sharing partial homogeneous parameters or extracted label-wise average representations, suffering from model performance bottlenecks and privacy leakage risks. To bridge this gap, we propose a novel model-heterogeneous Federated learning method with homogeneous Representation Subspace Learning (FedRSL) instead of sharing model parameters or representations. In FedRSL, each client’s local heterogeneous model comprises a feature extractor and a prediction header. 1) We construct a homogeneous representation subspace for each client to learn local representation knowledge, and the server aggregates them to generate the global representation subspace for representation knowledge fusion. 2) To facilitate representation learning capability while maintaining efficient communication and computation, we design a lightweight linear model as the homogeneous low-rank linear representation subspace. For each local data sample, its personalized representation extracted by the feature extractor is processed by the global representation subspace to produce the corresponding generalized representation. 3) To effectively bi-transfer global generalized and local personalized knowledge, we reduce the distance between the local personalized representation and the corresponding generalized representation. Experiments on 3 computer vision and 1 natural language processing benchmark datasets over 6 baselines demonstrate that FedRSL obtains state-of-the-art model accuracy (up to 5.51% accuracy improvement) while consuming low communication and computation overheads.

Abstract:
Tracking is a core technique for analyzing complex fish behaviors, such as schooling and predator avoidance. However, this task presents unique and severe challenges compared to generic object tracking of rigid targets like pedestrians or vehicles. Fish exhibit extreme non-rigid deformation and erratic motion, while underwater environments are characterized by poor illumination and low visibility. These issues, compounded by the need for lightweight, real-time deployment in high-density scenarios, often lead to catastrophic target loss and identity switching in conventional trackers. To tackle these specific challenges, we propose M4FT, a lightweight and robust online multiple fish tracking framework. To overcome the limitation of CNNs in capturing large deformations due to local receptive fields, and the high latency of Transformers, we design M4Net as the detection backbone. By pioneering the Vision Mamba architecture in this domain, M4Net leverages selective state-space modeling to achieve global contextual modeling comparable to Transformers but with linear complexity. It efficiently captures the flexible morphology of fish, all while maintaining a lightweight footprint. Furthermore, to counteract adverse underwater conditions, we integrate an optional UIE module that adaptively enhances imagery, synergistically improving detection robustness without relying on computationally expensive appearance-based re-identification. Experimental validation on the challenging BrackishMOT benchmark shows that M4FT sets a new state-of-the-art, achieving the highest HOTA of 29.2 while incurring only ~10% of the computational cost of mainstream models. Our code, pre-trained models, datasets, and supplementary materials are available at vranlee.github.io/M-4-FT.

Abstract:
Learned reference picture resampling control (LRPRC) adaptively adjusts the coding scale for each frame using an offline-trained neural network. It demonstrates promising rate-distortion (R-D) performance improvements over traditional methods, particularly in high-resolution, low-bit-rate video coding scenarios. However, existing LRPRC methods rely exclusively on locally optimal decision labels derived from greedy strategies for network training, leading to suboptimal control performance. To address this limitation, we introduce a novel data-centric solution that substantially improves training label quality, thereby enhancing overall LRPRC performance. Specifically, our key contribution is a parallelized beam search-based coding scale labeling algorithm, which captures decision dependencies across coding steps and produces higher-quality training labels with enhanced R-D performance. By fully exploiting the intra-trellis and inter-trellis parallelism of beam search and hierarchical coding, our proposed labeling algorithm achieves logarithmic-squared time complexity, making it highly suitable for large-scale cluster computing. We validate this simple yet effective data-centric LRPRC approach in the Versatile Video Encoder (VVenC) using 4K video sequences. Experimental results demonstrate that merely upgrading the beam search labels (without any neural architecture re-designs) consistently outperforms the state-of-the-art LRPRC method, achieving BD-rate reductions of 5.09%, 3.98%, and 3.59% under the fast, medium, and slow presets, respectively.
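
The contrast between greedy labels and beam-search labels can be illustrated with a minimal sketch. This is not the paper's parallelized labeling algorithm: the cost function `toy_cost`, the candidate scales, and the beam widths are hypothetical, chosen only to show how a locally cheapest first decision can lead to a worse overall sequence.

```python
from typing import Callable, List, Sequence, Tuple

def beam_search_labels(
    num_frames: int,
    scales: Sequence[float],
    step_cost: Callable[[Tuple[float, ...], float], float],
    beam_width: int,
) -> Tuple[List[float], float]:
    """Return the lowest-cost scale sequence found with a fixed-width beam.

    beam_width=1 reduces to the greedy strategy.
    """
    beam = [((), 0.0)]  # (partial scale sequence, accumulated cost)
    for _ in range(num_frames):
        # Extend every surviving prefix with every candidate scale.
        candidates = [
            (prefix + (s,), cost + step_cost(prefix, s))
            for prefix, cost in beam
            for s in scales
        ]
        # Keep only the beam_width cheapest partial sequences.
        candidates.sort(key=lambda item: item[1])
        beam = candidates[:beam_width]
    best_seq, best_cost = beam[0]
    return list(best_seq), best_cost

def toy_cost(prefix: Tuple[float, ...], scale: float) -> float:
    # Hypothetical cost: scale 0.5 is cheapest for the first frame,
    # but choosing it makes every later frame expensive.
    if not prefix:
        return 1.0 if scale == 0.5 else 2.0
    return 5.0 if prefix[0] == 0.5 else 1.0

greedy_seq, greedy_cost = beam_search_labels(3, [0.5, 1.0], toy_cost, beam_width=1)
beam_seq, beam_cost = beam_search_labels(3, [0.5, 1.0], toy_cost, beam_width=2)
print(greedy_seq, greedy_cost)
print(beam_seq, beam_cost)
```

In this toy setting the wider beam keeps the initially more expensive choice alive long enough to discover that it yields the cheaper full sequence, which is the decision dependency that greedy labeling misses.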

Abstract:
Recognizing actions directly from compressed video offers significant advantages, such as low storage demands, efficient decoding, and fast inference. Existing methods for compressed video achieve promising performance by separately modeling spatial and motion cues and directly fusing the recognition results of I-frames and P-frames. However, these approaches overlook the following inherent attributes of compressed videos: 1) Temporal misalignment between I-frames and P-frames impairs the accuracy of action recognition. 2) Spatiotemporal sparsity of compressed video frames severely hinders the semantic modeling of complex actions. 3) Semantic discrepancy between I-frames and P-frames, which capture appearance and motion information respectively, leads to suboptimal performance when they are fused directly. To address these challenges, we propose a Hierarchical Semantics Interaction Network (HSINet) that ensures refined semantic modeling of compressed video through alignment, interaction, and calibration within a hierarchical fusion framework. Specifically, to resolve temporal misalignment, we propose an efficient cross-modal temporal alignment module that fully combines the spatiotemporal information of P-frames and the spatial information of I-frames, and includes both pre-alignment and fine alignment stages. To mitigate semantic degradation caused by sparsity, we propose a cross-modal semantics interaction module that provides multi-scale semantics interaction between spatial and temporal representation learning and enhances the representations’ spatiotemporal awareness. To calibrate the semantic imbalance between I-frames and P-frames, we propose a modal imbalance calibration module that optimizes directional differences via cosine similarity in hyperspherical space. Experiments on the HMDB-51, UCF-101, and Kinetics-400 benchmarks demonstrate the effectiveness of hierarchical semantic interaction for compressed video action recognition.
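
As a rough illustration of what "directional differences via cosine similarity in hyperspherical space" can mean, the sketch below projects two feature vectors onto the unit hypersphere and measures how far their directions differ. It is an assumption-laden toy, not HSINet's calibration module: the function names and the `1 - cosine` penalty are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence

def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two unit-normalized vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))

def directional_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical penalty: 0 when directions agree, 2 when they are opposite."""
    return 1.0 - cosine_similarity(a, b)

# Stand-ins for an I-frame feature and a P-frame feature.
i_frame_feat = [3.0, 4.0]
p_frame_feat = [6.0, 8.0]   # same direction, different magnitude
print(directional_loss(i_frame_feat, p_frame_feat))
```

Because both vectors are normalized first, the penalty depends only on direction and ignores magnitude, which is why a pair with different norms but the same orientation incurs essentially no loss.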

Abstract:
Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging their high-speed and high-dynamic-range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to the event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.

Abstract:
Arbitrary-scale video super-resolution (VSR) aims to enhance video resolution at continuous scales and has attracted increasing attention in recent years. However, existing methods typically rely on fixed degradation modes, such as bicubic downsampling, which often fail to handle the complex degradations of real-world videos. Current real-world datasets only cover limited scales (e.g., ×2, ×4) and are insufficient to capture the diverse degradations required for arbitrary-scale VSR. To address this, we present RealArbVSR, the first real-world VSR dataset with both integer and decimal scale factors, providing a wider range of degradation levels. Moreover, to generate continuous degradations beyond the collected scales, we propose the Continuous Degradation Generation Network (CDGN), which synthesizes realistic LR videos with arbitrary degradations. Specifically, we design a Scale-aware Degradation Module (SDM) to adaptively learn scale-specific degradations and an Implicit Filter Module (IFM) that represents spatial-temporal features as a continuous feature domain for arbitrary-scale LR frame generation. Extensive experiments demonstrate that our CDGN trained on RealArbVSR produces high-fidelity LR videos with arbitrary degradations and significantly enhances the performance of VSR models in real-world scenarios.

Abstract:
Reliable gait recognition under low-light conditions remains challenging for traditional cameras. Event cameras, with their high dynamic range and fine temporal resolution, offer a promising alternative but suffer from sparse signals under small motion amplitudes and vulnerability to noise or irrelevant movements (e.g., shadows). To address these issues, we propose FusionGait, a complementary fusion framework that integrates event with standard frames to achieve robust gait recognition. Specifically, we propose a Self-supervised Hierarchical Feature Extractor (SSL-HFE) built upon DINOv2, which employs learnable prompts to bridge the gap between gray frames and RGB frames, extracts multi-level semantic features, and enhances their discriminability through a self-supervised learning strategy. Then, we introduce the Complementary Fusion Learning Module (CFLM), which employs cross-cost volumes to explicitly model pixel-level correlations between frames and events, enabling effective cross-modal interaction and fusion. Furthermore, we propose EvGSimulator, a sensor-specific data augmentation strategy that simulates diverse illumination conditions based on physical properties. The framework maintains robustness even when real frames are unavailable by reconstructing frame-like representations from events, and it scales to ultra–high-frame-rate scenarios with hundreds of frames per second. We also collect DAVIS346-Gait-RGE, the multi-view semi-indoor & outdoor gait dataset captured with a DAVIS346 event camera, including three modalities: event streams, gray frames, and reconstructed frames. Experiments across multiple datasets show FusionGait achieves state-of-the-art performance, effectively surpassing single-modality methods.

Abstract:
Fashion recommendation is crucial for consumers to express their self-image and personal style. To boost recommendation accuracy and rationality, researchers have explored state-of-the-art (SOTA) methods that incorporate items’ visual and textual information, along with their pairing records. However, these methods fail to fully explore how latent semantic-stylistic correlations between items influence recommendations, and inadequately tackle exposure bias, which results in skewed, restricted recommendations that prioritize popular or frequently observed outfits. To address these limitations, we propose DRFusionRec, a fashion recommender that enhances recommendation rationality and diversity by fully leveraging item semantic-stylistic correlations while mitigating exposure bias. Specifically, to harness rich item correlations, we introduce a novel Multi-Factor Relationship Measurement (MRM) matrix. It integrates semantic and stylistic features by mining the synergistic interaction probabilities across semantically and stylistically adjacent items, capturing the latent compatibility patterns. This matrix is then used to refine item features for richer details. To address homogeneity from exposure bias, we propose an Adaptive Propensity Score (APS) strategy. By dynamically weighting item popularity (direct influence) and popularity of style-similar neighbors (indirect contextual influence), we model exposure confounders to derive item propensity scores. Integrating these scores into item features effectively mitigates the bias. Lastly, MRM and APS synergistically optimize item representations for rationale-based and diverse recommendations. Experimental results validate that the DRFusionRec outperforms SOTA methods in capturing item compatibility, ensuring diverse recommendations, and maintaining reasonable complexity.

Abstract:
Due to the limited training samples in few-shot object detection (FSOD), we observe that current methods may struggle to accurately extract effective features from each channel. Specifically, this issue manifests in two aspects: i) channels with high weights may not necessarily be effective, and ii) channels with low weights may still hold significant value. To handle this problem, we consider utilizing inter-channel correlation to ensure that the novel model can effectively highlight relevant channels and rectify incorrect ones, thereby strengthening channel quality. Since the channel sequence is also 1-dimensional, its similarity to a temporal sequence inspires us to use Mamba to model the correlation in the channel sequence. Based on this concept, we propose the Spatial-Channel State Space Modeling (SCSM) module for spatial-channel-sequence modeling to accurately extract effective features from each channel. In SCSM, we design the Spatial Feature Modeling (SFM) module to ensure the quality of spatial feature representations. We then introduce the Channel State Modeling (CSM) module, which treats channels as a 1-dimensional sequence and uses Mamba to capture the correlation between channels. Extensive experiments on the VOC and COCO datasets show that the SCSM module enables the novel detector to improve the quality of channel feature representations and achieve state-of-the-art performance. Code is released at https://github.com/zhimengXin/SCSM

Abstract:
Existing zero-shot deepfake detection methods are often constrained to specific scenarios and struggle in diverse, complex scenarios. To address this limitation, we propose a universal zero-shot deepfake detection method. This method models the common forgery traces across different domains as domain-invariant features and introduces a novel domain-invariant meta-learning strategy. This strategy embeds the mechanism of domain-invariant learning into a meta-learning framework, enabling the model not only to extract specific domain-invariant features from certain domains, but also to leverage the meta-learning mechanism of fast adaptation to new domains. As a result, the model is capable of effectively capturing the intrinsic domain-invariant characteristics of deepfake images, thereby achieving universal zero-shot deepfake detection. Extensive comparative experiments demonstrate that the proposed method achieves the highest average detection AUC (86.96%) across 28 unseen datasets, representing an improvement of 8.02% over the second-best method (78.94%). Moreover, it is the only method that is effective in all four zero-shot scenarios, which strongly validates its superior zero-shot detection performance and universality. Code is released at https://github.com/QinQin741/DIML

Abstract:
Automated Visual Inspection is a cornerstone of modern manufacturing, yet the development of robust deep learning models is frequently impeded by the scarcity and imbalance of training data. This challenge is particularly acute for industrial defects, which often manifest as subtle anomalies intrinsically linked to complex, structured backgrounds. To address this challenge, CRG-DefectDiffuser is proposed as a collaborative generative framework for high-fidelity defect synthesis. At its core lies the Collaborative Refinement Guidance (CRG) mechanism, which orchestrates two specialized diffusion models: one trained on abundant defect-free images to capture background context, and another trained on scarce defect patches to encode fine-grained defect semantics. The CRG mechanism steers the synthesis by dynamically generating a guidance map, which is refined through a four-stage process to ensure that morphologically accurate defects are seamlessly integrated into the appropriate background context. Augmenting training data with our method boosts the defect detection mAP@50-95 from a baseline of 0.496 to 0.557, corresponding to a 12.3% relative improvement. The framework also demonstrates superior scalability, with performance gains continuing up to the three-fold data augmentation evaluated in our experiments, a point where competing methods often falter. These results establish CRG-DefectDiffuser as an effective and practical solution to data scarcity in industrial visual inspection, with strong potential for generalization across diverse manufacturing scenarios. The source code is publicly available at https://github.com/weihang-luo/CRG-DefectDiffuser
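
The reported gain can be checked with simple arithmetic. The short sketch below uses only the two mAP@50-95 values stated in the abstract; the function name is a hypothetical helper. It separates the absolute gain (in mAP points) from the relative gain (as a fraction of the baseline), since the two are easy to confuse.

```python
def gains(baseline: float, improved: float) -> tuple:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative

absolute_gain, relative_gain = gains(0.496, 0.557)
print(f"absolute gain: {absolute_gain:.3f} mAP")
print(f"relative gain: {relative_gain * 100:.1f}%")
```

The absolute gain is about 0.061 mAP (6.1 points on a 0 to 100 scale), while the relative gain over the 0.496 baseline is about 12.3%.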

Abstract:
Diffusion model-based networks have been widely applied in the field of image generation and have gradually demonstrated a strong potential in image colorization tasks. However, despite the emergence of various colorization diffusion models, two major challenges remain: 1) the lack of effective control over the colorization process and 2) the prevalent issue of color bleeding. Integrating suitable conditional control can effectively alleviate these challenges. To this end, we propose a unified multi-modal diffusion model that harnesses diverse modality information to achieve flexible and high-quality colorization. Specifically, we introduce a Stroke-Adapter that extracts and integrates stroke prompt, enhancing user control over color distribution. Additionally, we design an Edge-Guided Attention mechanism to effectively inject edge information into the colorization process, significantly reducing color bleeding artifacts. Extensive comparative experiments demonstrate that our method outperforms state-of-the-art image colorization approaches in both qualitative and quantitative evaluations, achieving superior colorization results with enhanced controllability.

Abstract:
Quantization and chroma downsampling are the two primary operations that introduce distortions in JPEG compression. However, most existing blind methods treat artifact removal as a direct mapping from compressed images to clean ones. They fail to explicitly model the underlying degradation process or design targeted compensation mechanisms. As a result, these methods can only partially remove compression artifacts and struggle to generalize to diverse or unseen degradation scenarios. In this work, we present a novel perspective that formulates artifact removal as an approximate inversion of the lossy steps in JPEG. Based on this view, we propose an Inverse JPEG Compression Network (IJCN), which aims to progressively compensate for quantization errors and color distortions. Specifically, we first design a Learnable Offset Guidance Module (LOGM) to approximate inverse quantization by modeling both intra-block and inter-block coefficient correlations for predicting rounding offsets. In addition, we propose a Quantization Table Guidance Module (QTGM) that leverages the quantization tables to guide the reconstruction network in mitigating color distortions. By modeling compensation mechanisms under the guidance of quantization tables, IJCN effectively eliminates artifacts across varying compression levels. Extensive experiments demonstrate that IJCN outperforms existing methods in both quantitative metrics and visual quality.
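
To make the idea of a "rounding offset" concrete, the following sketch quantizes a single DCT coefficient and shows how much information the rounding step discards. It is a toy under stated assumptions, not the LOGM module: the coefficient value and quantization step are made up, the offset is applied additively before rescaling, and the "oracle" offset stands in for what a network would have to predict.

```python
import math

def quantize(coeff: float, q_step: int) -> int:
    """Lossy step: divide by the quantization-table entry, round to nearest integer."""
    return math.floor(coeff / q_step + 0.5)

def dequantize(level: int, q_step: int, offset: float = 0.0) -> float:
    """Standard decoding uses offset=0; an offset in [-0.5, 0.5) compensates rounding."""
    return (level + offset) * q_step

coeff, q_step = 37.0, 16             # hypothetical coefficient and table entry
level = quantize(coeff, q_step)       # 37 / 16 = 2.3125 rounds to 2
plain = dequantize(level, q_step)     # standard reconstruction: 2 * 16 = 32

# Oracle offset: the fractional part lost to rounding (unknown to a real decoder).
oracle_offset = coeff / q_step - level
compensated = dequantize(level, q_step, oracle_offset)

print(level, plain, oracle_offset, compensated)
```

A standard decoder reconstructs 32 instead of 37; a perfectly predicted offset would recover the original coefficient, which is why larger quantization steps (stronger compression) leave more room for such compensation.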

Abstract:
Lossy image compression is essential for Mars exploration missions, due to the limited bandwidth between Earth and Mars. However, the compression may introduce visual artifacts that complicate the geological analysis of the Martian surface. Existing quality enhancement approaches, primarily designed for Earth images, fall short for Martian images due to a lack of consideration for the unique Martian semantics. In response to this challenge, we conduct an in-depth analysis of Martian images, yielding two key insights based on semantics: the presence of texture similarities and the compact nature of texture representations in Martian images. Inspired by these findings, we introduce MarsQE, an innovative, semantic-informed, two-phase quality enhancement approach specifically designed for Martian images. The first phase involves the semantic-based matching of texture-similar reference images, and the second phase enhances image quality by transferring texture patterns from these reference images to the compressed image. We also develop a post-enhancement network to further reduce compression artifacts and achieve superior compression quality. Our extensive experiments demonstrate that MarsQE significantly outperforms existing approaches for Earth images, establishing a new benchmark for the quality enhancement on Martian images. The code is available at https://github.com/keriphLiu/MarsQE

Abstract:
Image inpainting aims to restore missing regions by leveraging surrounding spatial context, where nearby pixels provide crucial structural cues and distant regions offer complementary semantic guidance. To jointly model these complementary dependencies, this paper proposes Hierarchical Sequential Context Modeling (HSCM), a novel inpainting framework that employs state-space models for multi-scale autoregressive sequence modeling. Unlike existing single-scale SSM-based approaches, HSCM explicitly separates pixel-level and semantic-level modeling into two complementary branches. The Local Perception Unit preserves fine-grained textures, and the Global Compensation Unit propagates high-level semantics across patches to enhance overall coherence. The asynchronous hierarchical design first reconstructs local textures and then performs semantic compensation, achieving notable performance gains with minimal computational overhead. Leveraging its four-directional architecture, HSCM maintains linear computational growth with spatial resolution and effectively establishes a comprehensive global receptive field. Furthermore, a Cross-Gated Feedforward Network is proposed to alleviate patch boundary artifacts and enhance inter-channel feature consistency. Built upon a multi-scale encoder–decoder architecture, HSCM delivers state-of-the-art inpainting quality and robust generalization across diverse benchmarks, including CelebA-HQ, FFHQ, Paris Street View, and Places2.

Abstract:
Light field cameras usually sacrifice spatial resolution to increase angular resolution. Although they capture rich spatial and angular information, their spatial resolution is greatly reduced. Existing super-resolution methods usually focus on spatial and angular super-resolution but ignore end-to-end super-resolution estimation of the disparity map. Meanwhile, recent depth estimation methods cannot extract a higher-resolution disparity map directly from low-resolution light field images. Motivated by this issue, we propose an end-to-end light field image super-resolution depth estimation network, named E2SRLF. First, a multi-dimensional channel attention mechanism is introduced to reduce the influence of occlusion and weak textures and to enhance the learning of local and global features. Then, a spatial super-resolution fusion upsampling method is proposed for constructing the super-resolution dimension and acquiring precise high-resolution information. Additionally, we introduce a loss function based on a high-low resolution collaborative constraint to improve network training efficiency. Experimental results demonstrate that E2SRLF can generate a highly accurate high-resolution depth map directly from low-resolution light field images. Compared with most state-of-the-art light field depth estimation methods, E2SRLF directly produces a more accurate high-resolution disparity map from lower-resolution light field inputs. The code of our method is available at: https://github.com/sansi-zhang/E2SRLF

Affiliations: College of Artificial Intelligence, Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; Department of Scientific Research, People’s Hospital of Yubei District of Chongqing City, Chongqing, China; West China Biomedical Big Data Center, Sichuan University, Sichuan, China; School of Information Science and Technology, Northwest University, Xi’an, China; School of Artificial Intelligence, Xidian University, Xi’an, China; School of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract:
Image aesthetic assessment (IAA) is a challenging task due to the subjectivity and abstraction of aesthetic perception. Psychological studies reveal that aesthetic experiences often trigger emotional responses, while comment texts directly reflect people’s expressions of aesthetics and emotions. However, existing multimodal IAA methods neglect the alignment between modalities. To address this, we propose a multimodal emotion-alignment cognitive network (MEC-Net) for IAA, employing strategies of emotion alignment, subjective-objective interaction, and multimodal fusion. First, an emotion alignment module is introduced to align image and text modalities using emotional stimuli, enhancing the consistency of heterogeneous modal features. Then, a subjective and objective representation module is proposed to extract multi-source information from text and images separately. Next, a subjective-objective interactive LSTM (SO-LSTM) is designed to capture the deep interaction between images and text in aesthetic understanding. Finally, a dynamic multimodal fusion (DMF) module based on low-rank decomposition is proposed to integrate subjective, objective, and subjective-objective interactive modal features for aesthetic distribution prediction. Extensive experiments and qualitative analysis on image aesthetic benchmarks indicate that the proposed MEC-Net outperforms the state of the art on three IAA tasks. Further, we add emotion-classification task-driven evaluation metrics to verify the strong generalizability of the proposed MEC-Net.

Abstract:
Human daily action recognition is a challenging task because of the complex structure of actions. To address this challenge, we present Multi-layer Representation Learning for Multi-modal Action Recognition (MR-MAR). In the atom encoding layer, we design a Motion Atom spatial Alignment (MAA) module to construct contrastive pairs and realize spatial alignment by pulling positive pairs closer while pushing negative pairs further away. Moreover, we devise a Multi-modal motion Atom Learning (MAL) module to effectively capture the nonlinear relationships of multi-modal motion atoms, thereby enhancing semantic diversity and discriminability. In the phrase encoding layer, we combine motion atoms into multi-scale motion phrases as high-level representations and carry out Motion phrase Temporal Modeling (MTM) to capture temporal patterns by updating the hidden states. We evaluate MR-MAR on four multi-modal benchmarks, achieving improvements ranging from 0.9% to 9.4% on the NTU RGB+D 120 dataset, 0.6% to 5.5% on the N-UCLA dataset, and 0.4% to 6.2% on the PKU-MMD dataset, and competitive performance on the Toyota-Smarthome dataset.

Abstract:
Accurate and efficient roadside cooperative perception is crucial for reducing blind spots and extending sensing ranges. However, it faces challenges in modeling long-short range cooperative dependencies and representing the heterogeneous-density distribution of cross-infrastructure data. While CNNs, Transformers, and State-Space Models have demonstrated superior performance, they inherently struggle to balance the flexibility of long-short range receptive fields with computational costs. Additionally, frequency-domain decomposition remains underutilized for heterogeneous-density data representation. In this work, we propose an innovative Asymmetric Multi-Frequency Scale-Adaptive Mamba (AsymMamba) framework, performing lightweight heterogeneous-density data decomposition to support scalable long-short range cooperative representation. First, an Asymmetric Multi-Frequency Decomposition (AsymFreq) module is designed with wavelet transforms, which unifies the spatial distribution representation of heterogeneous-density data in the frequency domain while mitigating information loss through asymmetric scale partitioning. Subsequently, AsymMamba designs a Scale-Adaptive State-Space Model (AdaSSM) module with a spatial compression and channel expansion mechanism. It not only effectively captures local short-range semantic information but also efficiently models global long-range cooperative dependencies with linear complexity. Experiments on the real-world DAIR-V2X and RCooper datasets demonstrate that AsymMamba outperforms state-of-the-art methods, including the Transformer-based CoBEVT and recent Mamba-based variants. Specifically, it achieves 3.4%, 4.3%, and 0.6% 3D object detection improvements at AP@0.5 in vehicle-to-infrastructure cooperation, complex intersection, and long-range corridor roadside cooperative perception scenarios, respectively. Moreover, AsymMamba achieves superior real-time efficiency, with inference 4x faster than CoBEVT at a 100 m sensing range and 7x faster in a 200 m long-range scenario. The code will be available on GitHub.

Affiliations: School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; School of Artificial Intelligence and the National Engineering Laboratory of Big Data System Computing Technology, Shenzhen University, Shenzhen, China; Computer Vision and Pattern Recognition Laboratory, School of Engineering Science, Lappeenranta-Lahti University of Technology (LUT), Lappeenranta, Finland; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland

Abstract:
Traditional skin-contact physical sensors typically detect changes in blood volume to predict the periodicity of the heartbeat by analyzing the absorption spectra of hemoglobin. However, contact with human skin may cause discomfort and make long-term monitoring difficult. Recently, video-based remote photoplethysmography (rPPG) estimation approaches analyze periodic facial color changes to match the cardiac cycle in a contactless manner. Nevertheless, the inherent relationship between changes in facial color and blood volume is not fully exploited. Besides the influence of blood volume (i.e., hemoglobin), other factors such as lighting and reflection also change facial color. We exploit the physical principles that cause skin color variations to separate the hemoglobin factor driven by blood volume. Based on the physical prior of the reflection of human skin, we introduce an rPPG estimation network assisted by a decoupled hemoglobin sequence, named HemNet, which is the first to explicitly leverage hemoglobin to assist rPPG signal estimation. To obtain meaningful hemoglobin from facial video, we design a human skin color disentangler that decouples the facial color variations into four significant features, i.e., hemoglobin, melanin, shading, and specular. We then present a multi-modality rPPG estimator that utilizes cross-covariance attention to extract fused features from hemoglobin and RGB video inputs. Finally, an adaptive negative Pearson loss is proposed to effectively address phase misalignment between the blood volume in the finger and the facial region during the training phase. We evaluate HemNet on four widely used public benchmark datasets. The superiority of our method is demonstrated in both intra-dataset and cross-dataset test settings. The code is available at https://github.com/jingang-cv/hemnet
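
As an illustration of the loss family mentioned above, the following is a minimal sketch of a negative Pearson loss together with a shift-tolerant variant. The abstract does not specify how the adaptive loss is defined, so the lag search below (the function `shift_tolerant_neg_pearson` and its `max_shift` parameter) is a hypothetical stand-in chosen only to show how a phase offset between a finger reference signal and a facial estimate could be tolerated; it should not be read as HemNet's formulation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def neg_pearson_loss(pred, target):
    """Standard negative Pearson loss: 0 when perfectly correlated, 2 when anti-correlated."""
    return 1.0 - pearson(pred, target)


def shift_tolerant_neg_pearson(pred, target, max_shift=3):
    """Hypothetical variant: smallest negative Pearson loss over integer lags.

    Assumption for illustration only: phase misalignment is modeled as a
    whole-sample delay of at most max_shift samples in either direction.
    """
    best = float("inf")
    n = len(pred)
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            p, t = pred[s:], target[:n - s]   # pred lags target by s samples
        else:
            p, t = pred[:n + s], target[-s:]  # pred leads target by |s| samples
        best = min(best, neg_pearson_loss(p, t))
    return best
```

With a prediction that is a two-sample delayed copy of the reference, the plain loss stays well above zero while the shift-tolerant variant drops to approximately zero, which is the behavior a phase-aware loss is meant to provide.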

Abstract:
Continual cross-modal hashing is critical for efficient retrieval across heterogeneous modalities in dynamic environments. Yet, existing approaches primarily focus on mitigating catastrophic forgetting, while overlooking two key challenges: 1) the hash collision arising from the excessive utilization of the Hamming space across tasks, and 2) the absence of consistency modeling for cross-modal dynamic associations. To address these challenges, we introduce Prompt-driven Bit Extension Hashing (PBEH), a novel framework that dynamically extends hash codes to prevent hash collisions and capture evolving modality-aligned semantics in continuously expanding multi-modal data. Specifically, PBEH first adaptively initializes a set of modality-shared prompts for each task, which are jointly optimized with the hashing functions to enhance model plasticity and retain task-specific knowledge, enabling continual cross-modal semantic alignment. In parallel, a dynamic Hamming space extension mechanism allocates dedicated capacity per task, alleviating bottlenecks and inter-task collisions. During retrieval, queries are encoded via the extended hash functions and matched to stored codes using a truncated strategy for compatibility. To ensure efficiency and semantic stability, only the prompts and hashing functions are updated while the pre-trained backbone remains frozen. Extensive experiments demonstrate that PBEH achieves superior and stable performance in continual cross-modal retrieval with substantially reduced computational overhead. The source codes and datasets are available at https://github.com/Liuwwhh/PBEH
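
The retrieval step described above can be sketched as follows, under stated assumptions. The abstract does not define the truncated strategy, so this sketch assumes that bit extension appends new bits at the end of a code (earlier bit positions keep their meaning) and that a longer query code is compared against a shorter stored code by keeping only its leading bits. The names `truncated_hamming` and `retrieve`, and the length normalization, are hypothetical and not taken from PBEH.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings such as '1010'."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def truncated_hamming(query, stored):
    """Normalized distance using only the first len(stored) bits of the query.

    Assumption: the query comes from the extended hash function, so it is at
    least as long as any stored code; a shorter query raises ValueError.
    """
    return hamming(query[:len(stored)], stored) / len(stored)


def retrieve(query, database, k=1):
    """Return the ids of the k stored (item_id, code) pairs closest to the query."""
    ranked = sorted(database, key=lambda item: truncated_hamming(query, item[1]))
    return [item_id for item_id, _ in ranked[:k]]
```

Dividing by the stored code length keeps distances comparable between items hashed at different stages; whether the original method normalizes in this way is not stated in the abstract.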

Abstract:
Image-text retrieval (ITR) is a pivotal task in cross-modal research. However, existing methods often suffer from a fundamental yet overlooked challenge: redundancy. This issue manifests as both semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments. This not only inflates computational costs but also degrades retrieval accuracy by masking salient features and reinforcing spurious correlations. In this work, we are the first to explicitly analyze and address the ITR problem from a redundancy perspective by proposing the iMage-text rEtrieval rEdundancy miTigation (MEET) framework. MEET employs a cascaded, two-stage process to systematically mitigate both forms of redundancy. First, for Semantic Redundancy Mitigation, it repurposes deep hashing and quantization as synergistic tools, producing compact yet highly discriminative representations. Second, for Relationship Redundancy Mitigation, it progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. The structural integration of these modules under a unified optimization objective provides a clear and interpretable pathway to retrieval. Extensive experiments on multiple benchmarks demonstrate that MEET consistently surpasses state-of-the-art methods, validating its effectiveness and generalizability.

Abstract:
The primary challenges in image-level weakly supervised semantic segmentation (WSSS) lie in addressing the under-activation issue of target pixels and mitigating the co-occurrence phenomenon in class activation maps. In recent years, Vision-Language Models (VLMs) have demonstrated exceptional performance across various vision tasks, primarily attributed to their cross-modal semantic alignment capabilities achieved through contrastive learning mechanisms. Leveraging the capability of VLMs to capture fine-grained visual-textual correspondences, this paper proposes a novel Vision-Language Driven Prompt Learning (VLD-PL) framework that addresses two fundamental challenges in WSSS by establishing explicit semantic correspondences between textual descriptors and visual components, ultimately enabling efficient semantic segmentation. The VLD-PL framework consists of two core components: Auxiliary Class Matching (ACM) and Background Class Filtering (BCF). The ACM module dynamically identifies semantically relevant auxiliary classes through feature alignment between image and textual embeddings, effectively enlarging target activation while mitigating co-occurrence interference by expanding semantic coverage. Simultaneously, the BCF module constructs image-specific background prompts and adaptively refines background feature representations, achieving precise suppression of irrelevant background regions. These dual mechanisms synergistically address both target localization accuracy and background noise suppression, achieving state-of-the-art performance on both the PASCAL VOC 2012 and MS COCO 2014 benchmarks.

Abstract:
Universal Domain Adaptation (UniDA) aims to achieve cross-domain knowledge transfer without label set assumptions. UniDA primarily faces two challenges: domain alignment under label shift and identifying unknown class samples in the target domain. We propose a cross-domain class context optimization method for UniDA to address these two challenges, leveraging a contrastive language-image pre-training model containing learnable prompts. First, we develop a domain context-guided feature augmentation technique, which augments source domain features based on textual features related to the target domain style, improving the consistency of feature distributions between the source domain and target domain. Subsequently, we learn a set of class contexts suitable for the target domain using the augmented source domain features. Furthermore, to improve the ability of the class context to filter unknown class samples, we propose a local known-unknown entropy optimization strategy, which effectively reduces the interference from class-irrelevant semantic information in the images, thereby mitigating erroneous class matching under label shift. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves competitive results compared to advanced UniDA approaches. Additionally, experimental results show that our entropy optimization strategy can serve as a general optimization component for prompt tuning, enhancing the generalization performance of existing methods when applied to downstream tasks with label shift.

Abstract:
Vision analysis tasks often experience substantial performance drops when processing images captured in low-light environments. Existing approaches introduce infrared images to provide complementary information, and fusion strategies are consequently developed to combine the advantages of visible light images and infrared images. While fusion methods are typically designed to enhance visual quality with the expectation of improving task performance, many existing approaches tend to overemphasize perceptual fidelity, which can inadvertently compromise task-specific features. To address this issue, we propose a task-driven hierarchical fusion (TDHF) framework designed to retain task-relevant information throughout the fusion process. Specifically, TDHF adopts a multi-scale hierarchical architecture to capture rich feature representations and incorporates a multi-head attention mechanism to model cross-modal interactions. In addition, we introduce a single-step denoising generation module that progressively guides the fusion of infrared edge features and texture details from low-light images. This process ultimately reconstructs the task-critical Y channel in the YCrCb color space. Extensive experiments on object detection under low-light conditions across four benchmark datasets demonstrate that TDHF effectively enhances task performance, achieving up to 1.1% higher mAP on LLVIP and 1.3% higher mAP on FLIR compared with state-of-the-art methods, without relying on excessive optimization of image quality (PSNR).
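
To make the role of the Y channel concrete, here is a small sketch of replacing only the luma of a pixel while keeping its chroma. It assumes the common full-range BT.601 convention for 8-bit values (Y = 0.299R + 0.587G + 0.114B, with Cr and Cb offset by 128); the abstract does not state which YCrCb variant TDHF uses, and `replace_luma` is a hypothetical helper for illustration, not part of the method.

```python
DELTA = 128.0  # chroma offset for 8-bit images (assumed convention)


def rgb_to_ycrcb(r, g, b):
    """Convert one RGB pixel to (Y, Cr, Cb) using BT.601 luma weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + DELTA
    cb = (b - y) * 0.564 + DELTA
    return y, cr, cb


def ycrcb_to_rgb(y, cr, cb):
    """Exact algebraic inverse of rgb_to_ycrcb (no clipping or rounding)."""
    r = y + (cr - DELTA) / 0.713
    b = y + (cb - DELTA) / 0.564
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def replace_luma(visible_rgb, fused_y):
    """Keep the visible pixel's chroma and swap in a new luma value.

    Hypothetical helper: fused_y stands in for the output of any fusion model.
    """
    _, cr, cb = rgb_to_ycrcb(*visible_rgb)
    return ycrcb_to_rgb(fused_y, cr, cb)
```

For a neutral gray pixel the chroma terms sit at the offset, so changing Y simply brightens or darkens the pixel without shifting its color.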

Abstract:
When acquiring images or videos of electronic displays, moiré patterns often arise due to aliasing between overlapping pixel grids, substantially compromising the perceptual quality of the captured content. Although frequency domain techniques have demonstrated high efficacy in image demoiréing, existing video approaches often overlook inter-frame frequency contextual relationships. This limitation restricts their capacity to achieve consistent temporal coherence and reconstruction fidelity. To overcome these challenges, we introduce a novel network (STFNet) with spatial-temporal filtering in the frequency domain for video demoiréing. The proposed architecture comprises two dedicated stages: 1) Temporal-Guided Filtering (TGF), which adaptively incorporates temporal cues into learnable bandpass filters, and 2) Joint Filtering with Partially Shared Passbands (JFPS), which enhances representation learning of low-frequency moiré textures through strategic parameter sharing. Comprehensive evaluations on multiple public benchmarks confirm the superiority of our method. STFNet consistently outperforms state-of-the-art alternatives across both quantitative metrics and perceptual quality assessments, demonstrating robust performance in dynamic moiré suppression and detail preservation.

Abstract:
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) seeks to correlate unseen hand-drawn sketches with unseen real images by leveraging models trained on visible categories. Recent CLIP-based models, which primarily focus on visual-textual interaction, have demonstrated strong competitiveness in ZS-SBIR. However, they still fall short in exploring cross-modal visual representations, especially in terms of cross-modal visual shared and specific information. Unlike these studies, we start with class-level, prompt-level, and patch-level visual information, aiming to unlock the potential of visual feature representation. On this foundation, we introduce the vision-centric Triple-way visual modulATion (TAT) framework to enhance the model’s perception of visual shared and specific information. Specifically, we establish unified multi-modal perception by integrating a visual-level modality prompter into the CLIP architecture. We then conduct triple-way modulation modeling on prompt, token, and patch levels to effectively mine shared and specific features. Lastly, we develop an enhanced calibration strategy incorporating prompt-aware, token-aware, and logit-aware alignment modules to amplify the model’s proficiency in probing shared-specific features. We thoroughly test our approach to confirm its effectiveness and the efficacy of individual components. The comparison results on the popular datasets Sketchy, Sketchyv2, Tuberlin, and QuickDraw show that the developed algorithm significantly surpasses current state-of-the-art methods.

Abstract:
Previous blind face restoration (BFR) methods have primarily leveraged facial priors from pretrained GAN or diffusion models. These neural BFR models suffer from diverse neural degradations, such as prior bias, topological distortion, textural distortion, and artifact residues, which limit their generalization in complex real-world scenarios. In this paper, we propose an effective framework, InfoBFR, to address neural degradation from an information-theoretic perspective, which achieves BFR boosting in diverse wild and heterogeneous scenes. Specifically, on the basis of the results from pretrained BFR models, InfoBFR considers information compression by using a manifold information bottleneck (MIB) and manifold information compensation (MIC) with efficient diffusion LoRA to conduct information optimization. InfoBFR effectively synthesizes high-fidelity faces with texture and structure boosting. Comprehensive experimental results demonstrate the high boosting performance of InfoBFR (nearly 82%) for state-of-the-art GAN-based and diffusion-based BFR methods, while it completes BFR tasks in approximately 70 ms and has 4M trainable parameters. InfoBFR is the first unified postprocessing restorer that can be universally employed by diverse BFR models to overcome the limitations of neural degradation.

Abstract:
Multimodal large language models (MLLMs) have attracted considerable attention for their impressive capabilities in understanding and generating visual-language content, particularly in tasks such as visual question answering (VQA). However, the rapid evolution of knowledge in real-world applications poses challenges for these models: offline training becomes increasingly costly, and exposure to non-stationary data streams often leads to catastrophic forgetting. In this paper, we propose CL-MoE+, a dual-momentum Mixture-of-Experts (MoE) framework based on MLLMs for continual VQA. Our method integrates continual learning into MLLMs to leverage the rich commonsense knowledge embedded in large language models. We introduce a Dual-Router MoE (RMoE) module that selects both global and local experts through task-level and instance-level routers, enabling robust and context-aware expert allocation. Furthermore, we design an adaptive Momentum MoE (MMoE) to update experts’ parameters based on the knowledge drift degree and their relevance to specific tasks, thereby facilitating knowledge integration without forgetting. Extensive experiments on a 10-task split of the VQA v2 benchmark demonstrate that CL-MoE+ achieves state-of-the-art performance, validating its effectiveness in both retaining historical knowledge and learning new information in the continual learning setting.

Abstract:
Dynamic human relighting is a complex task with significant demand across various applications. Its core challenge lies in effectively handling both dynamic human geometry reconstruction and material estimation. Existing works mainly achieve human reconstruction and relighting with Neural Radiance Fields, but they are inefficient and still inaccurate in material estimation and relighting. In this paper, we propose a novel approach called ReGA, which leverages efficient 3D Gaussian Splatting to create animatable and relightable avatars from sparse-view human motion. The training process consists of two stages: the geometry stage and the material stage. To overcome the geometric weakness of the vanilla Gaussian representation, we introduce a dynamic alignment mechanism in the geometry stage, combining the advantages of Gaussian splatting and mesh-based representation to produce a reasonable human surface. In the material stage, we enhance the inverse rendering process by introducing twofold correlation strategies that establish chrominance correlation between Gaussian radiance color and albedo. Experiments demonstrate that our method outperforms existing approaches in the dynamic human relighting task.

Abstract:
3D motion magnification aims to visualize subtle, imperceptible motions by integrating Eulerian video magnification with novel view synthesis. The existing method extracts the variation of feature embeddings over time using Neural Radiance Fields (NeRF). However, this volume rendering technique suffers from two shortcomings for 3D motion magnification: 1) when reconstructing time-varying scenes through volume rendering, spatial-temporal operations between static and dynamic representations often generate noticeable artifacts, leading to blurred magnified frames; 2) when processing high-resolution dynamic scenes, the intrinsically low rendering efficiency of these techniques causes excessive computational latency, preventing real-time visualization. In this work, instead of NeRF, we propose a novel Time-varying Gaussian Splatting for 3D Motion Magnification (TG4MM) that is capable of achieving real-time rendering while effectively handling blurred magnified frames in dynamic 3D motion magnification scenes. Specifically, we propose a motion-space decoupled triplane modeling approach. The space triplane captures major spatial structures from the first frame, while the motion triplane captures subtle motion information from subsequent frames. Furthermore, we develop a phase-based motion magnification module that enhances subtle motions by applying filters within the embedding space and subtle motion triplane. Experimental results demonstrate the effectiveness of our method, showing that it outperforms existing 3D motion magnification techniques and achieves speeds of up to 126 FPS.

Abstract:
Robotic manipulation requires learning a generalizable policy that can adapt to complicated new environments. However, existing methods typically overlook the inherent task complexity and employ a policy with the same budget for tasks with varied difficulties, facing challenges in inefficient computational resource allocation and zero-shot generalization. In this work, we identify three facets of complexity imbalance issues in the current manipulation tasks at the Inter-task, Intra-task, and Noise-timesteps levels. To address this gap, we introduce the Complexity-Aware Policy (CAP), a novel approach integrating flow matching with a Transformer-based backbone and a Mixture of Heterogeneous Experts (MoHE) structure for policy learning. By leveraging Rectified Flow and dynamically adjusting model capacity based on task complexity, which is assessed through features like object counts and precision needs, our method allocates computational resources efficiently and effectively. This results in faster convergence, optimized computational resource usage, and improved precision across diverse manipulation tasks. Our proposed method achieves the state-of-the-art performance on widely-used CALVIN, LIBERO, and SimplerEnv benchmarks. Our method is further validated through six real-world experiments, where it consistently outperforms baseline methods across all tasks.

Abstract:
6-DoF grasp pose estimation is crucial for achieving robust robot manipulation. Despite significant progress in data-driven methods, the cross-view adaptability of 6-DoF grasp pose estimation remains insufficient. Under the condition of observing from certain viewpoints, the performance of pose estimation is relatively low. In response, this paper introduces VRGraspNet, a novel 6-DoF grasp pose estimation model designed to enhance the robustness and performance of robotic grasping across diverse viewpoints. The key to the VRGraspNet lies in filling in the holes in the scene point cloud and learning multimodal features for seed points. The former provides dense neighborhood points for the seed point, while the latter provides richer information for extracting geometric features of the local area to which the seed point belongs. Furthermore, this paper proposes a Performance-Viewpoint jointly Weighted Loss (PVWL), with the key being two weight factors: a static viewpoint position dependent weight factor that focuses on challenging samples, and a dynamic performance related weight factor that focuses on samples difficult to learn. Extensive experiments on the GraspNet-1Billion dataset demonstrate that VRGraspNet achieves SOTA performance and strong cross-view robustness. Real-world robot experiments further validate its practicality in robotic manipulation tasks. Our source code is available at https://github.com/huamo555/VRGraspNet

Abstract:
Previous deblurring methods mostly tackle global motion blur due to camera shake but struggle with local motion blur from object movement, facing challenges like the random motion blur locations, data imbalance, directional ambiguity, and positional uncertainty. To fill the vacancy of real-world local motion deblurring, we establish ReLoBlur, the first real-world local motion deblurring dataset. ReLoBlur is captured by a synchronized beam-splitting photographing system and annotated via our developed Local Blur Foreground Mask Generator (LBFMG). To bridge the gap between local and global motion deblurring, we propose a Local Blur-Aware Gated network (LBAG) with gate blocks to focus deblurring on blurred regions, and a Blur-Aware Patch Cropping Strategy (BAPC) to address the data imbalance problem. Acknowledging directional ambiguity and positional uncertainty from shooting errors and non-uniform object motion, we enhance LBAG with LBAGp, guided by center-related distance, and optimized by a symmetric minimization loss. Extensive experiments prove the reliability of the ReLoBlur dataset, and demonstrate that LBAG and LBAGp achieve better local motion deblurring performance compared to state-of-the-art (SOTA) CNN-based deblurring methods.

Abstract:
Identifying novel human-object interaction (HOI) classes with scarce data is a challenging and crucial task in computer vision. Existing methods mainly use coarse global visual information to build class prototypes in meta-learning. Despite their promising results, these methods often fail to capture fine-grained interaction semantics and effectively learn from data with low inter-class variance, leading to suboptimal performance in distinguishing similar categories. To overcome these issues, we propose a new model called hierarchical relation network for few-shot HOI recognition (FS-HOI). This model integrates multi-level interaction clues, spanning from coarse to fine-grained, to enhance HOI features. It employs a unified graph network to capture intra- and inter-relationships among human parts with contextual information, augmented by language-guided attention for semantic mining within each interactive sub-graph. In contrast to conventional methods that depend on global class prototype comparisons, our approach advances metric learning by integrating contrastive mechanisms, utilizing rich instance pairs as comparative references to effectively address inter-class variance. Furthermore, a graph relation network leverages prior knowledge of unknown HOIs, embedding task-specific features into contrastive instances. Our method establishes a new state-of-the-art on three few-shot HOI datasets, with substantial performance gains and ablation studies confirming the efficacy of each component.

Abstract:
Multimodal large language models (MLLMs) integrate sophisticated large vision models (LVMs) to empower large language models (LLMs) with vision ability to perceive, reason, and interact in vision-language (V-L) tasks, while the modality bridge between two specialists becomes the bottleneck that translates visual signals into linguistic representations. However, most of the existing methods train the modality bridge with coarse-grained image-text pairs, neglecting the structural mapping between V-L semantics that facilitates modality translation from LVMs to LLMs. To mitigate this, we propose a Constituency-Tree-Induced Multimodal Bridging mechanism (CTIMB) that learns the fine-grained connection from LVMs to LLMs by the structural guidance from multi-modal constituency tree. Our approach consists of: 1) the multi-modal constituency-tree parser that jointly exploits the semantic structure of vision and language; 2) the lightweight connector that translates visual signals into linguistic representation and re-arranges them according to the constituency-tree structure; 3) the dynamic construction loss that aids in aligning the semantic structures derived from the tree parser and the connector. The CTIMB can learn the fine-grained mapping between visual and linguistic semantics, seamlessly bridge the LVMs and LLMs to enhance V-L tasks, and is more cost-efficient compared with current methods. Extensive experiments have demonstrated that our method more accurately interprets the visual features, enabling LLMs to conduct downstream tasks more effectively, and achieve superior performance with less training cost.

Abstract:
Plain text has become the dominant interactive interface for text-driven human volumetric video generation. However, its limited customization options hinder users from expressing motion effects with accuracy. For example, plain text struggles to specify continuous variables such as motion amplitude, speed, and joint trajectories with precision, and it fails to convey stylized motion characteristics. Additionally, crafting detailed textual prompts for complex motion sequences is cumbersome, while excessively long prompts strain text encoders. To address these limitations, we propose a rich text-based framework that supports font styles, sizes, and trajectory sketching. By extracting motion-related attributes from rich text, our method enables fine-grained control over motion styles, precise speed regulation, and accurate joint trajectory manipulation. These capabilities are realized through gradient-guided noise editing and ControlNet-based motion optimization, which operate within the latent motion diffusion process. Specifically, we design a unified gradient-guided adaptation mechanism to ensure that the generated motion video adheres strictly to the specified constraints. Furthermore, we introduce realism-oriented optimization for stylistic and joint-level control, refining motion synthesis at a granular level to produce smoother, more natural movements. We present multiple comparative evaluations showcasing volumetric video generation from both rich text and plain text. Through quantitative analysis, we demonstrate that our method surpasses strong plain-text baselines, producing expressive, customizable human volumetric motion videos.

Abstract:
Weakly supervised video anomaly detection aims to identify abnormal snippets in untrimmed videos. Existing methods learn prototypes to describe the global representation of snippet distributions. However, in weakly-labeled videos, the normal snippets in an abnormal video may take high-uncertainty labels for distribution modeling. Without confidence-aware modeling, abnormal/normal prototype distributions may overlap with each other, leading to inaccurate predictions. In this work, we propose the Unified Confident Prototype (UCP) model, which contains a feature extractor, a confidence-aware prototype learner, and a local-global prototype unifier. The prototype learning is designed to ensure proper separability, stability, and representation. First, after learning the weight of each snippet’s loss, snippets with high-uncertainty labels may take small weights. These snippets tend to lie in the overlap between abnormal/normal distributions, hindering their separation. We design uncertainty-aware sampling, which removes high-uncertainty snippets in the small-weight snippets to ensure separable prototype learning. Second, snippets with high-uncertainty labels tend to be far from the prototype center, thus falling in the low-confidence region. These snippets may enlarge the distribution’s variation, resulting in unstable prototype learning. We design confidence-aware sampling, which removes low-confidence snippets to ensure stable prototype learning. Third, after assigning pseudo labels to prototypes, we measure the prototype representation with the distribution’s purity. We design prototype distribution purification, which penalizes normal snippets in the abnormal-majority distribution with purity loss to ensure representative prototype learning. Fourth, beyond prototype learning, prototypes can be enhanced by local/global temporal semantics. We further introduce the local-global prototype unifier to learn the relations across local-global durations, thereby enhancing the semantics for anomaly detection. For weakly-supervised anomaly detection, experiments demonstrate that our method achieves state-of-the-art performance on the UCF-Crime, ShanghaiTech, and XD-Violence datasets. Moreover, to verify the generality of our method, we also conduct experiments on THUMOS’14 for weakly-supervised temporal action localization.

Abstract:
Underwater images play a vital role in marine exploration, but are often severely degraded due to complex imaging conditions, including color distortion, haze effects, and non-uniform illumination. Existing deep learning-based enhancement methods predominantly rely on conventional RGB sensors, which struggle to distinguish between scattered and reflected light, thereby limiting enhancement performance. Polarization imaging, with its capability to capture directional light information, offers promising potential for underwater image enhancement. In this paper, we propose a lightweight yet effective polarization feature extractor that captures global spatial cues from polarization images. Additionally, we design a polarization-guided feature integration module that adaptively enhances the representational capacity of RGB features. Notably, the proposed module is plug-in and can be seamlessly integrated into existing RGB-based enhancement networks. Extensive experiments across multiple datasets demonstrate that incorporating polarization information significantly improves enhancement performance, highlighting its effectiveness as a valuable cue for underwater image enhancement. The code and pretrained models are at https://github.com/jgy0/UPGD

Affiliations: Guangzhou Institute of Technology, Xidian University, Guangzhou, China; Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu, China; College of Pharmaceutical Engineering of Traditional Chinese Medicine, Tianjin University of Traditional Chinese Medicine, Tianjin, China; School of Telecommunications Engineering, Xidian University, Xi’an, Shaanxi, China; School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China

Abstract:
Blur detection is a novel method for evaluating image quality in the context of Uncrewed Aerial Vehicles (UAVs) surveillance and monitoring activities. However, existing methods lack multi-scenario adaptability due to the overreliance of deep learning models on learned priors from limited datasets, reducing their adaptability to unfamiliar conditions. To address this issue, the Denoising Diffusion Implicit Model (DDIM) is integrated with a paradigm named Rule-based Semantic Calibration (RSC) to create Rule-Semantic Generative Calibration Blur Detection (RSGC-BD). This approach generates robust blur detection maps through an iterative calibration process that enhances generalization capabilities. Unlike current Blur Detection (BD) methods, which categorize pixels as blurred or unblurred with a single forward propagation, the suggested approach employs the DDIM-based generative model to create and refine a BD map iteratively. By utilizing the iterative calibration process through RSC to integrate rule-based blur masks into generative semantic results at each step, this model ensures high-precision blur prediction, enhanced multi-scenario adaptability, and significantly improved inference speed. Furthermore, we propose a conversion module, namely the Adaptive RGB-to-Grayscale Conversion Cascade (ARGC-Cascade), to convert RGB images to grayscale through adaptive integration, highlighting blurred regions and improving detection accuracy. This enhancement of blur features is achieved by balancing the spectral channel weights during image conversion. The superior performance of the proposed RSGC-BD approach is validated by extensive tests on four high-resolution BD datasets, including the newly introduced UAV-BD. Source codes are available at: https://github.com/udrs/RSGCBD

Abstract:
Existing SOTA methods in active learning object detection (ALOD) achieve impressive results, but they overlook two problems: 1) the requirement for instance-level labels during initialization, and 2) the constrained localization ability of the pre-trained fully supervised detector in the active learning phase. Problem 1) contradicts the fundamental purpose of active learning in balancing annotation costs and detection performance. Problem 2) arises from the fact that the active learning process relies on a single pre-trained fully supervised detector. To tackle these problems, we propose Image-level Labels Driven active learning object detection (termed ILD). Specifically, we propose a multi-step reasoning process based on chain-of-thought that uses only image-level labels, including a class-number-aware step and an iterative step, to enhance the detection ability of a vision-language model (VLM). The detection results of the VLM and weakly supervised detector are used as pseudo ground-truth boxes to initialize a fully supervised detector during AL initialization. Thus, the initialization process of ILD eliminates the requirement for instance-level labels. In the active learning stage, we design two novel uncertainty and diversity acquisition functions to select the most informative images based on collaborative outputs from both the weakly supervised detector and the pre-trained fully supervised detector. The collaborative mechanism jointly measures the uncertainty of two detectors and the diversity of object features, thereby enhancing the localization quality. Extensive experiments demonstrate that the proposed ILD achieves state-of-the-art performance (i.e., 77.5%, 25.7%, and 27.9%) on the PASCAL VOC2007, MS COCO2014, and MS COCO2017 datasets, surpassing the SOTA methods by 3.4%, 1.2%, and 4.7%, respectively. Our code is publicly available at https://github.com/RuiTianHIT/ILD

Abstract:
Multi-view clustering (MVC) has emerged as a powerful tool for analyzing complex datasets by leveraging consistent and complementary information from multiple sources. However, MVC faces three critical challenges in real-world scenarios: (1) sample-level misalignment due to unknown cross-view correspondence, which introduces noisy correlations, (2) feature-level heterogeneity from divergent dimensional spaces across views, which obscures shared discriminative patterns, and (3) dynamic-view inefficiency when integrating sequentially arriving data under privacy or sensor constraints. These challenges collectively hinder the clustering performance of existing studies, thus calling for a unified framework. To bridge this gap, we propose ASIA-MVC, an anchor-guided sample-and-feature incremental alignment framework for MVC, which is the first attempt at incremental learning on sample-unpaired multi-view data. First, the sample alignment module dynamically maps unpaired samples across views via anchor-based bipartite graphs. Second, the feature-aligned module employs an orthogonal decomposition strategy to unify heterogeneous feature spaces while preserving discriminative structures. Third, the novel incremental fusion framework integrates the dual-aligned modules under the guidance of shared anchors, enabling efficient cross-view representation learning. Furthermore, to solve the resulting problem, we develop a novel three-step alternate optimization algorithm with guaranteed convergence. Finally, the proposed method is validated in extensive experiments and achieves leading clustering efficiency and an outstanding sample-aligned effect.

Abstract:
The language modeling paradigm for scene text recognition (STR) has demonstrated impressive universal capabilities across extensive STR scenarios. However, existing methods still encounter challenges in effectively handling text images with irregular shapes and diverse appearances (e.g., curved, artistic, multi-oriented) due to the absence of contextual information during initial decoding. In this work, inspired by the principle of ‘forest before trees’ in human visual perception, we introduce NASTR, a non-autoregressive scene text recognizer that endows the attentional decoder with global awareness. Specifically, we design a global-to-local attention procedure, simulating the mechanism of globally holistic visual signal processing preceding locally detailed response in the human brain’s visual system. This is achieved by leveraging the global image information queries to condition the generation of glimpse vectors at each decoding time step. This procedure empowers the NASTR model to achieve on-par performance with its state-of-the-art autoregressive counterparts, while operating in a fully parallel manner. Moreover, we propose multiple optional and flexible encoding constraint components to alleviate the representation quality degradation caused by the global image information queries when handling multilingual and multi-domain STR tasks. These components constrain the global image features from the perspectives of global structure, global semantics, and linguistic knowledge. Extensive experimental results demonstrate that NASTR consistently outperforms existing methods on both Chinese and English STR benchmarks. Our source code, trained models, and logs are available at https://github.com/ML-HDU/NASTR

Abstract:
Functional and anatomical image fusion plays an important role in medical applications by combining information from multiple imaging modalities to retain functional features and anatomical details. Although deep learning-based methods have advanced the field, existing methods often struggle to capture local details and global context, especially when features span multiple scales. To address these limitations, we propose FAMAFuse, a novel multiscale attention mechanism designed specifically for functional and anatomical image fusion. FAMAFuse integrates three key innovations to overcome these challenges. First, the spatial attention residual module (SARM) models long-range global context, ensuring the fusion of relevant anatomical features across modalities. Second, the inter-modal feature fusion (IMFF) module fuses multi-source features, enhancing interaction between local details and broader structures. Finally, the multiscale Gaussian attention-infused module (M-GAIM) leverages a learnable Gaussian kernel to extract multiscale features, improving fusion quality across various imaging modalities. We validated FAMAFuse on SPECT-MRI, PET-MRI, and CT-MRI datasets. Experimental results demonstrate significant improvements over state-of-the-art fusion methods. In quantitative evaluations, FAMAFuse outperforms existing techniques by 4% to 10% in fusion quality and 0.5% to 6% in structural preservation. Furthermore, FAMAFuse exhibits excellent generalisation across different modalities, making it a suitable tool for clinical imaging. This method represents a promising solution for more accurate and informative medical diagnoses and clinical research. The source code is available at: https://github.com/Alphaalimamy/famafuse

Abstract:
Previous INN-based image hiding methods have demonstrated outstanding concealing and revealing performances; however, their robustness is compromised due to the unbiased and lossless characteristics inherent in strictly symmetric INNs. To address this limitation, we propose the Degradation-Adaptive Semi-Symmetric Network (DASS-Net) for robust image hiding. DASS-Net incorporates asymmetric units into the INN flow, injecting nonlinearity and adaptability into the bijection. Specifically, a Coupled Learning Unit (CLU) is proposed as the asymmetric unit in the concealing phase to facilitate more flexible information interaction, while the revealing process incorporates Degradation-Aware Modulation Units (DAMUs) as asymmetric units that capture degradation clues and customize modulation parameters, thus enabling dynamic adjustments for optimal revealing under various degradation scenarios. Additionally, we propose a Frequency Sub-bands Enhancement Module (FSEM) that leverages low-frequency features to assist the recovery of high-frequency features and strengthen hidden information, thereby mitigating information loss caused by degradation. Extensive experiments on the COCO and DIV2K datasets, covering a range of degradation types and degrees, indicate that DASS-Net markedly surpasses state-of-the-art methods, with more than 10% improvements in imperceptibility and robustness.

Abstract:
Underwater images, often affected by light attenuation and particle scattering, pose a challenge for restoration, aggravated by the difficulty in obtaining a substantial amount of annotated data. Existing methods have tackled this issue through the development of semi-supervised frameworks; however, they commonly lack a suitable strategy or rely on additional models trained on extra data to ensure the quality of pseudo-labels. To address this, we propose a self-assessment training framework for semi-supervised underwater image restoration (SAT-UIR). SAT-UIR employs a dual-task network (DT-Net) incorporating an auxiliary assessment task to align the restored image with a target structural similarity index measure (SSIM) score. This enables accurate restoration completeness estimation at the feature level and effective pseudo-label filtering during self-training. Leveraging multi-scale features, the assessment task also encourages the model to learn advantageous features for image restoration. Moreover, we integrate a soft ranking loss to further refine the training process of the auxiliary assessment task. Comprehensive experiments on various underwater benchmarks demonstrate that SAT-UIR outperforms state-of-the-art methods quantitatively and qualitatively. The code is available at https://github.com/aroid721/SAT-UIR

Abstract:
Few-shot class-incremental learning (FSCIL) challenges intelligent models to continually learn new classes from limited samples while maintaining performance on previous tasks. Mixup-based methods address this scenario by creating virtual samples, enabling the model to learn explicit or implicit virtual classification boundaries and improve its forward compatibility, thereby reserving space for new classes. However, existing methods typically focus on virtual class construction and lack a crucial comparison of explicit and implicit virtual classification boundaries, hindering the further exploitation of the potential of mixup-based strategies for the FSCIL task. To this end, this paper analyzes virtual classification boundaries and reveals that implicit virtual classification boundaries are more conducive to model forward compatibility but inferior in base model convergence compared with explicit ones. Subsequently, we construct implicit virtual classification boundaries based on inter-class correlations to improve their adaptability to new classes, and propose a novel proxy-based classification boundary alignment method to ensure the stability of real classification boundaries. This method enhances clustering by focusing on sample-to-proxy correlations and improves the model’s ability to discriminate among old and new classes in incremental sessions by integrating projection layers. Moreover, we propose to randomly mix the augmented labels generated by self-supervised learning (SSL) in a linear manner, which increases the virtual classification boundaries to incorporate new classes better. Experimental results on three benchmark datasets, CIFAR100, miniImageNet, and CUB200, demonstrate that our proposed method effectively enhances the model’s forward compatibility while maintaining stability and outperforms the state-of-the-art works.
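
The virtual samples that mixup-based methods rely on can be illustrated with a minimal sketch of generic mixup. This is not the boundary-alignment method proposed in the abstract; the function name `mixup`, the toy data, and the fixed mixing coefficient `lam` are assumptions made for illustration.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two samples and their one-hot labels."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix

# Toy data (assumed): two 4-dimensional inputs from classes 0 and 2 of 3.
x_a, y_a = np.ones(4), np.array([1.0, 0.0, 0.0])
x_b, y_b = np.zeros(4), np.array([0.0, 0.0, 1.0])

# lam is commonly drawn from a Beta(alpha, alpha) distribution; a fixed
# value keeps this example deterministic.
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.25)
```

The mixed label places the virtual sample between two real classes, which is one way such samples can reserve feature space for classes that arrive later.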

Abstract:
Supervised learning-based visual object tracking methods rely heavily on manually annotated video data. Semi-supervised learning-based visual object trackers offer a promising alternative by balancing tracking performance with annotation costs. However, the frame-to-frame dependency and the paired input characteristic of visual object trackers make existing semi-supervised learning methods unsuitable. To solve these problems, a novel semi-supervised learning framework is proposed to train a tracker, which significantly reduces annotation dependency while maintaining strong performance. It adopts an iterative “track-then-train” paradigm tailored to the frame-to-frame dependency characteristic of tracker training samples. In the “train” step, a Dual-Sample-Teacher training method is proposed to better leverage the paired input characteristic and special training samples for semi-supervised learning. In the “track” step, a Track Compensation module—comprising Object Integrity Prediction and Intersection over Union (IoU) Prediction modules—is introduced to enhance both the quantity and quality of pseudo-labels. To further improve performance, an Iterative Training strategy is presented. Experimental results demonstrate that on the GOT-10k dataset, the baseline tracker trained using the proposed semi-supervised approach—leveraging only 1% of the annotated labels—achieves a 17.6% improvement in Average Overlap (AO) over supervised learning with equivalent label coverage.
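
For reference, the Intersection over Union (IoU) quantity that the IoU Prediction module estimates can be computed for two known boxes as sketched below. The `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions; the abstract does not describe how the module predicts this value or how it is used to filter pseudo-labels.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```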

Abstract:
Partial label learning (PLL) is a paradigm in weakly supervised learning. The goal is to identify the ground-truth label from a set of candidate labels associated with a given sample. However, due to the ambiguity of labels, improving the accuracy of ground-truth label recognition is a challenge. In this paper, we propose an innovative training framework, PLMR, for solving the PLL problem. PLL data is rich in valid information but significantly noisy. To overcome this problem, PLMR incorporates mutual information (MI) theory to mine the potential information of candidate labels and data features to distinguish between positive and negative sample pairs. This operation effectively utilizes the raw data information in PLL and reduces the class conflict problem. In this way, the discriminative power of the representation is improved according to the positive and negative pair selection strategy. At the same time, PLMR introduces cluster centers to optimize subsequent tasks, combining data augmentation samples with the K-means cross-attention mechanism to refine the optimized cluster centers. This improves the ability of the cluster centers to accurately represent class information, thus improving the overall quality of the clusters. Further, through the ambiguity marking correction mechanism, weights are calculated based on the association between cluster centers and sample representations to guide model training. Experimental results show that PLMR demonstrates excellent classification performance on multiple datasets, verifying its effectiveness and sophistication.

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; Department of Ophthalmology, Yong Loo Lin School of Medicine, Centre for Innovation and Precision Eye Health, National University of Singapore, Queenstown, Singapore; Department of Radiology, West China Hospital, Sichuan University, Chengdu, China; School of Cyber Science and Engineering and the Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University, Chengdu, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore

Abstract:
The Medical Segment Anything Model (MedSAM) has demonstrated strong performance in medical image segmentation, attracting increasing attention in the medical imaging domain. However, as with many prompt-based segmentation models, its performance is highly sensitive to the type and location of input prompts. This sensitivity often leads to suboptimal segmentation outcomes and necessitates labor-intensive manual prompt tuning, which hampers both efficiency and robustness. To address this challenge, this paper proposes MedSAM-U, an uncertainty-guided framework designed to automatically refine prompt inputs and enhance segmentation reliability. Specifically, a Multi-Prompt Adapter is integrated into MedSAM, resulting in MPA-MedSAM, which enables the model to effectively accommodate diverse multi-prompt inputs. An uncertainty estimation module is then introduced to evaluate the reliability of the prompts and their initial segmentation results. Based on this, a novel uncertainty-guided prompt adaptation strategy is applied to automatically generate refined prompts and more accurate segmentation outputs. The proposed MedSAM-U framework is evaluated across multiple medical imaging modalities. Experimental results on five diverse datasets demonstrate that MedSAM-U achieves consistent performance improvements ranging from 1.7% to 20.5% over the baseline MedSAM, confirming its effectiveness and practicality for robust and efficient medical image segmentation.

Affiliations: School of Computer Science, University of Nottingham Ningbo China, Ningbo, Zhejiang, China; Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China; School of Engineering and Physical Sciences, University of Lincoln, Lincoln, U.K.; Chinese Academy of Sciences, Hong Kong Institute of Science and Innovation, Hong Kong, SAR, China; School of Computer Science, University of Nottingham, Nottingham, U.K.; The Hong Kong Polytechnic University, Hung Hom, Hong Kong

Abstract:
The universality of deep neural networks across different modalities and their generalization capabilities to unseen domains play an essential role in medical image segmentation. The recent segment anything model (SAM) has demonstrated strong adaptability across diverse natural scenarios. However, the huge computational costs, demand for manual annotations as prompts, and conflict-prone decoding process of SAM degrade its generalization capabilities in medical scenarios. To address these limitations, we propose a modality-decoupled lightweight SAM for domain-generalized medical image segmentation, named De-LightSAM. Specifically, we first devise a lightweight domain-controllable image encoder (DC-Encoder) that produces discriminative visual features for diverse modalities. Further, we introduce the self-patch prompt generator (SP-Generator) to automatically generate high-quality dense prompt embeddings for guiding segmentation decoding. Finally, we design the query-decoupled modality decoder (QM-Decoder) that leverages a one-to-one strategy to provide an independent decoding channel for every modality, preventing mutual knowledge interference between different modalities. Moreover, we design a multi-modal decoupled knowledge distillation (MDKD) strategy to leverage robust common knowledge to complement domain-specific medical feature representations. Extensive experiments indicate that De-LightSAM outperforms state-of-the-art methods in diverse medical imaging segmentation tasks, displaying superior modality universality and generalization capabilities. Notably, De-LightSAM uses only 2.0% of the parameters of SAM-H. The source code is available at https://github.com/xq141839/De-LightSAM

Abstract:
The transferability of adversarial examples allows for attacks on unknown deep neural networks (DNNs), posing a serious threat to many applications and attracting great attention. In this paper, we improve the transferability of adversarial examples by incorporating the Bayesian formulation into both the model parameters and model input, enabling their joint diversification. We demonstrate that the combination of Bayesian formulations for both the model input and model parameters yields significant improvements in transferability. By introducing advanced approximations of the posterior distribution over the model input, adversarial transferability is further enhanced, surpassing all state-of-the-art methods when attacking without model fine-tuning. Additionally, we propose a principled approach to fine-tune model parameters within this Bayesian framework. Extensive experiments demonstrate that our method achieves a new state of the art in transfer-based attacks, significantly improving the average success rate on ImageNet and CIFAR-10. We will make our code publicly available.

Abstract:
Deepfake detection refers to detecting artificially generated or edited faces in images or videos, which plays an essential role in visual information security. Despite promising progress in recent years, Deepfake detection remains a challenging problem due to the complexity and variability of face forgery techniques. Existing Deepfake detection methods are often devoted to extracting features by designing sophisticated networks but ignore the influence of perceptual quality of faces. Considering the complexity of the quality distribution of real and fake faces, we propose a deepfake detection framework called DeepFidelity, which mines the perceptual forgery fidelity of face images and introduces a quality-aware scoring mechanism to distinguish real and fake faces of different image qualities. Specifically, we improve the model’s ability to identify complex samples by mapping real and fake face data of different qualities to different scores to distinguish them in a more detailed way. In addition, we propose a network structure called Symmetric Spatial Attention Augmentation based vision Transformer (SSAAFormer), which uses the symmetry of face images to promote the network to model the geographic long-distance relationship at the shallow level and augment local features. Extensive experiments on multiple benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods. The code is available at https://github.com/shimmer-ghq/DeepFidelity

Abstract:
Efficient Point Cloud Geometry Compression (PCGC) with lower bits per point (BPP) and higher peak signal-to-noise ratio (PSNR) is essential for the transmission of large-scale 3D data. Although octree-based entropy models can reduce BPP without introducing geometry distortion, existing CNN-based models struggle to capture long-range dependencies because of their limited receptive fields, while Transformer-based architectures tend to neglect fine-grained details due to their reliance on global self-attention. This paper presents a Transformer-efficient occupancy prediction Network, termed TopNet, to overcome these challenges by developing several novel components designed to enhance both global context modeling and local structure preservation: Locally-enhanced Context Encoding (LeCE) for improving local structural awareness and enhancing the translation-invariance of the octree nodes, Adaptive-Length Sliding Window Attention (AL-SWA) for capturing both global and local dependencies while adaptively adjusting attention weights based on the input window length, Spatial-Gated-enhanced Channel Mixer (SG-CM) for efficient feature aggregation from ancestors and siblings, and Latent-guided Node Occupancy Predictor (LNOP) for improving prediction accuracy of spatially adjacent octree nodes in local context. Comprehensive experiments across three large-scale outdoor sparse LiDAR datasets, including SemanticKITTI, nuScenes, and LiDAR-CS, as well as two indoor dense human body datasets, including 8iVFB and MVUB, and one indoor dense scenario dataset, ScanNet, demonstrate that our TopNet achieves state-of-the-art compression performance with fewer parameters.
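
To make the object of occupancy prediction concrete, the sketch below computes the 8-bit occupancy code of a single octree node, which is the symbol an octree-based entropy model assigns probabilities to. It is an illustration under assumptions, not part of TopNet: the function name `occupancy_code` and the bit ordering are chosen for this example, and `points` is assumed to be a non-empty array of shape (N, 3).

```python
import numpy as np

def occupancy_code(points, center):
    """Return the 8-bit occupancy code of one octree node.

    Bit k is set when child octant k holds at least one point, with
    k = 4*(x >= cx) + 2*(y >= cy) + (z >= cz). This bit ordering is an
    assumption made for the example.
    """
    points = np.asarray(points, dtype=float)
    above = points >= np.asarray(center, dtype=float)  # boolean, shape (N, 3)
    child_index = above[:, 0] * 4 + above[:, 1] * 2 + above[:, 2]
    code = 0
    for k in np.unique(child_index):
        code |= 1 << int(k)
    return code

# Three points in the unit cube: two fall in octant 0, one in octant 7.
pts = [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [0.2, 0.3, 0.1]]
code = occupancy_code(pts, center=(0.5, 0.5, 0.5))  # bits 0 and 7 set
```

An entropy coder that predicts a probability p for the observed code spends roughly -log2(p) bits on the node, so sharper occupancy predictions lower the total bit count and hence the BPP.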

Abstract:
With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We identify five subtypes based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs. Additionally, it explores the extensive applications of Vid-LLMs in various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding

Abstract:
Video Temporal Grounding (VTG) is a fine-grained video understanding task that aims to ground the relevant video moments corresponding to given language queries. Most existing approaches utilize powerful Vision-Language Models (VLMs), augmented with additional network architectures or specialized modules to supplement temporal reasoning capabilities. Despite achieving impressive performance, these approaches tend to overlook a critical issue: pre-trained visual and textual representations are not specifically optimized for VTG. In particular, such representations often suffer from inter-modal semantic misalignment, i.e., inconsistency between different modalities, and intra-modal semantic confusion, caused by insufficient discriminability within the visual modality. To address these limitations, we propose an efficient semantics refinement framework built upon pre-trained models, featuring two core components. First, the Modal Knowledge Bidirectional Propagation (MKBP) component promotes inter-modal semantic alignment via bidirectional enrichment of textual and visual semantics, exploiting their complementary strengths without introducing additional parameters. Second, the Content Context Contrast Learning (C3L) component alleviates intra-modal semantic confusion within the visual modality by bringing query-specific visual features closer while separating irrelevant ones. Comprehensive experiments on six benchmark datasets demonstrate the superior performance of our proposed method.

Abstract:
Creators of 360° videos use rich non-speech sounds to provide immersive experiences. The sound accessibility of such videos is essential for viewers, especially for d/Deaf and hard-of-hearing (DHH) people. In this paper, we propose AVLLM-360, a multimodal framework using Large Language Models (LLMs) for understanding panorama video content and providing sound descriptions, which goes beyond the simple recognition of sound types. AVLLM-360 integrates both visual and auditory information and bootstraps the cross-modal training from the pre-trained LLM. We also implemented a mixed-media interface that allows users to visualize the generated results hierarchically, enabling personalized customization of sound description generation when watching 360° videos. We conducted extensive experiments to evaluate AVLLM-360’s ability across a range of video understanding tasks. We also conducted qualitative studies with 12 DHH participants, evaluating the effectiveness of AVLLM-360 using 24 360° videos (covering different genres).

Abstract:
RGB-X multimodal vision tasks present a highly promising approach to enhancing model performance in complex visual conditions. Existing multimodal frameworks are based on either the symmetric parallel network of feature fusion or the shared network of input fusion. However, parallel networks suffer from uncontrollable parameters and imbalanced optimization across modal branches, while shared networks often lead to a lack of diversity in gradient optimization. To address these challenges, we propose the LoRA-driven Multimodal Extractor (LoME), following a comprehensive analysis of existing multimodal frameworks. The low-rank properties of modal adapters for LoME ensure controllable growth in model parameters as the number of modalities increases. The dynamic parameter fusion between adapters and the shared feature extractor decouples gradient optimization directions, effectively mitigating imbalances caused by multimodal data biases while preserving complementary features. Moreover, we employ a training strategy based on dynamic rank allocation to reduce computational overhead and enhance modal diversity expression. We validate the effectiveness and generalizability of LoME across three multimodal vision tasks. LoME achieves superior performance compared to previous state-of-the-art methods on multiple datasets. For example, on the DroneVehicle dataset, our method achieves a 10.4% improvement in accuracy compared to the SOTA method, while the parameter overhead is reduced to 23% of the previous network (44.63M). The code has been open-sourced at https://github.com/zyszxhy/LoME
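
The claim that low-rank adapters keep parameter growth controllable as modalities are added can be checked with simple arithmetic on a generic LoRA-style layer. The sketch below is not LoME's implementation: the layer sizes, the rank, the modality names `"rgb"` and `"x"`, and the helper `adapted_weight` are assumptions, and LoME's dynamic parameter fusion and dynamic rank allocation are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4  # assumed layer sizes and adapter rank

# Weight of the shared feature extractor, common to all modalities.
W_shared = rng.standard_normal((d_out, d_in))

# One low-rank pair (B, A) per modality. B starts at zero, as in LoRA,
# so every adapted weight initially equals the shared weight.
adapters = {
    name: (np.zeros((d_out, rank)), 0.01 * rng.standard_normal((rank, d_in)))
    for name in ("rgb", "x")
}

def adapted_weight(modality):
    """Shared weight plus the modality-specific low-rank update B @ A."""
    B, A = adapters[modality]
    return W_shared + B @ A

full_params = d_out * d_in              # cost of a separate full branch
adapter_params = rank * (d_out + d_in)  # cost of one extra modality adapter
```

With these sizes, each additional modality costs 512 adapter parameters instead of the 4096 needed to duplicate the full layer, which is the kind of controlled growth the abstract refers to.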

Abstract:
Dynamic CT reconstruction plays a crucial role in both medical and industrial applications. However, existing 4D CT reconstruction methods typically rely on complex regularization techniques or external large-scale training datasets, posing challenges for reconstruction quality and generalization when handling complex object motion and varied imaging modes. Neural Radiance Fields (NeRF) offer a promising approach to dynamic CT reconstruction, but existing NeRF-based methods often assume that the scene is low-rank, limiting their representation capabilities. To address these issues, we propose NG-NeRF. First, we combine 3D and 4D hash grids for scene representation, effectively reducing temporal redundancy in static regions of dynamic scenes while improving the model’s representation capabilities and efficiency. Next, we design a non-local hash attention module to establish non-local dependencies between the features of different hash grids. This guides the model to adaptively select features based on hash table load information, significantly alleviating hash collisions and achieving the decoupling of dynamic and static regions. Besides, we introduce global continuity by employing mask positional encoding, which helps reduce the noise often introduced by grid features. Our experimental results on medical and industrial datasets demonstrate that the proposed method outperforms existing state-of-the-art methods by 5.84 dB and 3.4 dB, respectively, and exhibits excellent generalization ability across different 4D CT scenarios.

Abstract:
Despite the considerable advancements in cross-domain image translation, a significant challenge remains in addressing information asymmetric translation tasks such as SAR-to-Optical and Sketch-to-Instance conversions. These tasks involve transforming data from a domain with limited information into one with more detailed and richer content. Traditional CNN-based methods, while effective at capturing intricate details, often struggle to grasp the overall structural composition of the image, leading to unintended blending or merging of distinct regions within the generated images. In light of these limitations, research has increasingly turned toward Transformers. Though Transformers excel at capturing global structures, they often lack the ability to preserve fine-grained details. Recognizing the importance of both detailed features and structural relationships in information asymmetric translation tasks, we introduce the CNN-Swin Hybrid Network (CSHNet). This network employs a novel bottleneck architecture featuring two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), which together form the SEC-CES-Bottleneck (SCB). Within this structure, SEC capitalizes on CNN’s capability for detailed feature extraction while incorporating the Swin Transformer’s inherent structural bias. In contrast, CES preserves the Swin Transformer’s strength in maintaining global structural integrity, while compensating for CNN’s tendency to emphasize detail. In addition to the SCB architecture, CSHNet integrates two essential components designed to improve cross-domain information retention and ensure structural consistency. The Interactive Guided Connection (IGC) fosters dynamic information exchange between SEC and CES, encouraging a deeper understanding of image details. At the same time, Adaptive Edge Perception Loss (AEPL) is implemented to preserve well-defined structural boundaries throughout the translation process. 
Experimental evaluations demonstrate that CSHNet surpasses current state-of-the-art methods, achieving superior results in both visualization and performance metrics across scene-level and instance-level datasets. Our code is available at: https://github.com/XduShi/CSHNet

Abstract:
Light propagation underwater is subject to wavelength-dependent attenuation and scattering, which degrades underwater images through color distortion, reduced contrast, and reduced visibility. To handle these degradations, this paper proposes an underwater image enhancement method based on advantage feature weighted fusion, called AFWF. Specifically, we propose a three-channel contrast enhancement strategy that effectively reduces the color distortion of a raw input image via a three-channel adaptive color compensation strategy. Meanwhile, we employ a fast exposure fusion to integrate the image sequences obtained from the multi-scale gamma correction and adaptive contrast enhancement strategies to improve the global contrast of the above-mentioned image. Subsequently, a single-channel contrast enhancement is proposed to improve the local contrast and edge detail information by enhancing the multi-level details of the raw image. Finally, we adopt the advantage feature weighted fusion strategy to analyze and selectively fuse the advantage features of the different enhanced images layer by layer to reconstruct a high-quality result. Extensive experimental results show that our AFWF method is superior to state-of-the-art (SOTA) methods in improving the color, contrast, and detail of raw underwater images. The code is publicly available at: https://www.researchgate.net/publication/393021384_2025-AFWF

Abstract:
Parameter-efficient transfer learning (PETL) has shown great potential in adapting vision transformers (ViTs) pre-trained on large-scale datasets to various downstream tasks. Existing studies primarily focus on minimizing the number of learnable parameters. Although these methods are storage-efficient, they allocate excessive computational resources to easy samples, leading to inefficient inference. To address this issue, we introduce an inference-efficient tuning method termed multiple-exit tuning (MET). MET integrates multiple exits into the pre-trained ViT backbone. Since the predictions of the adapted ViTs are typically made by linear classifiers, each exit is equipped with a linear prediction head. During inference, easy samples leave the network at early exits and only sufficiently hard samples flow to the final exit, thus reducing the computational cost for easy samples. MET consists of exit-specific adapters (E-adapters) and graph regularization. E-adapters are designed to extract suitable representations for different exits. To maintain parameter efficiency, all E-adapters share the same base down-projection and up-projection matrices. As the performance of linear classifiers is influenced by the relationships among samples, we employ graph regularization to improve the representations fed into the classifiers at early exits. We conduct extensive experiments to evaluate the performance of MET. Experimental results show that MET has a clear advantage over the state-of-the-art methods in terms of both accuracy and inference efficiency.
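
The early-exit inference described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the confidence-threshold exit rule, the function name `early_exit_predict`, and the toy blocks and heads are assumptions made only for illustration, since the abstract does not state the exit criterion.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(x, blocks, heads, threshold=0.9):
    """Run blocks in order; return (class, exit_depth) at the first confident exit.

    Assumption: a sample exits when the top softmax probability of the
    exit's linear head reaches `threshold`; the last exit always answers.
    """
    h = x
    last = len(blocks) - 1
    for depth, (block, head) in enumerate(zip(blocks, heads)):
        h = block(h)
        probs = softmax(head(h))
        confidence = max(probs)
        if confidence >= threshold or depth == last:
            return probs.index(confidence), depth

# Hypothetical toy network: each block doubles the features,
# and each head uses the features directly as logits.
double = lambda h: [2.0 * v for v in h]
identity_head = lambda h: list(h)
blocks = [double, double, double]
heads = [identity_head, identity_head, identity_head]

easy = early_exit_predict([2.0, 0.0], blocks, heads)  # confident at the first exit
hard = early_exit_predict([0.2, 0.0], blocks, heads)  # falls through to the final exit
```

In this toy setting the easy sample pays for one block while the hard sample pays for all three, which is the source of the inference savings the abstract describes.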

Abstract:
Unsupervised person re-identification (ReID) aims to automatically capture and match images of the same person across different camera viewpoints without any manual annotations. Current methods primarily generate pseudo-labels by clustering global features and employ contrastive learning strategies for training. Despite the promising advancements made by these approaches, effectively addressing the inherent bias of global features and mitigating the impact of pseudo-label noise remain unresolved issues. To tackle this challenge, we propose a part-based features complementary denoising method (PFCD). Specifically, we design the combined features (CF) module and the partial-features fusion and contrastive scheme (PFCS), which capture fine-grained clues from a local perspective and combine global and local features for clustering with consistent pseudo-label assignment, thereby achieving complementarity between global and local features. Furthermore, to diminish the influence of pseudo-label noise on the model, we design the GMM features denoising (GFD) module, which employs a Gaussian Mixture Model to categorize features within each pseudo-class based on confidence levels and performs denoising on low-confidence features. Lastly, we construct a modular knowledge distillation (MKD) scheme to enhance feature representation capabilities and effectively reduce pseudo-label noise. Extensive experiments on four challenging ReID datasets confirm the effectiveness of our method, which surpasses numerous state-of-the-art methods. Code has been made available at https://github.com/xfltdzzz/PFCS_ReID

Abstract:
Zero-shot captioning aims to generate descriptive captions for unseen image and video data by leveraging the potential of visual language models (VLMs) and language models (LMs) without requiring task-specific training. It has emerged as a critical task, but its performance is often hindered by the inherent gap between the training distribution and unseen test data. The fundamental challenge lies in the model’s strong dependence on the marginal distribution of the training data, which leads to biased predictions when handling test samples. To address this issue, we propose an Energy-aware Reinforcement Feedback Calibration (ERFC) framework to calibrate the distribution and predictions of caption models from a novel energy perspective. The calibration process of ERFC is divided into two key components: 1) We first construct an Energy Stabilizer (ES) based on the caption model, where energy is considered a measure of the affinity between the input sample and the model’s learned distribution. ES iteratively adjusts the embedding features of the input sample using Langevin Dynamics, reducing its energy to implicitly align the model’s distribution with the unseen target domain. 2) We deploy a Reinforcement Calibrator (RC) to refine and calibrate the generated captions through a reward-feedback mechanism. RC leverages the expert CLIP model as a reward signal to assess the quality of the generated captions and employs the policy gradient algorithm to reward or penalize the model, thereby improving its performance. By iteratively combining energy-based optimization and reward-driven calibration, ERFC achieves superior zero-shot generalization capabilities, as demonstrated on image benchmarks such as MSCOCO, Flickr30K, and NoCaps, as well as video benchmarks such as MSR-VTT and MSVD.
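
The Langevin Dynamics step used by the Energy Stabilizer can be sketched as follows. This is only an illustration under stated assumptions: the quadratic energy, the names `energy`, `energy_grad`, and `langevin_refine`, and all hyperparameters are hypothetical stand-ins, whereas the abstract derives the energy from the caption model itself. The update rule is the standard one, x ← x − (step/2)·∇E(x) + sqrt(step)·noise.

```python
import math
import random

def energy(x, mu):
    """Toy quadratic energy: low when x is close to mu (assumed stand-in)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, mu))

def energy_grad(x, mu):
    """Gradient of the toy energy with respect to x."""
    return [a - b for a, b in zip(x, mu)]

def langevin_refine(x, mu, step=0.1, n_steps=50, noise_scale=1.0, seed=0):
    """Iteratively lower the energy of x with (optionally noisy) Langevin updates.

    noise_scale=1.0 gives the standard Langevin update; noise_scale=0.0
    reduces it to plain gradient descent, which is handy for checking.
    """
    rng = random.Random(seed)
    x = list(x)
    for _ in range(n_steps):
        grad = energy_grad(x, mu)
        x = [
            xi - 0.5 * step * gi + noise_scale * math.sqrt(step) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad)
        ]
    return x

# Hypothetical 2-D "embedding" that starts far from the low-energy region.
start = [3.0, -2.0]
center = [0.0, 0.0]
refined = langevin_refine(start, center, noise_scale=0.0)
```

With the noise term disabled, each step shrinks the distance to the low-energy region by a fixed factor, so the energy decreases monotonically; with noise enabled, the iterates instead fluctuate around that region.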

Abstract:
Different from the traditional semi-supervised learning paradigm, which is constrained by the closed-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories not appearing in the labeled set, and aims not only to classify old categories but also to discover new categories in the unlabeled data. Existing studies on GCD are typically devoted to transferring general knowledge from a self-supervised pretrained model to the target GCD task via fine-tuning strategies such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to strike a sound balance between the generalization capacity of the pretrained backbone and the adaptability to the GCD task. To fill this gap, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce adapter tuning into the GCD task and provides some key insights expected to inform future research. Furthermore, considering the discrepancy in supervision information between the old and new classes, a multi-expert adapter structure equipped with a route assignment constraint is devised, such that the data from old and new classes are separated into different expert groups. Extensive experiments are conducted on 7 widely used datasets. The remarkable performance improvements highlight the efficacy of our proposal, which can also be combined with other advanced methods such as SPTNet for further enhancement.

Abstract:
RGB-Thermal multi-object tracking (RGB-T MOT) focuses on tracking multiple objects in complex scenarios, such as nighttime and low-light conditions, which is crucial for various applications, including video surveillance and drone monitoring. Existing RGB-T MOT studies typically merge multi-source features in a single stage before the backbone network. These works fail to preserve fine-grained details and struggle with context-dependent fusion, resulting in suboptimal tracking performance. In this paper, we propose a novel multi-stage cross-modality spatial-temporal feature interaction network (MCTrack), which emphasizes two main aspects: temporal-aware learning and cross-modality information interaction. For temporal-aware learning, we introduce the temporal salient feature interaction (TSFI) module, which ensures trajectory continuity by capturing dynamic spatial variation of objects across RGB-T video pairs, enabling consistent object tracking over time. For cross-modality information interaction, we propose the bidirectional modality interaction (Bi-MI) module, which employs cross-modality transformers to extract complementary features from both RGB and thermal modalities at multiple stages of feature extraction, thereby improving tracking adaptability in diverse tracking scenarios. Additionally, we propose the cross-modality complementary mask (CCM) strategy, which applies non-overlapping random masks to feature maps to improve the robustness of cross-modality feature interaction. Extensive experiments and comparative analyses demonstrate that our MCTrack surpasses state-of-the-art trackers on both publicly available VT-MOT and UniRTL-MOT RGB-T datasets. The code is available at https://github.com/ydhcg-BoBo/RGB-T-MOT

Abstract:
The optimization of block-level quantization parameters (QP) is critical to improving the performance of practical block-based video compression encoders, but the extremely large optimization space makes it challenging to solve. Existing solutions, e.g., the HEVC encoder x265, usually impose simplifying constraints such as the block-independence assumption and a linear distortion propagation model, which limit compression efficiency improvement to a certain extent. To address this problem, a deep learning-based encoder-only adaptive quantization method (DAQ) is proposed in this paper, where a deep network is designed to adaptively model the joint temporal propagation relationship of quantization among blocks. Specifically, DAQ consists of two phases. In the training phase, considering the heavy search cost of the traditional codec, we introduce a well-designed end-to-end learned block-based video compression network as an effective training proxy for the deep encoder-side network. In the deployment phase, the trained deep network is applied to jointly predict all block QPs in a frame for the traditional encoder. Besides, our network is deployed only on the encoder side without changing the standard decoder and has very low inference complexity, making it practical to deploy. Finally, we deploy DAQ in HEVC and VVC encoders for performance comparison, and the experimental results demonstrate that DAQ significantly outperforms the practically used x265 with average BD-rate reductions of 15.0% and 10.9% under SSIM and PSNR, respectively, and also achieves coding gains of 12.5% and 5.0% over VTM. Moreover, for deploying deep video codecs in practice, this work provides a new insight into optimizing encoder parameters over a large search space.

Abstract:
Lifelong person re-identification (LReID) exhibits a contradictory relationship between intra-domain discrimination and inter-domain gaps when learning from continuous data. Intra-domain discrimination focuses on individual nuances (e.g., clothing type and accessories), while inter-domain gaps emphasize domain consistency. Achieving a trade-off between maximizing intra-domain discrimination and minimizing inter-domain gaps is a crucial challenge for improving LReID performance. Most existing methods strive to reduce inter-domain gaps through knowledge distillation to maintain domain consistency. However, they often ignore intra-domain discrimination. To address this challenge, we propose a novel domain consistency representation learning (DCR) model that explores global and attribute-wise representations as a bridge to balance intra-domain discrimination and inter-domain gaps. At the intra-domain level, we explore the complementary relationship between global and attribute-wise representations to improve discrimination among similar identities. Because excessively learning intra-domain discrimination can lead to catastrophic forgetting, we further develop an attribute-oriented anti-forgetting (AF) strategy that explores attribute-wise representations to enhance inter-domain consistency, and propose a knowledge consolidation (KC) strategy to facilitate knowledge transfer. Extensive experiments show that our DCR achieves superior performance compared to state-of-the-art LReID methods. Our code is available at https://github.com/LiuShiBen/DCR

Abstract:
Nighttime flare removal is challenging due to the difficulty of acquiring real-world paired data. Existing methods, trained on synthetic pipelines, often struggle to generalize to real-world scenarios. A key limitation of these pipelines is their focus on single-flare scenes, whereas real-world conditions frequently involve more complex cases, such as multi-flare and composite flare scenarios, which are difficult to simulate effectively. This discrepancy significantly hampers model performance in practical applications. Through detailed analysis, we uncover a fundamental characteristic of flare degradation: regardless of whether the scene is synthetic single-flare, real-world single-flare, or multi-flare, the degradation information exhibits a similar distribution across frequency subbands—predominantly concentrated in the low-frequency region, with a minor presence in the high-frequency region. Notably, the severity of the glare effect correlates with an even stronger concentration in the low-frequency domain. This finding suggests that targeted frequency modeling can bridge the gap between synthetic and real-world domains, forming a principled approach to improving generalization. Building on this insight, we propose the Scale-Aware Frequency-Adaptive Guidance Network for Nighttime Flare Removal (SAFAformer), which integrates a Frequency-Adaptive Guidance Module (FAGM) and a Scale-Aware Transformer Block (SATB) to leverage frequency-domain properties during training. Extensive experiments demonstrate that SAFAformer achieves state-of-the-art performance in flare removal compared to existing methods. Our code and pre-trained models are available on GitHub for validation.

Abstract:
Edge devices face a pressing demand for low-cost object detection networks. However, because of limited computational resources, lightweight detectors often suffer significant performance degradation. In this paper, we propose SFCE-Det, an efficient object detector that achieves strong performance with very few parameters and GFLOPs. The key contribution of our work lies in the novel subfeature fusion and cross-layer perceptual enhancement block (SFCE-Block), which effectively extracts feature information from images at a very low computational cost. SFCE-Block can be seamlessly integrated into existing convolutional neural networks and serves as a plug-and-play component for lightweight upgrades to the network. SFCE-Block can be used not only to upgrade classic models but also to make state-of-the-art models (e.g., YOLOv8) substantially more lightweight. Additionally, we propose a dynamic label assignment strategy that leverages global label correlation to further enhance the performance of SFCE-Det. Experimental results demonstrate that SFCE-Det surpasses many state-of-the-art lightweight object detectors on multiple public datasets while maintaining an extremely low cost. For example, SFCE-Det-D2 achieves an mAP of 83.4% on the PASCAL VOC dataset, comparable to YOLOv8-S, while requiring only 26% of its parameters and 35% of its GFLOPs (2.96M parameters and 9.9 GFLOPs, respectively).

Abstract:
Personalized text-to-image generation aims to learn new concepts from user-provided images and subsequently generate diverse scenes or styles of the concepts from input prompts. Most existing methods usually require a set of images (typically 3-5) for each concept, which can be cumbersome. Although several methods allow personalized generation with a single reference image, they often require heavy model training and suffer from many issues such as domain-specific applicability, insufficient fidelity, and limited editability. To address these problems, we propose a novel one-shot personalized text-to-image generation method called ConceptCraft, which explicitly separates the reference image into object and background regions and treats them as two distinct concepts to learn, significantly improving the personalization performance. Specifically, we incorporate two unique identifiers into the text prompts: one followed by the object’s class name and the other by the word “background”. To bind these two identifiers to the reference image’s object and background respectively, we introduce a mask-aware object preservation loss and a mask-aware background preservation loss to optimize their corresponding token embeddings under well-designed text conditions, enabling both object and background personalization. In addition, we also develop an identifier regularization scheme to enhance our editability, allowing the synthesis of personalized images across a broader range of scenes and styles without changing the identity. Extensive qualitative and quantitative experiments are conducted to verify the effectiveness and superiority of our method.

Abstract:
Correspondence pruning aims to identify inliers from an initial set of correspondences with a low inlier ratio. Current Graph Neural Network (GNN)-based correspondence pruning approaches suffer from feature over-smoothing during information propagation, making it difficult to distinguish inliers from outliers. In addition, Transformer-based methods can model long-range dependencies, but their quadratic complexity limits computational efficiency. To address these issues, we propose MatchMamba, a dual-view correspondence pruning network based on a selective state space model, Mamba. MatchMamba combines the strengths of GNNs and Mamba, enhancing local feature extraction while modeling global context with appropriate complexity. Specifically, to overcome Mamba’s limitations in correspondence pruning, such as the lack of local context and unidirectional modeling, we introduce the Cluster Sampling Spatial Mamba (CSSM) block and the Correspondence Flip Bidirectional Mamba (CFBM) block. The CSSM block captures fine-grained local context through implicit soft assignment and mitigates the over-smoothing of GNNs using Mamba’s selective mechanism. The CFBM block leverages Mamba’s efficient long-sequence modeling by constructing a pseudo-sequential structure through clustering. It applies forward and backward scanning to enable each correspondence to fully capture contextual information from others, achieving global context modeling with appropriate computational cost. Extensive experiments demonstrate that MatchMamba outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Mrwyb/MatchMamba

Abstract:
In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capabilities. Acting as the “brain” of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include the introduction of a small target detection head and the development of a Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. The experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Further time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.

Abstract:
Color correction for drone images, which are captured by uncrewed aerial vehicles, is an essential task in drone-image-based intelligent applications. Different from existing methods, in this paper, we propose an effective error compensation-based fusion algorithm for drone-image color correction. As our first contribution, an image feature matching method is performed on all image pairs to determine their matched feature point pairs and the overlapping areas. To correct color for each target pixel in the overlapping area, we propose an error compensation-based fusion method. The proposed fusion method combines joint bilateral interpolation (JBI), which works on the color differences of the matched feature point pairs in the overlapping area, and histogram equalization (HE), which works on all source and target pixels in the overlapping area, such that the overall errors caused by JBI and HE can be minimized. As our second contribution, to better correct color for each target pixel in the non-overlapping area, a boundary reference interval-based fusion method is proposed by using the color differences on the boundary and the color-corrected target sub-region in the overlapping area. Comprehensive experiments have been carried out on seven challenging datasets. Across a comprehensive set of quality metrics, the experimental results demonstrate substantial quantitative and qualitative improvements of our algorithm over state-of-the-art methods. The source code of our algorithm is available at https://github.com/ivpml84079/EC-Based-Color-Correction.git

Affiliations: School of Future Technology, South China University of Technology, Guangzhou, China; School of Computer Science and the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN) and the Key Laboratory of Intelligent Interaction and Applications, Ministry of Industry and Information Technology, Northwestern Polytechnical University, Xi’an, China; School of Software Engineering, South China University of Technology, Guangzhou, China; College of Information Engineering and Shaanxi Engineering Research Center for Intelligent Perception and Analysis of Agricultural Information, Northwest A&F University, Xianyang, Shaanxi, China; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), and the School of Computer Science, and the Key Laboratory of Intelligent Interaction and Applications, Ministry of Industry and Information Technology, Northwestern Polytechnical University, Xi’an, China

Abstract:
Multi-view Graph Clustering (MGC) is a crucial approach for uncovering complex data structures by leveraging multiple perspectives of data. However, existing MGC methods face two key challenges: (1) limitations in graph structure that neglect long-range dependencies, and (2) overlooking the view-cluster local structure when mining view discrepancies. To address these issues, we propose a Multi-view Graph Clustering approach based on Dual View-Cluster-Order Interactivity (DVCOI-MGC). This approach consists of three modules: (1) Multi-View Multi-Order Graph Construction, where high-order graphs are generated using matrix exponentiation to capture long-range dependencies; (2) Dual View-Cluster-Order Interactivity, which utilizes a discrete graph cut model to separately learn order-specific and view-specific clustering results from the sets of order-specific multi-view graphs and view-specific multi-order graphs, with a separate View-Cluster-Order tensor weight for each learning direction; and (3) Bidirectional Truncation Consistency Learning, which applies a sparse boolean weight vector to locally select and integrate clustering results while preserving both the view-cluster and order-cluster local structures. Additionally, we introduce an efficient iterative optimization method to solve the discrete graph cut problem and provide a theoretical analysis of its convergence and computational complexity. Extensive experiments on 8 real-world datasets demonstrate that our approach significantly improves clustering performance over 11 state-of-the-art methods.
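
The multi-order graph construction step above can be illustrated with a short Python sketch. It assumes that "matrix exponentiation" means raising a view's adjacency matrix to integer powers, so that entry (i, j) of the k-th power counts walks of length k between nodes i and j; the helper names `matmul` and `matrix_power` and the path graph are hypothetical, and any normalization the authors apply is omitted.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def matrix_power(a, order):
    """Return a raised to a positive integer power by repeated multiplication."""
    n = len(a)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        result = matmul(result, a)
    return result

# Hypothetical single-view graph: a path 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

second_order = matrix_power(adjacency, 2)  # links nodes two hops apart
third_order = matrix_power(adjacency, 3)   # links nodes three hops apart
```

Nodes 0 and 2 are not adjacent in the first-order graph but become connected in the second-order one, and nodes 0 and 3 only become connected at the third order, which is how higher orders expose long-range dependencies.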

Abstract:
The rendering degradations produced by Neural Radiance Fields (NeRF) are a long-standing and complex issue in the field of 3D implicit representation; they arise from a multitude of intricate causes and have not been entirely solved by previously designed complicated scene parameterization methods. In this paper, we present a diffusion-based restoration method for improving Neural Radiance Fields (Drim-NeRF). We consider the NeRF enhancement issue from a low-level restoration perspective by viewing all types of rendering artifacts as a specific degradation model added to clean ground truths. By leveraging the powerful prior knowledge encapsulated in diffusion models, we restore highly realistic improved renderings conditioned on their raw low-quality counterparts. To further ensure multi-view consistent rendering enhancement, we propose to adopt optical flow warping to reduce temporal inconsistency and employ feature-wrapping in the VAE decoder to improve fidelity. Our proposed method is easy to implement and agnostic to various NeRF backbones. We conduct extensive experiments on challenging large-scale urban scenes and unbounded 360-degree scenes, as well as other baselines and datasets, and achieve substantial qualitative and quantitative improvements in both restoration quality and multi-view consistency.

Abstract:
Future fixation sequence prediction plays a crucial role in various aspects of virtual reality content production, transmission, rendering, and display. Accurate prediction of future fixation sequences can significantly enhance the quality of user experience, particularly in resource-constrained scenarios. In this paper, we present a novel framework for predicting future fixation sequences that achieves state-of-the-art performance. Specifically, an anti-projection-distortion FoV patch extraction algorithm is proposed to mitigate projection distortions. A comprehensive contextual representation is then constructed by integrating multiple data sources, including visual and audio information, the historical fixation sequence, user identity, timestamp, and positional embeddings. A transformer-based predictor is proposed to perform the future fixation sequence prediction based on the integrated contextual representations. Additionally, we propose a framework that effectively utilizes saliency information as supervision and conducts saliency contrastive distillation during the training phase, eliminating the need for saliency data during inference. Overall, by integrating anti-projection-distortion and multimodal representations, along with key embeddings, a dedicated predictor, and contrastive distillation, our approach is designed to accurately predict future fixation sequences. Extensive experiments validate the effectiveness of our framework, demonstrating its superior performance in fixation prediction tasks.

Abstract:
Evaluation metrics are essential tools for quantifying the performance of crowd localization models. GAME ignores substantial localization information and is rarely used for crowd localization evaluation. The commonly used localization metrics Precision, Recall, and F-score rely on a greedy algorithm, which can lead to ambiguous matching. Moreover, the distance threshold for the boolean matching matrix lacks both scale sensitivity and universality. To overcome the limitations of matching accuracy, scale invariance, and universality in crowd localization evaluation, a novel metric termed Scale-aware Optimal Transportation Cost (S-OTC) is proposed. To ensure globally optimal matching, it leverages optimal transport theory to compute the cost of transporting weights from ground truth points to predicted points, which serves as a measure of accuracy. First, S-OTC models predicted and ground truth points as two weighted discrete measures, thereby establishing a stable and consistent optimal transportation plan. Second, an adaptive scale-sensitive method is proposed, which produces a scale prior with only point annotations and refines the measure of ground truth to adapt to varying head region scales. This not only ensures insensitivity to head size but also guarantees universality across both point-annotated and box-annotated datasets. Third, a straightforward but effective normalization technique is presented to refine the cost matrix in the transportation plan, ensuring invariance to changes in image resolution. In addition, a new dataset for validating evaluation metrics is constructed, containing 1,800 pairs of perturbed images annotated with human preference choices. This dataset can be used to evaluate the sensitivity of evaluation metrics to count errors and spatial deviations. Extensive experiments demonstrate that S-OTC outperforms existing metrics in terms of stability, sensitivity, and universality. This indicates that S-OTC can serve as the new standard for crowd localization evaluation.
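
To illustrate why greedy matching can be ambiguous while a globally optimal plan is not, the following is a minimal sketch and not the S-OTC metric itself. It assumes equal numbers of ground truth and predicted points with uniform weights, in which case optimal transport reduces to an assignment problem, and it omits the scale prior and the cost normalization described in the abstract. All function names and the toy coordinates are hypothetical.

```python
import math
from itertools import permutations


def pairwise_cost(gt_points, pred_points):
    """Euclidean distance between every ground truth point and every prediction."""
    return [[math.dist(g, p) for p in pred_points] for g in gt_points]


def greedy_matching_cost(cost):
    """Each ground truth point, in order, takes its nearest unused prediction."""
    used = set()
    total = 0.0
    for row in cost:
        j = min((j for j in range(len(row)) if j not in used), key=lambda j: row[j])
        used.add(j)
        total += row[j]
    return total


def optimal_matching_cost(cost):
    """Brute-force the one-to-one matching with the lowest total cost.

    Only practical for tiny toy inputs; it stands in for an optimal transport solver.
    """
    n = len(cost)
    return min(
        sum(cost[i][perm[i]] for i in range(n)) for perm in permutations(range(n))
    )


# Toy example (assumed values): the first ground truth point "steals" the
# prediction that the second ground truth point needs.
gt = [(0.0, 0.0), (2.0, 0.0)]
pred = [(1.1, 0.0), (-3.0, 0.0)]
cost = pairwise_cost(gt, pred)
greedy_total = greedy_matching_cost(cost)
optimal_total = optimal_matching_cost(cost)
print(greedy_total, optimal_total)
```

In this toy case the greedy total depends on the order in which ground truth points are visited, whereas the optimal total does not, which is the property the abstract attributes to a transport-based formulation.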

Abstract:
Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels for both base and novel classes in real-world scenarios. However, existing methods depend on additional samples to support the prediction on the query sample, which are labor-intensive to collect and annotate. Some methods further rely on additional learning stages to adapt to novel classes, limiting their practicality in dynamic scenarios. In addition, the intra-class distribution shifts across samples introduce biased class representations (prototypes), resulting in sub-optimal predictions. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Instead of relying on additional samples, HOW-Seg constructs class prototypes in the query sample feature space based on sparse point-level annotations, thereby avoiding cross-sample distribution shifts. Considering the lack of granularity of initial prototypes, we introduce an interactive prototype disambiguation mechanism to refine ambiguous prototypes. To further enrich contextual awareness, we propose a prototype label assignment module, which employs a dense conditional random field (CRF) upon the prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-class-one-click), HOW-Seg surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming other alternatives. The source code will be publicly available at https://github.com/Pengz98/HOW-Seg

Abstract:
Out-of-distribution (OoD) semantic segmentation aims to recognize pixels of classes undefined in the training dataset. Existing methods mostly focus on training the model to fit real OoD data samples to identify OoD pixels, which requires extra data collection and annotation efforts. By contrast, synthesizing OoD data with training data provides a more resource-efficient alternative. However, synthetic data generated from controlled settings lacks diversity, causing the model to suffer from overfitting. To this end, we propose a disentangled representation learning (DRL) method to guide the model to disentangle semantic-related and semantic-unrelated features from synthetic OoD data. DRL encourages the model to utilize the former to identify semantic categories, rather than overfitting to such semantic-unrelated features as synthetic artificiality. Specifically, DRL first incorporates two disentanglers to extract the semantic-related and -unrelated features and then applies a shuffle and reconstruction mechanism to regularize the disentangled features. Furthermore, to facilitate disentangling, we propose a pixel-wise feature similarity calibration (PSC) module, which utilizes more accurate ID-OoD similarity to calibrate inaccurate ID-OoD similarity learned exclusively from ID data. Thus, PSC delivers accurate and stable pixel-wise features for effective disentangling. Extensive experiments illustrate that the proposed method exhibits strong generalization ability. It attains 74.04% AuPRC and 20.82% FPR on Road Anomaly, 69.85% AuPRC and 5.78% FPR on Fishyscapes LostAndFound Validation Set, using SegFormer with the MiT-B5 backbone. Source code is available at https://github.com/WanMotion/DisentangledOoDSeg

Abstract:
Generating high-quality facial photos from fine-detailed sketches is a long-standing research topic that remains unsolved. The scarcity of large-scale paired data due to the cost of acquiring hand-drawn sketches poses a major challenge. Existing methods either lose identity information with oversimplified representations, or rely on costly inversion and strict alignment when using StyleGAN-based priors, limiting their practical applicability. Our primary finding in this work is that the discrete codebook and decoder trained through self-reconstruction in the photo domain can learn rich priors, helping to reduce ambiguity in cross-domain mapping even with current small-scale paired datasets. Based on this, a cross-domain mapping network can be directly constructed. However, empirical findings indicate that using the discrete codebook for cross-domain mapping often results in unrealistic textures and distorted spatial layouts. Therefore, we propose a Hierarchical Adaptive Texture-Spatial Correction (HATSC) module to correct the flaws in texture and spatial layouts. Besides, we introduce a Saliency-based Key Details Enhancement (SKDE) module to further enhance the synthesis quality. Overall, we present a “reconstruct-cross-enhance” pipeline for synthesizing facial photos from fine-detailed sketches. Experiments demonstrate that our method generates high-quality facial photos and significantly outperforms previous approaches across a wide range of challenging benchmarks. The code is publicly available at: https://github.com/Gardenia-chen/DECP

Abstract:
Hyperspectral band selection seeks to identify a compact subset of informative spectral channels that preserves task-relevant information while mitigating the storage, transmission, and computational burdens imposed by high-dimensional data. Yet prevailing techniques face two pervasive limitations: (i) scoring- or ranking-based methods assess bands independently, overlooking the joint dependencies that determine their true utility; and (ii) combinatorial search approaches, though theoretically exhaustive, require prohibitive enumeration that is incompatible with the scale and end-to-end nature of modern deep-learning pipelines. We recast band selection as a combinatorial inference problem and propose a task-agnostic framework that embeds a learnable Band Selection Layer equipped with an Expectation–Maximization–driven Sparsity Loss. The E-step efficiently enumerates the expected likelihood of all k-out-of-B band subsets via dynamic programming, thereby making implicit dependencies explicit; the M-step optimises band importances toward a provably k-sparse solution without post-hoc thresholding. Comprehensive theoretical analysis proves the absence of spurious local maxima and guarantees convergence to an exact sparse optimum. Extensive experiments on three public benchmarks (KSC, HT2013, HT2018), two auxiliary tasks (anomaly and target detection), and six classifiers demonstrate that the proposed method consistently surpasses state-of-the-art baselines. The results confirm that EM-guided sparsification not only stabilises the sparsity pattern but also yields interpretable inter-band dependency structures, making the framework a robust and broadly applicable tool for hyperspectral analysis and other sparsity-oriented vision problems.
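
The abstract does not state the recursion used in the E-step, so the following is only a sketch of one standard way to sum over all k-out-of-B subsets without enumerating them: the dynamic program for elementary symmetric polynomials. It assumes each band has a positive weight and that a subset's score is the product of its weights. The function names and example weights are hypothetical.

```python
from itertools import combinations
from math import prod


def k_subset_sum_dp(weights, k):
    """Sum of products over all k-element subsets in O(B * k) time.

    e[j] holds the sum, over all j-element subsets of the bands processed so
    far, of the product of their weights.
    """
    e = [1.0] + [0.0] * k
    for w in weights:
        # Iterate j downwards so each band is used at most once per subset.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e[k]


def k_subset_sum_bruteforce(weights, k):
    """Reference implementation that enumerates every subset explicitly."""
    return sum(prod(c) for c in combinations(weights, k))


band_weights = [0.5, 2.0, 1.5, 3.0]  # assumed toy importances for B = 4 bands
dp_value = k_subset_sum_dp(band_weights, 2)
brute_value = k_subset_sum_bruteforce(band_weights, 2)
print(dp_value, brute_value)
```

The brute-force version touches every one of the C(B, k) subsets, while the dynamic program reaches the same total with a table of k + 1 entries, which is the kind of saving that makes subset-level reasoning compatible with end-to-end training.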

Abstract:
Monocular 3D object detection is challenging due to the lack of accurate depth. Existing depth-assisted solutions still exhibit inferior performance, a shortfall that is commonly attributed to the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and identify an additional issue: the limited 3D structure-aware capability of existing depth representations (e.g., depth one-hot encoding or depth distribution). To address this issue, we introduce a novel Depth Thickness Field approach to embed clear 3D structures of the scenes. Specifically, we present MonoDTF, a scene-to-instance depth-adapted network comprising a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former retargets traditional depth representations to the proposed depth thickness field, incorporating the scene-level perception of 3D structures. The latter refines the voxel space with the guidance of instances, enhancing the 3D instance-aware capability of the depth thickness field and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate the superiority of our method over existing state-of-the-art (SoTA) methods and its universality when equipped with different depth estimation models. The source codes are available at https://github.com/QiuDeZhang/MonoDTF.

Abstract:
We address the “long-range ambiguity” problem for unsupervised non-rigid point cloud correspondence, where corresponding points have inconsistent features while different local regions are spatially or geometrically similar. Previous methods struggle with this problem, since local reference frames (LRF) or coordinate-based methods fail to exclude locally similar or spatially near mismatches, and widely used independent geometric relations might be inconsistent under non-rigid deformation, introducing extra ambiguity. To this end, we propose a novel robust context modeling module (RCM) to alleviate long-range ambiguity in two aspects: 1) RCM tackles the ambiguity problem by introducing inter-relation attention (IRA), which mines robust cues from the interplay between relative geometric relations. 2) RCM enhances features with accessible long-range information from IRAs, following a local-to-global manner. Our method shows significant improvements in multiple benchmarks, with accurate correspondence under rotation and large deformation perturbations. Specifically, our method achieves a new state-of-the-art performance with correspondence accuracy of 33.9% and mean error of 4.2 on the SURREAL benchmark.

Abstract:
Fine-grained visual classification remains challenging due to subtle inter-class differences and significant intra-class variations. We solve this problem from the representation space perspective and propose Contrastive Decoupled Regularization (CoDeR), a module-level regularization method that guides representation learning using class prototypes as anchors without introducing any learnable parameters, steering hierarchical representations toward more discriminative directions. Specifically, for a target module B_i, we maintain an independent cache that collects the module’s outputs and corresponding class labels during each training epoch. At the end of each epoch, features are aggregated by class and mapped onto a hypersphere to compute cluster centers as class prototypes. In the next epoch, these prototypes guide updates in module B_i, pulling representations toward their ground-truth class prototypes and pushing them away from others. This strengthens inter-class separation in the representation space and directly addresses the core challenge of fine-grained recognition. Furthermore, we apply CoDeR in parallel across multiple modules to accelerate information propagation to earlier layers. This enables shallow layers to learn semantically meaningful representations earlier in training and mitigates the delayed representation update problem. Overall, CoDeR provides a simple, general, and effective supervised regularization mechanism that demonstrates the value of imposing constraints on high-dimensional representations. We conduct extensive experiments on ImageNet, six fundus medical imaging datasets, and two standard semi-supervised learning benchmarks. Consistent improvements across all settings validate the effectiveness and cross-domain applicability of our method.
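
The prototype step described above (aggregate cached features by class, map them onto a hypersphere, take cluster centers) can be sketched as follows. This is an assumed reading rather than the CoDeR implementation: features are L2-normalized, averaged per class, and the mean is renormalized onto the unit sphere. The contrastive pull/push loss is omitted, and the function names and toy features are hypothetical.

```python
import math


def l2_normalize(vector):
    """Project a vector onto the unit hypersphere (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def class_prototypes(features, labels):
    """Per-class mean direction of normalized features, renormalized to unit length."""
    sums = {}
    for feature, label in zip(features, labels):
        unit = l2_normalize(feature)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
        sums[label] = [a + b for a, b in zip(sums[label], unit)]
    return {label: l2_normalize(total) for label, total in sums.items()}


def nearest_prototype(feature, prototypes):
    """Label whose prototype has the highest cosine similarity with the feature."""
    unit = l2_normalize(feature)
    return max(
        prototypes,
        key=lambda label: sum(a * b for a, b in zip(unit, prototypes[label])),
    )


# Toy cache of module outputs collected during one epoch (assumed values).
cached_features = [[2.0, 0.0], [1.0, 0.1], [0.0, 3.0], [0.2, 1.0]]
cached_labels = [0, 0, 1, 1]
prototypes = class_prototypes(cached_features, cached_labels)
print(prototypes)
```

Because the prototypes are computed from cached outputs rather than learned, this step adds no trainable parameters, which matches the property claimed in the abstract.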

Abstract:
Zero-shot temporal action localization (ZSTAL) aims to localize and recognize action categories unseen during training. However, it assumes that test videos contain only unseen classes, which is unrealistic in practice where seen and unseen actions naturally co-exist. To bridge this gap, we introduce generalized ZSTAL (GZS-TAL), where models trained only on seen classes must handle both seen and unseen ones during testing. This setting highlights a critical challenge: a static, frozen model cannot adapt to the mixed distributions encountered at test time. To address this issue, we propose a Temporal-Sensitive Adaptation (TSA) module that equips TAL models with the ability to update themselves during testing. The key intuition is to use temporal dependency prediction as a self-supervised signal: TSA introduces an online-updatable memory optimized to reconstruct features of preceding segments from the current one, thereby embedding temporal dependencies into parameters and reusing them for adaptation at test time. To further enhance temporal modeling, we extend TSA into a Bi-directional TSA (Bi-TSA) mechanism that performs prediction in both forward and backward directions. By simultaneously exploiting historical and future contexts, Bi-TSA improves long-range temporal representation and yields more accurate boundary localization. Extensive experiments on THUMOS14 and ActivityNet-1.3 demonstrate that our approach achieves significant improvements over state-of-the-art methods under the GZS-TAL setting, validating its effectiveness and generalization ability.

Abstract:
Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to their unique optical properties, the depth information on these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. Therefore, it is crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually generate structure-less or ambiguous depth predictions. To address these issues, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from RGB-D input, we back-project the input depth to a point cloud and build a 3D branch to extract hierarchical scene-level 3D structural features. To exploit 3D geometric information, we design several gated cross-modal fusion modules to effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks. The code will be available at: https://github.com/lyz3356/GAA-TSO

Abstract:
Hierarchical Variational Autoencoder (HVAE)-based Learned Image Compression (LIC) has shown great promise, but its performance still lags behind autoregressive models due to three key limitations: 1) shared latent mappings that lead to accumulated posterior collapse; 2) reliance on static convolutions, limiting adaptability; and 3) gradient imbalance during variable-rate optimization, causing unbalanced performance across different bit rates. To overcome these challenges, we propose QARV++, an improved HVAE-based LIC method. First, we introduce a disentangled latent mapping mechanism, assigning separate transformations to each latent variable to prevent posterior collapse propagation. Second, we integrate deformable convolutions into the network, introducing the DCNNeXt block, which enables dynamic feature adaptation while maintaining computational efficiency. Third, we reformulate variable-rate optimization to ensure balanced gradient updates across different \lambda values, stabilizing variable-rate training. Extensive experiments demonstrate that QARV++ achieves superior rate-distortion (R-D) performance among HVAE-based LIC models, exhibiting -12.20%, -16.34%, and -15.23% BD-Rate against VVC Intra mode on the Kodak, Tecnick, and CLIC2020 test datasets, respectively. Our approach also generalizes effectively to existing LICs, delivering substantial improvements.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time, high-fidelity novel view synthesis, yet their substantial storage cost remains a major barrier to practical deployment. Although several compression techniques have been explored, they share a common limitation: each existing 3DGS requires per-scene optimization to achieve compression, making the compression slow and inefficient. In this work, we present Fast Compression of 3D Gaussian Splatting (FCGS), an optimization-free approach that compresses existing 3DGS in a single feed-forward pass, reducing compression time from minutes to seconds. To enhance compression efficiency, we design a multi-path entropy module that routes Gaussian attributes through separate entropy-constrained paths, achieving a better trade-off between size and fidelity. Furthermore, we introduce both inter- and intra-Gaussian context models to effectively remove redundancies for the unstructured Gaussian representation. Experimental results show that FCGS achieves over 20× compression while maintaining high fidelity, outperforming most State-of-The-Art (SoTA) per-scene optimization-based methods. Beyond static scenes, we further extend FCGS to a streamable setting which eliminates redundant temporal information, demonstrating its strong potential for compressing streamable 3DGS data.

Abstract:
We propose a novel approach for human novel view synthesis from image pairs, achieving high-quality cross-scene rendering. Existing methods struggle with occlusions, often suffering from excessive invalid sampling and sensitivity to depth estimation errors. To address these challenges, we propose GoRF, an innovative framework built upon Neural Radiance Fields (NeRF). Specifically, we develop a geometry-guided occlusion-aware mechanism that implicitly models cross-view geometric projection discrepancies and dynamically adjusts multi-view feature blending weights, mitigating occlusion ambiguity between input views. Furthermore, to enhance detail reconstruction, we propose a hybrid sampling strategy that integrates surface-guided and global sampling, effectively compensating for inaccuracies in depth estimation. By combining these strategies, our method enables occlusion-aware, high-fidelity human novel view synthesis. Extensive experiments on diverse human datasets, including THuman2.0, THuman-Sit, and DNA-Rendering, demonstrate that our approach outperforms state-of-the-art methods in both cross-pose and cross-identity scenarios.

Abstract:
Fashion image editing has garnered significant attention due to its growing demand in e-commerce, social media, and virtual try-on applications. However, existing methods are typically designed for specific editing tasks in isolation, lacking a unified framework capable of handling diverse editing requirements. This work addresses this limitation from two critical perspectives. First, we construct InstructFashion, a large-scale, high-quality dataset specifically curated for instruction-guided fashion image editing. It is generated through carefully designed pipelines that cover four distinct editing tasks. Second, we propose TailorEdit, an adaptive framework for instruction-guided fashion image editing. It integrates human segmentation map-based denoising guidance, modular LoRA-based editing experts, and a dynamic expert routing mechanism to enable precise and semantically coherent modifications. Extensive quantitative and qualitative evaluations demonstrate that TailorEdit consistently outperforms state-of-the-art methods in terms of realism, coherence, and instruction adherence. Our code is available at https://github.com/EndaJude/TailorEdit

Abstract:
Multi-source few-shot domain adaptation (MFDA) is a more common and challenging scenario, as only limited annotated source domain data is provided and a large amount of data is unlabeled. Conventional solutions are difficult to apply to large-scale vision-language models (VLMs), since multiple source domains need to be aligned and more parameters need to be tuned. To efficiently transfer VLMs to the target domain in MFDA, this study first summarizes previous prompt tuning methods for domain adaptation as a transductive prompt learning (TPL) paradigm. Then, it introduces a new inductive and transductive prompt learning (I&TPL) paradigm for MFDA. Based on the I&TPL paradigm, a test-time domain-agnostic meta-prompt learning (TDMP) method is further proposed, which is suitable for few-shot annotated multi-source domain data and is compatible with existing prompt tuning methods. As a result, the proposed TDMP does not require multiple complex prompts, constructed source-target pairs, extra auxiliary losses, or pseudo-target labels. Specifically, the proposed TDMP includes domain-agnostic meta-prompt learning and test-time domain-agnostic prompt tuning for target domain adaptation. The first stage is mainly optimized based on the Reptile optimization algorithm. Domain mixup is used to expand the diversity and the number of meta-training tasks. In the second stage, the learned domain-agnostic meta-prompt initializes the test-time prompt to further adapt to the target domain. Extensive experiments are conducted on the OfficeHome, DomainNet, Office, and TerraIncognita datasets of MFDA, achieving better performance with fewer learnable parameters and demonstrating the effectiveness of TDMP.
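
As a reminder of how the Reptile update mentioned above works, here is a toy sketch and not the TDMP training procedure. It assumes the meta-parameter is a single scalar standing in for a prompt, that each source domain is a quadratic task 0.5 * (phi - target)^2, and it leaves out domain mixup and test-time tuning. All names, learning rates, and targets are hypothetical.

```python
def inner_sgd(theta, target, lr=0.1, steps=5):
    """Adapt a copy of the meta-parameter to one task with a few SGD steps.

    The task loss is 0.5 * (phi - target) ** 2, whose gradient is (phi - target).
    """
    phi = theta
    for _ in range(steps):
        phi -= lr * (phi - target)
    return phi


def reptile(theta, task_targets, meta_lr=0.5, epochs=100):
    """Reptile outer loop: move theta toward each task-adapted solution."""
    for _ in range(epochs):
        for target in task_targets:
            phi = inner_sgd(theta, target)
            # Reptile update: interpolate between the old meta-parameter
            # and the parameters obtained after inner-loop adaptation.
            theta += meta_lr * (phi - theta)
    return theta


source_domain_targets = [1.0, 3.0]  # assumed toy optima of two source domains
meta_prompt = reptile(theta=10.0, task_targets=source_domain_targets)
print(meta_prompt)
```

In this toy setting the learned value settles between the two task optima, so a few further gradient steps can adapt it to either task; this is the role the abstract assigns to the meta-prompt as an initialization for test-time prompt tuning.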

Abstract:
Low-rank tensor decomposition (LRTD) has demonstrated significant efficacy in multidimensional image reconstruction. Indeed, LRTD driven by nonlinear relationships can capture the underlying low-rank structure more accurately, since real-world data often exhibits complex nonlinear interactions. However, existing nonlinear LRTD methods do not investigate the inherent nonlinear interactions in spatial neighborhoods and spectral or temporal modes. To address these challenges, we propose a novel deep nonlinear low-rank tensor decomposition (DNLRTD). Specifically, we design a deep nonlinear transform network (DNTN) using multiple convolutional layers and channel attention modules to form a deep nonlinear transform (DNT). The custom-designed DNT effectively captures nonlinear interactions within spatial neighborhoods while paying attention to the nonlinear interactions of spectral or temporal dimensions, consequently achieving a lower-rank representation. By integrating DNT into the low-tubal-rank decomposition framework, we induce the deep tubal-rank and form the DNLRTD. Also, we design a customized DNLRTD optimization strategy to make it flexible for different multidimensional image reconstruction tasks. Based on DNLRTD, we construct two multidimensional image reconstruction models and develop corresponding algorithms based on the alternating direction method of multipliers (ADMM) to solve them. Extensive experimental results on spectral compressive imaging and dynamic magnetic resonance image (MRI) reconstruction verify the superior performance of the proposed method.

Abstract:
In industrial bin-picking, robotic systems must estimate the poses of multiple object instances, where accurate pose estimation is essential for reliable downstream manipulation and grasping. Most existing multi-instance registration methods primarily establish point correspondences based on local features to alleviate the challenges posed by occlusion and clutter. However, local features are easily disturbed by neighboring instances and lack global context, leading to unreliable correspondences and degraded registration accuracy. In addition, the absence of rotational invariance further reduces correspondence accuracy in scenes with stacked instances and highly varying object orientations. To address these challenges, we present a one-stage multi-instance point cloud registration framework for stacked-object scenes. Our framework incorporates a rotation-invariant operator to enhance the robustness of feature representations under arbitrary orientations. Then, we propose a Center-Aware Res-Masked Transformer module, which incorporates an object center embedding to enrich global instance-level context and a center-aware residual mask prediction module to balance weight distribution across objects of varying sizes during training. Extensive experiments on the challenging ROBI dataset demonstrate that our method outperforms the competitive baseline MIRETR by more than 10% in mean precision, highlighting its effectiveness in complex bin-picking scenes. Furthermore, evaluations on the unstacked Scan2CAD dataset confirm the generalizability of the proposed framework across different application scenarios.

Abstract:
Adverse weather conditions can significantly degrade image quality and impair the capture of critical information. Existing restoration networks struggle to effectively combine local, regional, and global features, thereby limiting their ability to handle diverse impacts of such weather. This study proposes the local-region-global transformer (LRGFormer), a transformer-based image restoration model for multiscale feature perception. The model comprises a basic module composed of multi-scale fusion attention (MSFSA) and a channel-spatial dual-attention feed-forward network (CSDF). First, this study designs an MSFSA module. For the first time, it combines rotation-equivariant convolution with local attention for local information extraction and introduces a frequency-domain adaptive attention mechanism. By incorporating a query-aware global adaptive sparse attention mechanism for global information extraction, the network gradually fuses along the channel dimension, enabling progressive capture of spatial and frequency-domain information from the local and regional to the global scale. Second, a CSDF network structure is designed to enhance channel-spatial interaction and improve the representational capacity of the model. By constructing a basic U-Net framework, the leading basic modules for image restoration proposed in recent years are compared on a unified framework. Experimental results demonstrate that the proposed basic module not only better extracts multi-scale image features and restores image distortion caused by various degradation factors, but also exhibits good universality and generalization.

Abstract:
Point cloud rigid registration is a fundamental problem in robotics, 3D reconstruction, and augmented reality. However, existing methods predominantly rely on local geometric neighborhoods, which fail to capture higher-order semantic structures and thus degrade performance under noisy or complex geometry conditions. To address these limitations, we propose SCAP, a new point cloud registration paradigm that transforms feature interaction from geometry-driven to semantic-geometric co-driven. Specifically, a semantic prototype extractor is devised to abstract high-level semantic prototypes through graph embedding and clustering, thereby mitigating sensitivity to local feature noise. Since semantic abstraction alone cannot guarantee consistent correspondences across point clouds, SCAP performs prototype alignment path learning to infer reliable semantic mappings through optimal transport. To enhance cross-layer feature integration and prevent redundant attention, an alignment-driven cross-layer transformer is proposed to incorporate the learned priors into the attention mechanism, thereby enabling feature aggregation with improved semantic coherence and local precision. Extensive experiments on ModelNet, ModelLoNet, 3DMatch, and 3DLoMatch demonstrate that SCAP consistently surpasses state-of-the-art approaches, showing superior robustness and generalization in challenging scenarios with noise and partial overlap. The code will be available at https://github.com/Zhou-111jy/SCAP.git

Abstract:
Video super-resolution (VSR) aims to reconstruct high-resolution (HR) videos from low-resolution (LR) inputs by utilizing spatio-temporal correlations across consecutive LR frames. While recent advances in deep learning, particularly transformer-based architectures, have substantially improved VSR performance, maintaining spatio-temporal coherence remains a critical challenge. To address this issue, we propose a novel U-Net-based spatial transformer module (USTM) that can be seamlessly integrated into the reconstruction stage of existing VSR frameworks. The proposed USTM combines a spatial transformer with a U-Net structure to extract fine-grained spatio-temporal features that enhance the reconstruction of complex motions and texture patterns. Extensive ablation studies were conducted to identify the optimal configuration and verify the contribution of each USTM component. For performance evaluation, USTM was incorporated into multiple representative VSR methods. Experimental results demonstrate that integrating USTM consistently enhances PSNR and SSIM scores on benchmark datasets, including REDS, Vimeo-90K, and Vid4. Furthermore, visual comparisons highlight the superiority of the proposed method, particularly in high-frequency regions, compared to baseline VSR methods without USTM.

Abstract:
Fine-grained visual recognition refers to the ability to distinguish subtle differences between visually similar objects, a fundamental yet challenging capability for Multimodal Large Language Models (MLLMs). In this paper, we observe that even strong open-source MLLMs, such as Qwen2-VL and InternVL2, still struggle with accurately identifying fine-grained categories. These models often fail to attend to subtle but critical details for precise discrimination. To unlock this potential, we propose FineG-RAG, a retrieval-augmented generation pipeline designed to enhance the fine-grained recognition capabilities of MLLMs. FineG-RAG integrates external fine-grained knowledge into the recognition process via a generalized retriever. To support this, we construct a fine-grained visual-language knowledge database containing representative images with wide visual diversity and expert-crafted attribute descriptions from multiple perspectives. Relevant fine-grained knowledge is retrieved from this database and fed into a visual-language augmented prompt, which provides rich multimodal context to guide MLLMs in generating accurate labels. To better evaluate the fine-grained recognition capabilities of MLLMs, we design a multiple-choice evaluation strategy based on four publicly available fine-grained datasets. Extensive experiments demonstrate that FineG-RAG consistently outperforms baseline methods, achieving superior recognition accuracy across a range of off-the-shelf, open-source MLLMs.

Abstract:
Deep learning-based video watermarking algorithms perform well in terms of robustness and perceptual quality. However, their resistance to HEVC compression remains a major limitation, especially under high compression ratios, where watermark extraction accuracy significantly degrades. To address this issue, this paper proposes a Spatio-temporally Enhanced Video Watermarking method (SEVMark) based on invertible neural networks (INNs). SEVMark introduces a channel attention mechanism in the temporal domain to adaptively focus on keyframes, and employs a spatial pyramid pooling module in the spatial domain to capture multi-scale features. These two modules work in tandem to enhance the spatio-temporal feature representation, achieving high robustness and imperceptibility. Furthermore, based on the HEVC encoding process, an HEVC video compression simulator (DiffH265) is designed and incorporated as a key component of the noise layer, guiding the encoder-decoder network to maintain high extraction accuracy under HEVC compression. Experimental results demonstrate that SEVMark outperforms state-of-the-art methods in both quantitative and qualitative evaluations, particularly demonstrating excellent robustness against HEVC compression attacks under high compression ratios.

Abstract:
Deep Neural Networks (DNNs) are vulnerable to adversarial patch attacks, which raises security concerns for face recognition systems using DNNs. Previous adversarial patch generation methods typically optimize perturbations in regions that maximally influence critical facial features. However, these existing methods are mostly limited to fixed shapes such as rectangles or squares. This confines subsequent patch texture optimization within these quadrilaterals, resulting in suboptimal adaptation to the complex geometric shapes of critical facial features, which may limit the effectiveness and transferability of the adversarial attacks. To address this issue, this paper proposes a PSO-based Adversarial Patch (PAP) method to generate a dynamic patch to be injected into the face. In the proposed PAP, by employing Particle Swarm Optimization (PSO) with adversarial similarity as the objective, the algorithm searches within a base circle to determine the optimal shape and position of the pre-defined patch. This approach enables the patch to exhibit extrapolation of polygonal deformations, ensuring that the patch optimally balances location, texture, and geometry, which enhances the adversarial transferability of the patch. To evaluate the vulnerability of face recognition models, we explore impersonation attacks under the closed box setting. Extensive experiments show that the proposed PAP improves attack performance across various face recognition models and datasets. Moreover, PAP achieves better transferability on commercial face recognition systems than existing methods.
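
The search step described above can be illustrated with a generic particle swarm optimizer. This is a minimal sketch, not the PAP implementation: the objective `toy_objective`, the three search variables (patch centre `cx`, `cy` and a `radius`), and all hyperparameters are hypothetical stand-ins, since the abstract does not specify them. In an attack setting the objective would be the negated adversarial similarity returned by a face recognition model.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimise `objective` over a box with a textbook global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # personal best positions
    best_val = [objective(p) for p in pos]         # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]     # global best so far
    history = [g_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Keep every particle inside the allowed search region.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
        history.append(g_val)
    return g_pos, g_val, history


def toy_objective(params):
    """Hypothetical stand-in for negated adversarial similarity."""
    cx, cy, radius = params
    return (cx - 0.3) ** 2 + (cy - 0.6) ** 2 + (radius - 0.2) ** 2


bounds = [(0.0, 1.0), (0.0, 1.0), (0.05, 0.5)]
best_pos, best_val, history = pso_minimize(toy_objective, bounds)
```

The global best value in `history` can only stay the same or decrease from one iteration to the next, which is why PSO needs no gradients from the target model and suits a closed-box setting.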

Abstract:
Recent advances in generative models have sparked growing interest in moving beyond pure image generation toward transparent image generation, i.e., the joint generation of an image and its alpha mask. However, most existing approaches adopt a two-stage pipeline, where a diffusion-based model first generates an RGB image and a subsequent matting head predicts the alpha mask. This separation not only leads to error accumulation and inaccurate predictions but also overlooks the intrinsic correlation between the two modalities. In this work, we introduce Zippo, a unified diffusion framework that zips color and transparency distributions into a single diffusion model by learning the joint distribution of RGB images and alpha masks. Zippo not only generates high-fidelity images but also produces plausible and sharp alpha masks. In practice, Zippo inflates the latent space into a unified representation that encodes cross-modal data, and builds upon it with a modality-aware diffusion process that flexibly switches between RGB and alpha domains. In this process, conditioning on one modality while denoising the other allows the model to generate RGB images from alpha masks and predict transparency from input images. In addition to single-modality prediction, we further design a modality-aware noise reassignment strategy that enables Zippo to jointly generate RGB images and their corresponding alpha masks under text guidance. With these techniques, Zippo supports a wide range of transparent image generation tasks, including image-alpha joint generation, image matting, and alpha mask conditioned image generation. Extensive experiments demonstrate that Zippo not only delivers superior visual fidelity but also achieves competitive performance in visual downstream prediction, highlighting joint image-alpha modeling as a powerful alternative to traditional paradigms.

Abstract:
In egocentric videos, collecting and annotating supervised data is more complicated and time-consuming than in exocentric videos, limiting research in this area. As a remedy, Unsupervised Domain Adaptation (UDA) enhances model performance on unlabeled target domains by bridging the distribution gap between source and target domains. However, UDA for egocentric action recognition is under-explored and faces unique challenges, such as simultaneously learning verb and noun representations, focusing on human-object interactions, and managing excessive verb-noun combinations. To tackle these issues, we propose a novel Unsupervised Domain Adaptation for Egocentric Action Recognition (UDA-EAR) approach that adaptively models egocentric actions and facilitates cross-domain knowledge transfer, improving recognition performance in unlabeled target domains. Specifically, UDA-EAR employs adaptive spatio-temporal and spatio-channel attention in a dual-branch pipeline to focus on motion intervals and interaction regions, respectively, allowing specialized learning of discriminative representations while avoiding negative combination dependencies from domain gaps. Additionally, an adversarial domain alignment mechanism aligns the data distributions between source and target domains, effectively transferring fine-grained verb-noun knowledge of egocentric videos. Extensive experiments demonstrate that UDA-EAR outperforms state-of-the-art baselines on widely used egocentric datasets, significantly improving egocentric action recognition accuracy. Our source code and datasets are available at https://github.com/zou-y23/UDA-EAR

Abstract:
Cross-view geo-localization (CVGL) offers a promising alternative for positioning in GNSS-constrained environments through visual matching techniques. Extreme viewpoint variations and the complexity of real-world scenes present significant challenges to this task. However, current methods primarily focus on learning single-scale features, which may be inadequate for practical applications. Although some approaches attempt to incorporate multi-scale representations, they may suffer from unimodal bias arising from structural discrepancies among model branches, limiting effective multi-scale feature extraction. To address these issues, we propose a fully multi-branch network architecture, named BEMN, which is designed to learn multi-scale robust feature representations. Specifically, we construct a multi-branch backbone network based on pretrained visual models and design a two-stage training strategy. In the first stage, a separate training scheme is employed to thoroughly optimize each branch of the network, and a joint feature alignment (JFA) module is introduced to align cross-view features. The entire network is fine-tuned in the second stage, where a frequency domain adjustment (FDA) module is designed to improve performance. To further assess the generalization ability of CVGL methods, we establish Xian-37, a highly challenging CVGL test dataset featuring complex real scenes captured from diverse platforms and viewpoints. Experimental results across multiple public benchmarks validate the superiority of our approach, achieving state-of-the-art performance and demonstrating outstanding generalization capabilities. Our code and model are available at https://github.com/VERYBC/BEMN

Abstract:
Given the challenge of balancing high fidelity with perceptual quality, multi-realism image compression has been developed to adapt flexibly to varying requirements. It allows images with different levels of realism to be decoded from the same bit stream. Diffusion models are known for generating images with high perceptual quality. However, their inherent process of adding and removing noise is often difficult to control and introduces additional distortion. This limits their direct application in image compression, especially in multi-realism image compression, which requires precise control to adapt to different requirements. To address this issue, we propose a Consistency Guided Diffusion Model as a post-processing network for multi-realism image compression, aiming to control the addition of detail representations and thereby adjust the trade-off between subjective quality and fidelity. In detail, our method introduces an additional consistency guided feature branch into the diffusion model to constrain the deviation caused by randomness in the diffusion process and thus ensure fidelity. Furthermore, a syntax-driven feature fusion module is constructed to guide the adaptive fusion of the two branches with an extra ultra-low-rate input stream, which contains the context information and trade-off control information. In addition, we design a warm-up based training strategy and adopt a continuous online optimization method to improve coding efficiency and trade-off control precision. Extensive experiments validate the superiority of our method over existing compression techniques, as well as the effectiveness of each component.

Abstract:
Tensor canonical correlation analysis (TCCA) has garnered significant attention due to its effectiveness in capturing high-order correlations in multi-view learning. However, existing TCCA methods often underemphasize the characterization of individual structures and lack algorithmic convergence guarantees. In order to deal with these challenges, we propose a novel sparse TCCA model called STCCA-L, which integrates sparse regularization of canonical matrices and Laplacian regularization of multi-order graphs into the TCCA framework, thereby effectively exploiting the geometric structure of individual views. To solve this non-convex model, we develop an efficient alternating manifold proximal gradient algorithm based on manifold optimization, which avoids computationally expensive full tensor decomposition and leverages a semi-smooth Newton method for resolving the subproblem. Furthermore, we rigorously prove the convergence of the algorithm and analyze its complexity. Experimental results on eight benchmark datasets demonstrate the superior classification performance of the proposed method. Notably, on the 3Sources dataset, it achieves improvements of at least 4.50% in accuracy and 6.77% in F1 score over competitors. Our code is available at https://github.com/zhudafa/STCCA-L

Abstract:
Sparse adversarial attacks perturb only a few pixels to achieve an attack, making them harder to detect and more dangerous. Recently, generative sparse attacks decouple the generation of sparse adversarial examples (AEs) into dense perturbations and sparse masks. By modeling the data distribution from clean examples to sparse AEs, generative sparse attacks mitigate the poor transferability that arises from over-reliance on gradients. These methods put effort into deriving optimal sparse masks on the generated perturbation. However, the quality of perturbation generation has always been overlooked, which limits the transferability of sparse AEs. To explore the influence of perturbation quality, we conduct empirical analyses of sparse gradient-based perturbations. The results show that directly applying sparsity to gradient-based perturbations disrupts their holistic adversarial information, leading to degraded attack performance. Therefore, it is critical to extract key adversarial knowledge from gradient-based perturbations while preserving their overall integrity to guide sparse adversarial attacks. Motivated by this observation, we propose to extract essential adversarial information from gradient-based AEs to guide the generator to produce higher-quality dense perturbations and stronger transferable sparse AEs. Specifically, we introduce the Gradient Perturbation Guidance (GPG) sparse adversarial attack, which integrates gradient adversarial feature guidance and gradient perturbation guidance regularization. The former guides the generator to capture gradient-based adversarial features during encoding, while the latter refines adversarial knowledge from gradient-based perturbations during decoding. Extensive experiments on ImageNet-1K show that our GPG significantly boosts transferability compared to state-of-the-art methods under consistent sparsity constraints. Our code is available at https://github.com/bookman233/GPG
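
The decoupling into a dense perturbation and a sparse mask mentioned above amounts to composing x_adv = clip(x + m * delta). The following is a minimal sketch of that composition only. The top-k magnitude mask is an assumption made for illustration (generative sparse attacks learn the mask), and the pixel values are made up.

```python
def top_k_mask(perturbation, k):
    """Binary mask keeping the k entries with the largest absolute value.

    Illustrative stand-in only: a generative attack would predict the mask.
    """
    order = sorted(range(len(perturbation)),
                   key=lambda i: abs(perturbation[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(perturbation))]


def compose_sparse_example(clean, perturbation, mask):
    """Element-wise x_adv = clip(x + m * delta, 0, 1) on a flattened image."""
    return [min(1.0, max(0.0, x + m * d))
            for x, d, m in zip(clean, perturbation, mask)]


clean = [0.2, 0.5, 0.9, 0.4, 0.1, 0.7]           # flattened toy image in [0, 1]
delta = [0.05, -0.30, 0.02, 0.25, -0.01, 0.20]   # dense perturbation
mask = top_k_mask(delta, k=2)                    # sparsity budget of 2 pixels
adv = compose_sparse_example(clean, delta, mask)
changed = sum(1 for a, c in zip(adv, clean) if a != c)
```

Only the pixels selected by `mask` differ from the clean input, so `changed` equals the sparsity budget here. The rest of the dense perturbation is discarded, which is the loss of holistic adversarial information that the abstract points to.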

Abstract:
The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM’s segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.
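
For readers unfamiliar with the Pr@0.9 metric cited above, it is commonly computed as the fraction of test samples whose predicted mask has an IoU above 0.9 with the ground truth. The sketch below assumes that definition. The masks are toy data, and scoring an empty prediction against an empty ground truth as IoU 1.0 is an assumption about how no-target samples are handled.

```python
def iou(pred, gt):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly above the threshold."""
    if not ious:
        return 0.0
    return sum(1 for v in ious if v > threshold) / len(ious)


gt = [[1] * 5 for _ in range(4)]              # 20 foreground pixels
pred_exact = [row[:] for row in gt]           # IoU 1.0
pred_good = [row[:] for row in gt]
pred_good[0][0] = 0                           # misses 1 pixel, IoU 0.95
pred_poor = [[0, 0, 0, 0, 1]] + [row[:] for row in gt[1:]]  # misses 4, IoU 0.8

ious = [iou(p, gt) for p in (pred_exact, pred_good, pred_poor)]
pr_at_09 = precision_at(ious, 0.9)
pr_at_05 = precision_at(ious, 0.5)
```

A mask that is only roughly correct (IoU 0.8) counts at the loose threshold but not at 0.9, so a high Pr@0.9 indicates precise boundaries rather than approximate localization.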

Abstract:
Neural-network model trading raises two unmet requirements for watermarking: exclusive ownership verification and updateability (transfer/revoke) without retraining. We present a training-time framework that jointly embeds a key-driven dynamic label mapping and a proactive, self-defending trigger. The label mapping encodes an owner-specific signature that authorized key holders can verify, update, or transfer by rotating keys; the defensive trigger actively hardens the model against unauthorized backdoor insertion and fake-watermark attempts. Our design yields three properties: (i) exclusivity, meaning only the correct key decodes the watermark; (ii) updateability, meaning ownership can be changed without retraining the backbone; and (iii) robustness, meaning the watermark persists under common post-deployment changes. Across CIFAR-10/100, GTSRB, and Tiny ImageNet on five architectures, the method achieves a watermark success rate above 97% while preserving original accuracy within 0.5 percentage points, retains more than 85% watermark strength after fine-tuning and 60% pruning, and reduces fake-watermark success to below 1.5% (vs. more than 95% on unprotected models). By coupling updateable encoding with proactive defense, our approach offers a practical, scalable path to secure, transferable ownership verification for neural-network marketplaces and model exchanges.

Abstract:
Diffusion probabilistic models have effectively addressed the ill-posed nature of cardiac magnetic resonance imaging (CMRI) super-resolution (SR) by learning high-resolution image distributions from low-resolution inputs. However, the iterative sampling process in these models often suffers from slow inference speeds, as well as limitations in the quality and structural consistency of the generated images. To address these challenges, we propose a continuous-time conditional diffusion model (CCDM) for blind CMRI SR. Specifically, we propose a continuous-time conditional diffusion module that reduces the time consumption of the diffusion probability model by maintaining the mean and variance of the data in the forward process. Meanwhile, we design a cascaded residual attention network as a feature extractor to enhance the model’s discriminative power and feature representation capabilities. To further elevate image fidelity, we propose an image quality loss module that integrates a score matching loss, significantly improving detail reconstruction and overall perceptual quality. Furthermore, we develop a hybrid score predictor that approximates the conditional score function via a hybrid parameterized denoising network, facilitating efficient CMRI generation through probability flow sampling. Extensive experimental results demonstrate that compared to existing diffusion model-based SR methods, our CCDM achieves significant improvements in SR quality while substantially reducing time consumption.

Abstract:
Display quality assessment plays a crucial role in evaluating the performance of display devices. However, existing video quality assessment methods primarily target compression-related distortions, failing to capture display-specific degradations including definition loss, color distortions, and motion artifacts that critically affect user subjective experiences during video playback. To address these limitations, we develop a specialized video dataset, namely Video Displaying Quality Assessment Dataset (VDQA), constructed using a DSLR camera with standardized parameter optimization of exposure settings (aperture, ISO sensitivity, and shutter speed). VDQA comprises 250 high-resolution video clips covering diverse content categories, providing a robust foundation for evaluating display devices across multiple quality dimensions. Additionally, we propose a deep learning-based model specifically designed for display quality assessment that employs three complementary pathways to independently evaluate definition, color fidelity, and motion quality. The model integrates Canny edge detection for explicit sharpness measurement, a color attention mechanism to enhance sensitivity to display color reproduction characteristics, and temporal modeling for motion artifact assessment. Experimental results demonstrate that the proposed model achieves superior performance in reflecting user subjective experiences for display content videos compared to state-of-the-art methods, with significant improvements in both color fidelity assessment and definition evaluation.

Abstract:
The geometry of road surfaces plays a critical role in the performance of autonomous driving systems. Consequently, accurate and efficient road surface reconstruction (RSR) is of paramount importance. However, due to the inherent effects of perspective projection, distant regions often exhibit geometric distortions and a long-tailed distribution, which pose significant challenges to existing reconstruction methods. To address these issues, we propose a novel framework, termed Direction-aware Pseudo-Stereo Road Reconstruction Network (DPS-Net), which incorporates two lightweight, plug-and-play modules: a Direction-Aware Feature Enhancement (DFE) module and a Pseudo-Stereo Fusion (PSF) module. The DFE module is designed to enhance the perception of sparse and geometry-invariant features by integrating directional context, while the PSF module captures global dependencies across spatial and channel dimensions through pseudo-stereo fusion. Both modules are constructed with an emphasis on maintaining low computational complexity. We conduct extensive experiments on the public RSRD dataset to evaluate the effectiveness of the proposed method. The code is available at https://github.com/yidanyi/DPS-Net

Abstract:
The rapid growth of cloud gaming and game streaming has led to a substantial increase in the volume of game content data. To ensure real-time delivery of cloud game content, a common strategy is to downsample and compress the game content before transmission, reducing both data size and bandwidth requirements. However, this approach presents considerable obstacles for super-resolution (SR) networks at the receiver side. In particular, the degraded quality of compressed video streams, combined with the stringent demand for real-time processing, poses major challenges for practical SR applications. In this paper, we propose a novel real-time super-resolution framework that works directly in the compressed domain by exploiting coding-domain priors. Specifically, we propose an extremely lightweight U-Net architecture that leverages prediction maps and residuals as its primary guidance signals. Furthermore, we incorporate the partition map into a Pixel Adaptive Convolution (PAC) module, allowing the convolution kernels to adapt to different regions in the decoded frame. The resulting deep features are then fused with those from the U-Net backbone through an attention block. Finally, we present an enhanced re-parameterization block designed to better model edge features, leading to notable gains in both the objective metrics and subjective visual quality of the reconstructions. Extensive experiments demonstrate that the proposed method consistently outperforms existing real-time approaches on compressed game video content, achieving superior performance in both quality and efficiency.

Abstract:
Compressing Synthetic Aperture Radar (SAR) images presents unique challenges due to the high dynamic range and inherent acquisition noise in the amplitude signal, as well as the noise-sensitive and limited information content in the phase signal. Traditional compression methods, such as JPEG and JPEG2000, although widely used, often fail to preserve SAR image quality due to their susceptibility to compression artifacts. The continuous capture of high-resolution raw SAR images over extended periods on drones and Unmanned Aerial Vehicles (UAVs), combined with constraints on computational resources, bandwidth, and onboard storage, further complicates the problem. An effective and efficient compression pipeline is essential for either onboard storage or real-time transmission to ground stations. In this work, we propose a hybrid solution for complex-valued SAR image compression that uses the Versatile Video Coding (VVC) framework as the backbone compression engine and employs a deep learning-based method operating jointly in the pixel and transform domains to deblock and reconstruct SAR amplitude and phase images. Specifically, we design task-specific compression artifact removal networks called AmpRes and AngRes for amplitude and phase reconstruction, respectively. Additionally, we introduce the GradRes network to learn gradients for SAR Scale-Invariant Feature Transform (SAR-SIFT), resulting in robust orientation and magnitude estimations that improve downstream tasks such as keypoint detection and matching in noisy and compressed scenarios. Experimental results demonstrate that our approach achieves 10% Bjøntegaard Delta (BD)-Rate savings over VVC for amplitude recovery, along with notable improvement in phase reconstruction, and delivers an average 34% improvement in SAR-SIFT repeatability.

Abstract:
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that are both audible and visible in a long video, where events may co-occur and exhibit varying durations. However, complex audio-visual scenes often involve asynchronization between modalities, making accurate localization challenging. Existing DAVE solutions extract audio and visual features through unimodal encoders and fuse them via dense cross-modal interaction. However, independent unimodal encoding struggles to emphasize shared semantics between modalities without cross-modal guidance, while dense cross-modal attention may over-attend to semantically unrelated audio-visual features. To address these problems, we present LoCo, a Locality-aware cross-modal Correspondence learning framework for DAVE. LoCo leverages the local temporal continuity of audio-visual events as guidance to filter irrelevant cross-modal signals and enhance cross-modal alignment throughout both the unimodal and cross-modal encoding stages. Specifically, i) LoCo applies Local Correspondence Feature (LCF) Modulation, which makes the unimodal encoders focus on modality-shared semantics by modulating agreement between audio and visual features based on local cross-modal coherence. ii) To better aggregate cross-modal relevant features, we further introduce Local Adaptive Cross-modal (LAC) Interaction, which dynamically adjusts attention regions in a data-driven manner. This adaptive mechanism focuses attention on local event boundaries and accommodates varying event durations. By incorporating LCF and LAC, LoCo provides solid performance gains and outperforms existing DAVE methods. The source code will be released.

Abstract:
Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, and cartoons), existing methods for natural scene visual text generation still have significant deficiencies: methods based on rendering engines rely on manually crafted rules, which struggle to adapt to diverse backgrounds and leave obvious artificial traces, while their text layouts may be placed in unreasonable areas (e.g., sky or ground) and text content is semantically disconnected from the scene; diffusion model-based methods, on the other hand, face difficulties in generating small characters, depend on manually designed prompts to ensure reasonable layout and content, fail to generate text at precise locations, and cannot effectively control text attributes (e.g., font and color). In this paper, we propose a two-stage method named SceneVTG++ to address these issues. SceneVTG++ comprises two core components: a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former leverages the world knowledge and visual reasoning capabilities of multimodal large language models to identify reasonable text areas and recommend scene-relevant text content based on natural scene background images; the latter generates controllable multilingual text using a diffusion model, ensuring alignment with the outputs of TLCG. Through extensive experiments, we verified the effectiveness of both TLCG and CLTD, and demonstrated that SceneVTG++ achieves state-of-the-art performance in natural scene visual text generation. Additionally, the images generated by SceneVTG++ exhibit superior utility for training natural scene optical character recognition (OCR) tasks, including text detection and text recognition. Codes and datasets will be made publicly available.

Abstract:
With the rapid advancement of diffusion-based text-to-video generation, challenges surrounding video ownership verification and copyright protection have become increasingly urgent. Traditional digital watermarking techniques are typically handcrafted for specific types of distortions, while existing diffusion-based watermarking methods are primarily applied to image-level tasks. As a result, the effectiveness of these methods diminishes significantly when videos undergo complex transformations. To address this issue, we introduce RINGet, a robust watermarking framework for diffusion-based video generation. RINGet embeds user-defined keys into the initial latent variables in the Fourier domain while maintaining imperceptibility in the spatial domain through reversible Fourier transforms. The framework adopts radius-based and ring-based segmentation strategies to improve robustness against rotation while preserving the quality of the generated videos. Moreover, to mitigate distribution shifts caused by watermark embedding, the key is partitioned into discrete segments and distributed across different initial latent variables. Extensive experiments demonstrate the superiority of the RINGet framework over traditional approaches in terms of robustness, particularly under severe perceptual-domain distortions, while preserving high video quality and inference efficiency.
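
One way to see why a ring-based layout in the Fourier domain helps against rotation is that a ring is defined by radius alone, and radius does not change when the spectrum is rotated about its centre. The sketch below illustrates only that geometric point. The function names, the 9 x 9 grid, and the ring width are hypothetical, and RINGet's actual segmentation and key embedding are not described in enough detail here to reproduce.

```python
import math


def ring_index(u, v, size, ring_width):
    """Ring index of bin (u, v) in a centred, odd-sized square spectrum."""
    centre = size // 2
    du, dv = u - centre, v - centre
    radius = math.sqrt(du * du + dv * dv)
    return int(radius // ring_width)


def ring_map(size, ring_width):
    """Grid that assigns every frequency bin to a concentric ring."""
    return [[ring_index(u, v, size, ring_width) for v in range(size)]
            for u in range(size)]


def rotate90(grid):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


rings = ring_map(size=9, ring_width=1.5)
unchanged_by_rotation = rotate90(rings) == rings
```

Because each bin keeps its ring after the rotation, a key segment written uniformly along one ring would land on the same ring again. Arbitrary rotation angles on a discrete grid only hold approximately, which this toy check with a 90 degree turn does not cover.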

Abstract:
Federated Learning (FL) enables collaborative model training with data privacy but is exposed to malicious Clients, namely model leakers, who compromise group intellectual property by secretly distributing or selling the valuable trained models. Some preliminary works have studied this traceability issue, enabling the identification of model leakers within the FL network when the model is maliciously distributed. However, these works often achieve traceability by compromising the identity privacy of legitimate Clients. In this paper, we propose AnonymTracker, an anonymous FL model leaker tracing scheme. To protect the anonymity of legitimate Clients, we design a group signature-based fingerprint embedding mechanism, combining group signatures with model watermarking for effective and undeniable leaker identification. We further embed a unique fingerprint in each Client’s model during training and use a cosine-similarity-based metric to compare extracted and embedded fingerprints in the tracing phase, enhancing leak identification accuracy. The security analysis demonstrates that AnonymTracker protects the identity anonymity of legitimate Clients. Experiments on benchmark datasets and models confirm its effectiveness, fidelity, and robustness against various watermark removal attacks.
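
The cosine-similarity comparison in the tracing phase can be sketched as follows. This is an illustration under stated assumptions, not AnonymTracker's procedure: the fingerprint vectors, the pseudonymous labels such as `fp_1`, and the 0.8 decision threshold are all hypothetical, and the group-signature layer that keeps legitimate Clients anonymous is omitted entirely.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def trace_fingerprint(extracted, embedded, threshold=0.8):
    """Return (label, score) of the best match, or (None, score) if too weak."""
    best_label, best_score = None, -1.0
    for label, fingerprint in embedded.items():
        score = cosine_similarity(extracted, fingerprint)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical +/-1 fingerprints registered under pseudonymous labels.
embedded = {
    "fp_1": [1, -1, 1, 1, -1, 1, -1, -1],
    "fp_2": [-1, -1, 1, -1, 1, 1, 1, -1],
    "fp_3": [1, 1, -1, 1, 1, -1, -1, 1],
}
# Fingerprint recovered from a leaked model, distorted by e.g. fine-tuning.
extracted = [0.9, -1.1, 0.8, 1.0, -0.7, 1.2, -0.9, -1.0]
label, score = trace_fingerprint(extracted, embedded)
```

Cosine similarity ignores the overall scale of the extracted vector, so a fingerprint that is attenuated but keeps its sign pattern still matches its source. The threshold guards against attributing a leak when no registered fingerprint is a convincing match.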

Abstract:
As societal focus on image authenticity grows, image manipulation localization has become a crucial and challenging task in computer vision. Current methods relying on dual-stream encoders to extract features from both RGB and noise images often suffer from feature misalignment and information loss during fusion. Moreover, many localization methods use loss functions to identify manipulated areas, but balancing weights between manipulated regions and edges remains challenging. To address these challenges, we propose a novel method that integrates features in dual-stream networks with adaptive selective state spaces. By treating the two output features from the dual-stream encoder as system inputs, we construct a feature space that optimizes the system’s state space. Introducing temporal dynamics enriches the feature representation and enhances learning capabilities, significantly improving the accuracy and reliability of image manipulation localization. Additionally, we propose an edge residual review module that refines the boundaries of manipulated regions from the preliminary output, subsequently enhancing the input features for improved re-localization accuracy. Extensive experiments demonstrate that our approach yields competitive results on diverse large-scale image datasets, outperforming most state-of-the-art methods in both precision and robustness.

Abstract:
Pre-trained Vision-Language Models (VLMs) are often used to tackle the challenging task of Open-vocabulary Segmentation (OVS). To preserve the valuable pre-trained knowledge of VLM-based mask classifiers, most existing approaches freeze their parameters during training. However, our comprehensive analysis identifies a previously overlooked limitation: the performance of OVS is primarily constrained by mask classification. Specifically, VLMs pre-trained using globally pooled image-text representations often fail to capture localized, region-specific semantics necessary for accurate segmentation. This discovery motivates us to improve the fine-grained alignment between word-level text features and pixel-level image features extracted by VLMs. To this end, we propose the Fine-grained Semantic Reconstruction (FiSeR), a novel auxiliary task designed to enrich the spatial semantic detail of visual features. FiSeR trains the model to predict a randomly masked target class label using the image features and the remaining unmasked text. This encourages the model to link the specific words to the corresponding image regions, improving its ability to recognize and segment objects at the region level. FiSeR is broadly applicable and can be incorporated into various VLM-based segmentation models to improve their performance. Additionally, we introduce the Text-guided Visual Aligner (TeVA), a lightweight network module that injects relevant fine-grained semantics from the text information early in the visual encoding process. This enables the model to condition its visual processing on the target text categories from the beginning, improving its ability to associate text with the correct spatial regions. Collectively, these innovations culminate in our proposed framework FOV-Seg. 
Notably, FOV-Seg achieves new state-of-the-art results across multiple representative OVS benchmarks, improving performance consistently and reducing training costs by nearly 5× compared to previous best methods. Our code and data will be released.

Abstract:
Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods.

Abstract:
Deep learning has achieved significant success in hyperspectral image super-resolution (HSISR) by leveraging advanced feature extraction techniques to reconstruct high-resolution images from low-resolution counterparts. However, existing methods predominantly utilize 2D/3D convolutions or Transformer architectures, which are often hindered by limited receptive fields, quadratic computational complexity, and inadequate fusion of spatial-spectral dependencies. To address these challenges, this paper proposes RWKVSR, a novel lightweight network that integrates a Receptance Weighted Key-Value (RWKV) architecture for efficient HSISR. The proposed RWKVSR comprises three key components: 1) A linear-complexity RWKV module replacing quadratic self-attention, enabling efficient global spectral-spatial modeling; 2) A Spectral-Spatial Residual Module (SSRM) employing anisotropic, direction-separable 3D convolutions to hierarchically extract multi-scale features while enhancing local-global interactions; and 3) A Hyperspectral Frequency Loss (HFL) optimizing spectral consistency by prioritizing high-frequency structural alignment between reconstructed and ground-truth images in the frequency domain. Extensive experiments conducted on the CAVE and Harvard datasets demonstrate that RWKVSR outperforms the existing state-of-the-art methods, effectively balancing accuracy and efficiency, and providing a practical solution for high-quality HSI reconstruction. Our code is publicly available at https://github.com/backy-1/RWKVSR.git
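
The idea of a frequency-domain loss that prioritizes high-frequency alignment can be illustrated with a small sketch. This is not the HFL defined in the paper: the 1D signal, the naive DFT, the linear weight `1 + freq`, and the function name `high_freq_weighted_loss` are all assumptions made for illustration.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def high_freq_weighted_loss(pred, target):
    """Mean absolute spectral difference, weighted toward high frequencies.

    Assumed weighting (illustrative only): 1.0 at DC, rising linearly
    to 2.0 at the Nyquist frequency.
    """
    n = len(pred)
    spec_pred, spec_target = dft(pred), dft(target)
    total = 0.0
    for k in range(n):
        # Normalized frequency: 0 at DC, 1 at Nyquist (bins k and n-k mirror each other).
        freq = min(k, n - k) / (n / 2)
        total += (1.0 + freq) * abs(spec_pred[k] - spec_target[k])
    return total / n


if __name__ == "__main__":
    n = 8
    target = [0.0] * n
    low_freq_error = [1.0] * n                                    # constant offset
    high_freq_error = [1.0 if t % 2 == 0 else -1.0 for t in range(n)]  # alternating sign
    print(high_freq_weighted_loss(low_freq_error, target))
    print(high_freq_weighted_loss(high_freq_error, target))
```

Both error signals have the same energy, but under the assumed weighting the alternating (high-frequency) error is penalized more heavily than the constant offset.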

Abstract:
Image restoration aims to recover high-quality images from degraded observations, yet real-world degradations are complex, coupled, and difficult to model. Existing task-specific methods struggle to generalize beyond predefined degradation types, while recent all-in-one or prompt-based methods still face three key challenges: 1) they rely on task-specific training or fixed prompt pools, limiting adaptability to real-world and mixed degradations; 2) human-instruction or implicit-prompt mechanisms make them difficult to use in practice; and 3) they often fail to balance structural fidelity and perceptual realism. To address these issues, we propose Diff-Restorer, a diffusion-based universal image restoration framework that unifies diverse degradation handling within a single model. Diff-Restorer adaptively extracts decoupled visual prompts from a visual-language model (CLIP), including clear semantic and degradation embeddings. The clear semantic embeddings serve as content prompts to guide the diffusion model for generation, improving perceptual quality. The degradation embeddings, serving as the task identifier, modulate the Image-guided Control Module to generate structure control, ensuring faithfulness. Furthermore, we design a Task-aware Decoder to perform structural correction and convert the latent code to the pixel domain. Extensive experiments on various single, real-world, and mixed degradation tasks show that Diff-Restorer outperforms state-of-the-art methods in terms of generality, realism, and fidelity.

Abstract:
Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering (PVC) aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Specifically, we mitigate view distribution shifts by aligning cross-view covariance matrices, which enables the inference of a semantic graph for all data. Guided by the learned semantic graph, we further exploit semantic consistency across views through semantic matching contrastive learning. After the optimization of the above mechanisms, our model smoothly performs semantic matching for different view embeddings instead of the cumbersome view realignment, which enables the learned representations to enjoy richer category-level semantics and stronger robustness. 
Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem. The code is available at https://github.com/THPengL/SMART
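
Aligning cross-view covariance matrices can be sketched as follows. This is a minimal illustration rather than the SMART objective: the squared Frobenius distance, the function names, and the toy embeddings are assumptions. The point it shows is that covariance is a second-order statistic, so the two views need the same feature dimension but not sample-wise correspondence.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def covariance_gap(view_a, view_b):
    """Squared Frobenius distance between the two views' covariance matrices.

    Assumed form of the alignment penalty (illustrative only).
    """
    cov_a, cov_b = covariance(view_a), covariance(view_b)
    d = len(cov_a)
    return sum((cov_a[i][j] - cov_b[i][j]) ** 2 for i in range(d) for j in range(d))


if __name__ == "__main__":
    view_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
    shuffled = list(reversed(view_a))               # same samples, different order
    stretched = [[2 * x, 2 * y] for x, y in view_a]  # different second-order statistics
    print(covariance_gap(view_a, shuffled))
    print(covariance_gap(view_a, stretched))
```

Reordering the samples leaves the penalty at zero, while rescaling one view makes it positive, which is why such a term can be computed on unaligned data.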

Abstract:
Multiscale sparse representation offers significant advantages in point cloud geometry compression, delivering state-of-the-art performance compared to both standardized solutions and other learned approaches. A crucial component of this framework is the cross-scale occupancy prediction, which employs the lower-scale reference representation either from the current frame alone or from both the current and temporal reference frames to establish conditional priors for either static or dynamic coding. However, existing works mainly use local computations, e.g., sparse convolutions and kNN attention, to exploit correlations in such a representation; these methods usually fail to adequately capture global coherence. In addition, the fixed configuration of lossless-lossy scales cannot adapt to temporal dynamics, which limits the reconstruction quality of temporal references in dynamic coding. These limitations constrain the generation of more effective priors used for conditional coding. To address these issues, we propose two new techniques. The first is KPA (Key Point-driven Attention), which integrates both local and global characteristics. The second is AdaScale (Adaptive Lossy/Lossless Scale), which decides whether the transitional scale should be in lossless or lossy mode based on temporal displacement, thereby enhancing the reconstruction quality of the temporal reference. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, including rule-based standard codecs like G-PCC and V-PCC, as well as learning-based approaches like Unicorn and TMAP, across both static/dynamic and lossy/lossless coding scenarios.

Abstract:
Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent advances in text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead, both in memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework utilizes a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Crucially, our experiments demonstrate that this knowledge transfer is the primary driver of performance gains, rather than mere architectural optimization. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model’s ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. Codes are available at https://github.com/Zirconium233/DTPF

Abstract:
This paper proposes a novel dehazing method termed Haze-Restoration Curve Model (HRCM), which transforms the single-image dehazing task into a specific curve estimation problem, achieving haze removal through an intuitive and simple nonlinear curve mapping. Unlike methods based on the Atmospheric Scattering Model (ASM), HRCM does not require the computation of complex physical parameters. Instead, it estimates two intuitive curvature adjustment coefficients. Moreover, compared to recent end-to-end dehazing methods, HRCM circumvents the challenging modeling of static mapping functions, thereby improving the generalization ability and dehazing performance of the model. All of these are attributed to a meticulously designed dehazing curve, which first reverses the hazy image to highlight obscured regions, and then specifies a set of high-order functions to remap hazy pixels for image restoration. Moreover, to estimate the curve parameters, we design a dual-branch Deep Dehaze Curve Estimation Network (DDCEN), which consists of the Residual Swin Transformer Block (RTSB) and the Large kernel convolutional Attention Block (LAB). Specifically, RTSB captures the global fog density distribution features of foggy images by introducing window self-attention and shifted window mechanisms, providing global semantic information for subsequent parameter estimation. LAB captures local multi-scale features by constructing a large receptive field, and uses the attention mechanism of feature pooling in horizontal and vertical directions to focus on detail regions, refining the local details of the parameter map. Extensive experiments on synthetic and real-world hazy image datasets demonstrate that the proposed approach achieves superior performance in terms of quantitative accuracy and subjective visual quality compared to the current state-of-the-art methods. The source code of our HRCM is available at https://github.com/larrylanrui/HRCM

Abstract:
Existing infrared and visible image fusion methods commonly use two structurally identical networks to extract deep features from source images, followed by a handcrafted or learnable feature fusion strategy. These methods overlook the modality-specific characteristics of the two image types, impairing the model’s ability to fully exploit their complementary information. Additionally, their fusion results often exhibit issues such as texture detail loss or unclear thermal targets. This is because the fusion rules they use are either too simple or too redundant. To address these challenges, we start from the infrared physics priors that are naturally complementary to visible images and incorporate the thermal diffusion equation and Stefan-Boltzmann Law into the image fusion architecture. Based on these two physical priors, we design a Thermal Diffusion Convolution (TDC) and a Stefan Thermal Attention (STA) to better extract infrared-specific features. Specifically, the TDC module leverages the anisotropic and isotropic characteristics of thermal diffusion adaptively to sharpen the edges of thermal targets and remove infrared noise, minimizing artifacts in the fused results. By decomposing the Stefan-Boltzmann Law, STA pays more attention to thermal features while suppressing redundant information, enabling more effective aggregation of complementary modality-specific details. To make full use of layer-wise complementary features, we propose an Interactive Injection Fusion (IIF) framework that hierarchically integrates these features, enhancing the richness of fused image content. Furthermore, an energy conservation constraint is designed to ensure the fused images adhere to physical principles. Extensive experimental results on five datasets demonstrate that our method sets a new state-of-the-art. Code is available at https://github.com/QiaoLiuHit/PPIFuse
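
For reference, the two physical priors named above have the following standard textbook forms. The symbols are conventional choices, not taken from the paper, and the abstract does not state how the paper decomposes or discretizes them.

```latex
% Stefan-Boltzmann law: radiant exitance M of a gray body with emissivity
% \varepsilon at absolute temperature T (\varepsilon = 1 for an ideal black body).
M = \varepsilon \sigma T^{4},
\qquad \sigma \approx 5.67 \times 10^{-8}\ \mathrm{W\,m^{-2}\,K^{-4}}

% Isotropic heat (thermal diffusion) equation for a temperature field u(x, y, t)
% with constant diffusivity \alpha.
\frac{\partial u}{\partial t} = \alpha \nabla^{2} u

% Anisotropic variant: the diffusivity D varies with position (and possibly
% direction), so diffusion can be suppressed across edges.
\frac{\partial u}{\partial t} = \nabla \cdot \bigl( D(x, y)\, \nabla u \bigr)
```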

Affiliations: Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Shenzhen, China; Institute of Artificial Intelligence, Beihang University, Beijing, China; Dongguan University of Technology, Dongguan, Guangdong, China; College of Computing and Data Science, Nanyang Technological University, Jurong West, Singapore; Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Abstract:
Expressing and controlling fine-grained spatial attributes of objects in large-scale models presents significant challenges, as these spatial attributes are often difficult to describe textually and exhaustive enumeration is impractical. This hinders effective alignment with user preferences regarding spatial attribute-object relationships in fine-grained synthesis tasks. To tackle this problem, we propose AttrObjDiff, a novel framework built on the pre-trained Stable Diffusion model to integrate spatial attribute maps. Firstly, AttrObjDiff constrains the denoising step using trainable cross-attention fusion modules, attribute-enhancing cross-attention and LoRAs. The fusion modules take layout features extracted by a frozen ControlNet and corresponding fine-grained attribute maps as inputs to generate joint constraint features of spatial attribute-object relationships. We leverage attribute-enhancing cross-attention within the U-Net to further refine these spatial attributes. Finally, LoRAs are employed to align with these joint constraint features of fine-grained relationships. Secondly, AttrObjDiff enhances the reverse process with lightweight noise reranking models to improve spatial object-attribute alignment. The reranking models select semantic noises related to fine-grained relationships, improving synthesis quality without significantly increasing computational costs. Experimental results demonstrate that our method can generate high-quality images guided by fine-grained spatial object-attribute relationships, improving synthesis controllability and semantic consistency.

Abstract:
The development of the internet has greatly facilitated the transmission of images over social networks, while also triggering serious copyright issues. Deep robust watermarking serves as a crucial technique for image copyright protection. However, the image distortions caused by Social Network Transmission Operations (SNTOs) make existing deep robust watermarking methods fragile in real-world social network scenarios. To address this, we propose a Curriculum Learning-based Deep Robust Watermarking method, called CL-DRW, to generate watermarks that can be resilient to SNTOs. Specifically, we develop a watermarking model constructed with an invertible neural network and present a multi-stage training framework based on curriculum learning to train it effectively. We incrementally introduce noise attacks based on their disruptive impact on the watermark, from weak to strong, thereby enabling our model to build robustness against SNTOs gradually. Additionally, we design an SNTOs simulation noise layer, which is built upon a transformer-based deep network and incorporates differentiable JPEG, to simulate the closed-box distortions caused by SNTOs. Extensive experiments indicate that our proposed CL-DRW outperforms state-of-the-art deep watermarking methods in terms of robustness against real-world social network transmission operations. Source code is available at https://github.com/yingshuai-zhao/CL-DRW

Abstract:
In this work, we propose an efficient and streamlined paradigm to address the challenge of consistency prediction in generalized multi-task visual grounding. While most existing approaches primarily focus on integrating multi-modal information and employing multi-task learning to enhance both visual and linguistic understanding, they often rely on joint supervision at the region and pixel levels to exploit task complementarities. In contrast, C3VG explores the relatively under-addressed problem of consistency across multi-task predictions. To this end, a multi-task visual grounding framework based on a coarse-to-fine architecture is introduced. Empirical studies demonstrate that the incorporation of both implicit and explicit consistency constraints substantially enhances the coherence between detection and segmentation outputs. However, C3VG is restricted to single-referent visual grounding scenarios and exhibits limited generalizability to real-world applications, which often involve multiple referents or even no referent at all. To overcome these limitations, we propose GC3VG, which incorporates three key advancements: 1) extension to generalized scenarios, including both multi-referent and non-referent cases; 2) a Unified Coherent Refinement Module that implicitly encodes region- and instance-level features while explicitly modeling their relational alignment through an IoU-based constraint; and 3) a Granularity-aware Hard-mining Alignment strategy that enforces prediction consistency in the feature space and simultaneously enhances the discriminative power of visual and linguistic representations. Extensive experiments on RefCOCO/+/g and gRefCOCO demonstrate the effectiveness and generalizability of the proposed framework.

Abstract:
Bit-Depth Enhancement (BDE) is designed to restore High-Bit-Depth (HBD) images from Low-Bit-Depth (LBD) input, but existing methods mostly fail to exploit their algorithmic advantages in extreme low-bit cases. In this paper, a new framework that combines noise shaping, latent space modeling, and an adaptive weighting block (AWB) is proposed to solve the above problems. First, based on the noise shaping method, Sigma-Delta ($\Sigma$-$\Delta$) quantization is introduced to achieve low-redundancy, highly reversible low-bit image construction. Second, based on the visual averaging property and Inverse Problem Transform (IPT), the conditional posterior distribution $p_z(\mathrm{HBD} \mid \mathrm{LR})$ is introduced for the first time to model the relationship between low-resolution (LR) and HBD images in the latent space to achieve the recovery of high-frequency details. To realize effective interaction and adaptive fusion of spatial and bit information, the hierarchical feature discovery module (HFDM) is introduced in network construction to generate multiscale LBD-LR image pairs, while AWB dynamically fuses features according to the amount of feature information. This framework achieves leading performance in both quantitative metrics (PSNR/SSIM) and visual quality, and provides a new idea for high-bit image generation.
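
The noise-shaping idea behind Sigma-Delta quantization can be sketched with a generic first-order error-feedback quantizer. This is a textbook scheme applied to one row of pixel values, not the paper's low-bit image construction; the function name, the 8-bit to 2-bit setting, and the 1D scan order are assumptions.

```python
def sigma_delta_quantize(samples, in_bits=8, out_bits=2):
    """First-order error-feedback requantization of integer samples.

    Each sample's quantization error is carried into the next sample, so
    the error is pushed toward high frequencies and local averages of the
    low-bit signal track the original values.
    Returns the low-bit codes and the step size used for reconstruction.
    """
    top_code = (1 << out_bits) - 1      # largest output code
    max_in = (1 << in_bits) - 1         # largest input value
    step = max_in / top_code            # input units per output code
    error = 0.0
    codes = []
    for value in samples:
        target = value + error          # add the error carried from earlier samples
        code = min(top_code, max(0, round(target / step)))
        codes.append(code)
        error = target - code * step    # carry the new error forward
    return codes, step


if __name__ == "__main__":
    row = [100] * 64                    # flat 8-bit region
    codes, step = sigma_delta_quantize(row)
    shaped_mean = sum(c * step for c in codes) / len(codes)
    plain_mean = sum(round(v / step) * step for v in row) / len(row)
    print(f"shaped mean: {shaped_mean:.2f}, plain rounding mean: {plain_mean:.2f}")
```

On a flat region, plain rounding maps every pixel to the same 2-bit level and loses the original intensity, whereas the error-feedback version alternates between neighboring levels so that the local average stays close to the input. That preserved average is what makes the original value recoverable by smoothing.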

Affiliations: School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing, China; State Key Laboratory of Internet of Things for Smart City and the Department of Electromechanical Engineering, University of Macau, Macau, China; Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, Fujian, China; School of Mechanical and Aerospace Engineering, Nanyang Technological University, Jurong West, Singapore; State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, China

Abstract:
As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

Abstract:
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on resource-limited devices. Binarized Neural Networks (BNNs) offer a potential solution by substantially reducing computational and memory requirements. However, their performance decreases notably compared to full-precision networks. In addition, it is challenging to enhance the performance of the binarized model by increasing the number of binarized convolutional layers, which limits its practicability for 3D occupancy prediction. In this paper, we reconsider the components and structures of binarized convolutional layers for the 3D occupancy prediction task. Two original insights into binarized convolution are presented, substantiated with theoretical proofs: (a) 1×1 binarized convolution introduces minimal binarization errors as the network deepens, and (b) binarized convolution is inferior to full-precision convolution in capturing cross-channel feature importance. Building on the above insights, we propose a novel binarized deep convolution (BDC) unit that significantly enhances performance, even when the number of binarized convolutional layers increases to meet the requirements of 3D occupancy networks. Specifically, in the BDC unit, additional binarized convolutional kernels are constrained to 1×1 to minimize the effects of binarization errors. Further, we propose a per-channel refinement branch to reweight the output via first-order approximation. Then, we partition the 3D occupancy networks into four distinct convolutional modules, employing BDC units to explore the effects of binarizing each of these modules. The proposed BDC unit minimizes binarization errors and improves perceptual capability, meeting the stringent requirements for accuracy and computational efficiency in 3D occupancy prediction.
Extensive quantitative and qualitative experiments demonstrate that the proposed BDC unit achieves state-of-the-art performance in 3D occupancy prediction and 3D object detection tasks, while significantly reducing parameters and computational costs. This highlights the potential of the BDC unit as an efficient fundamental component in binarized 3D occupancy networks. Code for our paper will be released at https://github.com/zzk785089755/BDC
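
To make the notion of a 1×1 binarized convolution concrete, here is a minimal sketch. It is not the BDC unit: the sign binarization with a per-output-channel scale (mean absolute weight, in the style of XNOR-Net), the function names, and the nested-list tensor layout are assumptions chosen for illustration.

```python
def sign(x):
    """Binarize a real value to +1.0 or -1.0 (zero maps to +1.0)."""
    return 1.0 if x >= 0 else -1.0


def binarize_weights(weight):
    """weight: [out_ch][in_ch] real-valued 1x1 kernel.

    Returns the sign matrix and one scale per output channel
    (assumed here to be the mean absolute weight of that channel).
    """
    signs = [[sign(w) for w in row] for row in weight]
    scales = [sum(abs(w) for w in row) / len(row) for row in weight]
    return signs, scales


def binarized_conv1x1(feature, weight):
    """feature: [in_ch][H][W]; returns [out_ch][H][W].

    A 1x1 convolution mixes channels independently at every pixel, so the
    binarized version is a scaled sum of +1/-1 products per pixel.
    """
    signs, scales = binarize_weights(weight)
    in_ch, height, width = len(feature), len(feature[0]), len(feature[0][0])
    output = []
    for o in range(len(weight)):
        plane = [
            [
                scales[o] * sum(signs[o][c] * sign(feature[c][y][x]) for c in range(in_ch))
                for x in range(width)
            ]
            for y in range(height)
        ]
        output.append(plane)
    return output


if __name__ == "__main__":
    feature = [[[0.5, -1.0]], [[-2.0, 3.0]]]   # 2 channels, 1x2 spatial size
    weight = [[0.2, -0.4]]                     # 1 output channel
    print(binarized_conv1x1(feature, weight))
```

In a real binarized network the inner sum would be computed with XNOR and popcount on packed bits; the floating-point loop here only shows the arithmetic being approximated.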

Abstract:
Automatic emotion recognition has attracted significant attention due to its potential applications in various real-world scenarios. Methods that integrate visual and audio modalities have become increasingly prominent because of their superior information-carrying capacity and complementarity. Despite advancements in feature fusion between video and audio, existing modality fusion-based methods struggle to effectively address the dynamic changes in feature quality caused by interference, which is common in in-the-wild emotion recognition tasks. To overcome this limitation, we propose a Parameter-Free Feature Evaluation and Interaction (PFFEI) model based on information quality assessment. The model leverages the scaling factor $\gamma$ of the normalization layer to evaluate information quality and dynamically adjusts the degree of interaction between modalities, suppressing the impact of low-quality features affected by interference. Additionally, the norm constraint integrated into the model ensures that the $\gamma$ value consistently measures feature quality across different modalities. This approach effectively mitigates the effects of modality imbalance and significantly enhances the model’s accuracy. The effectiveness of our method is demonstrated through experiments on three challenging real-world emotion datasets: DFEW, AFEW, and Ekman6. The results show that the PFFEI model outperforms state-of-the-art methods, achieving significant improvements of 8.71% (UAR) and 8.61% (WAR) on the AFEW database.

Abstract:
Deepfakes can generate highly realistic fake images and videos, which may be used to spread false information, manipulate public opinion, and pose serious threats to individual privacy and social stability. In recent years, researchers have proposed proactive defense methods to disrupt the output of Deepfakes by adding adversarial perturbations to the original data. However, the added perturbations are often not robust to common image compression operations, which severely limits the practical application of these proactive defense methods in the real world. In this paper, we propose a new method for adaptively adding adversarial perturbations in the discrete cosine transform (DCT) domain, which can resist various compression operations in real-world scenarios. Specifically, DCT coefficients that remain stable during the compression process and have a significant impact on the output of Deepfakes are selected to add perturbations. In addition, a new perceptual loss is introduced to enhance the visual quality of adversarial examples while preserving their robustness against lossy image compression. Extensive experimental results show that our method has strong robustness and effective defense capabilities against various compression operations, including Joint Photographic Experts Group (JPEG) compression and other compression operations applied by online social networks (OSNs) in the real world. Furthermore, it can significantly improve the visual quality of adversarial images compared to previously proposed DCT-based perturbation methods.

Abstract:
Pose estimation using visual sensors has become a fundamental component in robotic navigation and autonomous driving systems. Learning-based monocular visual odometry (VO) has attracted substantial attention due to its resilience to camera parameter variations and dynamic environments. Given that camera movement manifests as pixel-level motion across the entire image in optical flow data, capturing both global contextual information and local feature details is crucial for accurate pose estimation. To address this challenge, we propose SwinFVO, a novel self-supervised visual odometry framework that incorporates enhanced motion perception to achieve global spatial dependency modeling with temporal continuity. Leveraging quadrant-based motion characteristics, we perform cross-regional feature interaction through a refined Swin Transformer architecture. Two robust spatiotemporal feature extractors are designed to extend the single-frame-based Swin Transformer to a temporally-aware framework for sequential understanding. Through the exploration of long-range spatial correlations and preservation of temporal consistency, SwinFVO delivers accurate and consistent pose estimation. Extensive experiments across multiple datasets demonstrate the superior performance and generalization capability of SwinFVO in both pose and depth estimation tasks. It achieves competitive results against classical algorithms and outperforms related state-of-the-art (SOTA) methods by up to 20.6% and 72.4% on average translational and rotational evaluations, respectively.

Abstract:
Anomaly segmentation has been widely applied to the diagnosis of medical organs and lesions and the detection of industrial defects. However, existing methods still face challenges in extracting discriminative image features and utilizing semantic information. To address these issues, we propose a Text-Prompted Dual-Path Convolution-Mamba Network (TPCM-SegNet), which integrates Residual Double-Convolution Blocks (RDCBs) and Mamba-Transformer Blocks (MTBs) in two parallel paths to extract local and global features, respectively. Given a pair of RDCB and MTB at the same stage, a Feature Fusion Block (FFB) is introduced to facilitate the interaction and fusion of the features extracted by these blocks. Furthermore, we fuse the text tokens extracted from a textual description with the image features extracted by each of those blocks through a Text Prompt Block (TPB), to enhance the semantic understanding ability of the network. A Cascade Feature Block (CFB) is also designed for each stage of the encoder, to combine the feature maps, the logit maps decoded from them, and the input image. This block incorporates the prior and original characteristics into the image representation. Experimental results demonstrate that our TPCM-SegNet achieves superior, or at least comparable, performance relative to baselines across eight publicly available datasets. We attribute these promising results to the strong image representation and semantic understanding abilities of the proposed network.

Abstract:
3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats. Specifically, we present Static and Dynamic GSCodec: Static GSCodec achieves competitive 3D Gaussian Splat rate-distortion performance with low decoding complexity, while Dynamic GSCodec delivers advanced 4D Gaussian Splat compression performance. The code for our framework is publicly available at https://github.com/JasonLSC/GSCodec_Studio, to advance the research on Gaussian Splats compression.

Abstract:
Autoencoder-based neural compression methods leverage expressive models to fit large datasets but often incur considerable decoding complexity. Recently, overfitted codecs with reduced decoding complexity have gained attention as an alternative. However, they usually require access to entire videos or multiple frames simultaneously for encoding, resulting in substantial system delays. To address these limitations, we propose CNVC, a compact neural video codec that employs instance-level adaptation for efficient and flexible video compression. CNVC is fully overfitted (each frame is optimized independently using up to 45k iterations for maximum performance in this paper), building on the COOL-CHIC video model with substantial architectural and training enhancements. At a decoding complexity of just 1300 MACs per pixel, CNVC provides a more compact solution than previous autoencoder-based and overfitted codecs. Additionally, CNVC inherits the frame-wise overfitting mechanism of COOL-CHIC video, enabling flexible encoding configurations (e.g., low-delay). In terms of compression efficiency, CNVC achieves significant bitrate reductions on HEVC and UVG datasets compared to COOL-CHIC video. To our knowledge, CNVC is the first compact neural codec to match HEVC (x265 slow setting) performance at such a decoding complexity level.
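To put the quoted decoding complexity of 1300 MACs per pixel in perspective, the short sketch below converts it into per-frame and per-second totals. The 1920×1080 resolution and the 30 fps frame rate are illustrative assumptions, not values stated in the abstract.

```python
# Back-of-the-envelope decoding cost derived from a per-pixel MAC budget.
# Only MACS_PER_PIXEL comes from the abstract; resolution and frame rate
# used in the demo below are assumptions chosen for illustration.
MACS_PER_PIXEL = 1300


def macs_per_frame(width: int, height: int, macs_per_pixel: int = MACS_PER_PIXEL) -> int:
    """Total multiply-accumulate operations needed to decode one frame."""
    return width * height * macs_per_pixel


def macs_per_second(width: int, height: int, fps: float,
                    macs_per_pixel: int = MACS_PER_PIXEL) -> float:
    """Sustained MAC rate needed to decode at the given frame rate."""
    return macs_per_frame(width, height, macs_per_pixel) * fps


if __name__ == "__main__":
    frame_cost = macs_per_frame(1920, 1080)
    print(f"{frame_cost / 1e9:.2f} GMACs per frame")
    print(f"{macs_per_second(1920, 1080, 30.0) / 1e9:.1f} GMACs per second")
```

Under these assumed settings a single frame costs roughly 2.7 GMACs to decode.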

Abstract:
The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a new paradigm to reconstruct 3D scenes. Using neural networks trained on large-scale multi-view datasets, it can directly infer 3DGS representations from sparse input views. Although the feedforward approach achieves high reconstruction speed, it still suffers from the substantial storage cost of 3D Gaussians. Existing 3DGS compression methods relying on scene-wise optimization are not applicable due to architectural incompatibilities. To overcome this limitation, we propose TinySplat, a complete feedforward approach for generating compact 3D scene representations. Built upon standard feedforward 3DGS methods, TinySplat integrates a training-free compression framework that systematically eliminates key sources of redundancy. Specifically, we introduce View-Projection Transformation (VPT) to reduce geometric redundancy by projecting geometric parameters into a more compact space. We further present Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy by aligning feature energy along dominant viewing directions via basis transformation. Lastly, spatial redundancy is addressed through an off-the-shelf video codec. Comprehensive experimental results on multiple benchmark datasets demonstrate that TinySplat achieves over 100× compression for 3D Gaussian data generated by feedforward methods. Compared to the state-of-the-art compression approach, we achieve comparable quality with only 6% of the storage size. Meanwhile, our compression framework requires only 25% of the encoding time and 1% of the decoding time.

Abstract:
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. In particular, two dominant families of techniques have emerged: i) multi-modal large language models (LLMs), which demonstrate impressive ability for multi-modal understanding; and ii) diffusion models, which exhibit remarkable capability for multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusion models, and their unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions that may contribute to the ongoing advancement of multi-modal generative AI.

Abstract:
With the rapid emergence of various generative models, generated images have increasingly become a prominent data medium on social platforms, making up a significantly higher proportion and providing fertile ground for steganography. However, research on steganography for generated images remains limited, and the distinctive attributes, especially the reproducibility of text-to-image (TTI) models, have not been effectively leveraged. In this paper, we propose GIANT (Generated Image Adversarial steganography based on Narrowed Targeting), a novel adversarial steganography framework for generated images that employs narrowed targeting to focus on embedding the secret message solely in the secure region and synchronizing the position to enhance the steganography security. GIANT achieves narrowed targeting by leveraging the reproducibility of TTI models and fusing two regions: 1) the minimal distortion region, which is localized by measuring steganographic distortion to evaluate the impact of modifications on the cover image distribution, and 2) the critical attention region, which is localized by using coarse-grained and fine-grained attention maps to evade steganalysis detection. Additionally, for positional synchronization of the secure region, the related prompts are transmitted alongside the stego image, allowing the receiver to reconstruct the cover image using a shared key and the provided prompt. Experimental results demonstrate that GIANT significantly improves security compared to conventional and adversarial steganographic methods designed for natural images, effectively countering state-of-the-art steganalyzers.

Affiliations: School of Computer Science and Technology and Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan University of Science and Technology, Wuhan, China; School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract:
Talking head generation aims to synthesize high-quality and lip-synchronized talking head videos from the given portrait images and audio. However, previous methods directly learn the alignment between lip movements and the driving audio while paying little attention to the fidelity and continuity of the generated videos, and thus suffer from visual distortions and jitter. To deal with this issue, we propose to promote the consistency of audio and image by exploring their spatiotemporal relations, and construct a Mamba-based spatiotemporal fusion scheme. Specifically, we devise an Intra-frame Mamba module to characterize facial features from the source image, which encourages content consistency between the generated frame and the current source frame. Meanwhile, an Inter-frame Mamba module is designed to excavate the complementary information across sequential frames, which provides clues for better motion simulation. The aggregated spatiotemporal representation, together with audio features, is then aligned by a deformation network to alleviate visual distortions and jitter. In addition, we investigate practical composite constraints on the structure, details, and motion aspects, involving the keypoint constraint, multi-scale content constraint, and displacement constraint, to promote training stability and model performance. With the above strategies, we construct a novel Talking Head Mamba network, termed TH-Mamba, for high-quality talking head generation. Extensive experiments on the HDTF and Mead-Neutral datasets verify the superiority of our proposed TH-Mamba, which significantly outperforms the current state-of-the-art method by 0.78 dB and 1.55 dB in PSNR, respectively. The demo is available at https://github.com/YZX-codesky/TH-Mamba.
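For readers unfamiliar with the PSNR metric in which the reported gains are expressed, the following is a minimal sketch of the standard definition, PSNR = 10·log10(MAX² / MSE). It operates on flat sequences of pixel values and assumes an 8-bit peak of 255; it is a generic illustration and not the evaluation code used for TH-Mamba.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [0.0, 0.0, 0.0, 0.0]
    # Halving the per-pixel error raises PSNR by about 6.02 dB.
    print(psnr(ref, [10.0] * 4) - psnr(ref, [20.0] * 4))
```

Because the scale is logarithmic, a gain of about 6 dB corresponds to halving the root-mean-square error, which gives a rough sense of what sub-2 dB improvements mean.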

Affiliations: School of Astronautics, Beihang University, Beijing, China; School of Astronautics, the National Key Laboratory of High-Efficiency Earth-Space Round-Trip Transportation Technologies, and the National Key Laboratory of Integrated Air-Ground Navigation Technologies, Beihang University, Beijing, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; Central Research Institute, United Imaging Healthcare Company Ltd., Beijing, China

Abstract:
Hyperspectral salient object detection (HSOD) aims to identify visually and spectrally distinctive regions in hyperspectral images (HSIs). However, existing HSOD methods often suffer from spectral redundancy and inefficient spatial-spectral modeling, which hinder their scalability and accuracy in complex scenes. To tackle these challenges, we propose HySaDe-Mamba, a novel HSOD framework built upon the Mamba architecture. Specifically, to address information redundancy in HSI, we design a spatial-enhanced spectral-embedding (SeSe) module, which maps high-dimensional data into a more compact but effective representation. On the compact SeSe representation features, we further propose a Bi-scale spatial and Bi-directional spectral (BsBd) Mamba module, performing the selective scanning mechanism in a spatial-spectral hybrid, end-to-end way, which not only facilitates comprehensive spatial structural interaction across both global and local scales, but also effectively exploits the underlying spectral semantic correlation. Extensive experiments on two public HSOD datasets demonstrate that our HySaDe-Mamba achieves state-of-the-art detection accuracy across seven metrics, while maintaining an efficient inference speed of 40.22 FPS. The source code is publicly available at https://github.com/Lee-zl/HySaDe-Mamba

Abstract:
Video anomaly detection (VAD) is important in many fields because of its theoretical and practical value. One of the challenges in VAD is the difficulty of obtaining segment-level labels due to the high annotation cost. In recent years, researchers have adopted video-level labels as a form of weak supervision, leading to the development of weakly supervised video anomaly detection (WS-VAD). Among different WS-VAD approaches, graph convolutional networks (GCNs) have attracted much attention, since they can model relationship information in video data. Typically, the relationship represented by the graph edges is the class label similarity, and this similarity is built from the feature similarity and temporal consistency among video segments. The more information about class label similarity is provided, the higher the performance of a GCN tends to be. In real-world VAD scenarios, anomalies exhibit several unique properties such as diversity and rarity. These properties may lead to the following situation: given two video segments, although their feature similarity is low and their time separation is large, both of them are anomalies, that is, they have the same class label. Normal samples can encounter the same situation. However, the existing graph structures in GCN methods do not adequately account for this situation. To address this issue, this paper proposes an extended graph learning (EGL) method that incorporates additional class label similarity among video segments. The proposed EGL includes two extended graph convolutional networks (EGCNs): a spatial EGCN and a temporal EGCN. To capture more accurate information about class label similarity, EGL incorporates a feedback module to update the graph structures of the EGCNs. EGL can effectively extract more information about class label similarity, thereby ensuring good performance when training data is scarce.
Experimental results highlight the advantages of the proposed EGL method, particularly with limited training samples. In particular, when only 30% of the training data is used, EGL achieves the best performance of 95.55% AUC on ShanghaiTech, 81.29% AUC on UCF-Crime, and 75.31% AP on XD-Violence, outperforming the existing VAD methods by up to 6.86%, 3.23%, and 4.40%, respectively.
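The conventional graph construction that the abstract contrasts with EGL (edges from feature similarity and from temporal consistency) can be sketched as follows. This is a minimal illustration only: the cosine measure, the similarity threshold, and the exponential temporal kernel are assumptions, and the sketch does not implement the EGL method itself.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_similarity_graph(features: Sequence[Sequence[float]],
                             threshold: float = 0.5) -> List[List[float]]:
    """Adjacency matrix keeping only edges whose cosine similarity exceeds the threshold."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim = cosine_similarity(features[i], features[j])
            adj[i][j] = sim if sim > threshold else 0.0
    return adj


def temporal_consistency_graph(num_segments: int, sigma: float = 1.0) -> List[List[float]]:
    """Adjacency matrix whose edge weights decay with the temporal distance |i - j|."""
    return [[math.exp(-abs(i - j) / sigma) for j in range(num_segments)]
            for i in range(num_segments)]


if __name__ == "__main__":
    # Segments 0 and 2 have dissimilar features and are far apart in time,
    # so both graphs give them a weak or missing edge even if they share a label.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(feature_similarity_graph(feats)[0][2])
    print(temporal_consistency_graph(3)[0][2])
```

In this toy example the first and last segments receive no feature edge and only a weak temporal edge, which is the gap that additional class label similarity is meant to close.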

Abstract:
Active Learning (AL) aims to reduce data annotation costs by selecting the most informative samples from an unlabeled data pool. Traditional AL methods often rely on a single model snapshot to identify uncertain or representative samples, overlooking the poor generalization of a single model. Recent AL studies have attempted to address this issue by tracking a broader range of training dynamics for data selection, typically by averaging or accumulating them. However, both our theoretical and experimental analyses reveal that these methods obscure the variability inherent in the training process, potentially prioritizing hard-to-learn samples that result in poor generalization. In this paper, we propose a novel AL method, termed Dynamic Confidence Variance (DCoV), that integrates variability with the training dynamics to effectively identify a well-generalized coreset. DCoV leverages the variance of the model’s prediction confidence throughout the training process for active sampling and model training. Our theoretical analysis demonstrates that DCoV provides a lower bound on the population risk of the model learned from the selected labeled subset, spanning the entire training process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art AL methods on various balanced and imbalanced benchmark datasets across various modalities.
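The idea of scoring samples by the variance of prediction confidence across training, rather than by its average, can be sketched as below. The data layout (a per-sample list of confidences recorded at successive epochs) and the rule of selecting the highest-variance samples are assumptions made for illustration; the abstract does not specify DCoV's exact selection criterion.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def confidence_variance(history: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Population variance of each sample's prediction confidence over training."""
    return {sample_id: pvariance(confs) for sample_id, confs in history.items()}


def select_by_variance(history: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Pick the `budget` samples with the largest confidence variance (assumed rule)."""
    scores = confidence_variance(history)
    ranked = sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical confidences recorded at three epochs for three unlabeled samples.
    history = {
        "a": [0.5, 0.5, 0.5],  # steady confidence
        "b": [0.2, 0.5, 0.8],  # same mean as "a", but highly variable
        "c": [0.9, 0.9, 0.9],  # steady and confident
    }
    print(mean(history["a"]), mean(history["b"]))  # averaging cannot tell them apart
    print(select_by_variance(history, budget=1))
```

Samples "a" and "b" share the same mean confidence, so an averaging criterion treats them alike, while the variance separates them.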

Abstract:
Semantic communication (SC) enables efficient information exchange by transmitting compact semantic representations rather than raw data, benefiting applications like autonomous driving and medical diagnosis. However, existing copyright protection methods face two key limitations: traditional transform-domain watermarking fails during semantic extraction, while deep learning-based methods lose robustness when integrated with SC. Most critically, existing solutions cannot protect semantic information itself, the core intellectual property in SC. To address these issues, we propose “Dual-stage Robust Semantic Watermarking” (DRSW), a framework that simultaneously protects the copyright for both semantics and reconstructed images. By embedding a watermark into the frequency domain of semantics, DRSW exhibits high robustness against possible channel noises while preserving semantic consistency and maintaining the reconstruction quality of images. Our work provides a new watermarking paradigm for future copyright protection in SC scenarios.

Abstract:
The recent rapid development of video generation technology has led to a significant demand for quality assessment of the latest AI-generated videos. However, current supervised approaches depend on expensive and quickly outdated human scores, and label-free methods overlook the general distortions of AI-generated videos. To address these limitations, we introduce LMVQ, a Label-free Metric-learning framework for general AI-generated Video Quality assessment across three dimensions: spatial, temporal, and alignment. LMVQ is the first to introduce sample degradations specially designed for AIGC-specific distortions, and it constructs a comprehensive training set through two complementary sample generation strategies. It then employs two synergistic modules: the Intra-Quality Token Transformer (IQ-Trans), which explicitly refines dimension-specific quality representations, and the Inter-Quality Mixture of Experts (IQ-MoE), which fuses interactions across multiple quality dimensions. Finally, a Multi-Proxy Metric-Learning (MPML) strategy aligns the learned representations with multi-dimensional quality scores and constrains the model to learn discriminative quality-aware representations. Extensive experiments on four public AIGC-VQA benchmarks show that MPML outperforms previous label-free methods by over 20%, and greatly narrows the gap with supervised methods. This provides a scalable, adaptive foundation for evaluating the ever-evolving quality of AI-generated videos.

Abstract:
Crowd counting in congested scenarios remains challenging because it must handle “intra-domain bias”, the significant variation in crowd density across regions within each image. In this study, we propose a novel Discrepancy-Controlled Region-Adaptive Learning (DC-RA) method, which leverages a divide-and-conquer strategy to transform the complex problem of image-level crowd counting into a series of more manageable regional tasks. Specifically, we propose a Discrepancy-Controlled Adaptive Partition (DCAP) module to divide each image into regions that adapt to the varying density levels, with the partition controlled by the discrepancy of crowd density. To specialize features for each region, the Region-wise Adaptive Learning (RAL) module is then introduced by incorporating the Mixture-of-Experts (MoE) framework, which uses a routing module to select the most suitable expert for each region. This dynamic selection process ensures that each region benefits from tailored optimization based on its specific characteristics, leading to more precise density estimates. To ensure that each expert captures the distinct characteristics of various regions, we further incorporate a region-level counting loss for optimization. Experiments show that DC-RA reduces the Mean Absolute Error (MAE) by 2.5 and 4.1 compared with the state-of-the-art method on JHU-CROWD++ and NWPU, respectively, significantly enhancing the model’s robustness and accuracy across varying crowd densities.
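The region-wise routing described above, together with the MAE metric used for evaluation, can be sketched as follows. This is a minimal illustration under stated assumptions: experts are hypothetical callables that map a region's feature values to a count, routing is top-1 over given router scores, and the image-level count is the sum of regional counts. None of these details are confirmed by the abstract beyond the use of a router that selects one expert per region.

```python
from typing import Callable, List, Sequence

# Hypothetical expert type: maps a region's feature values to a predicted count.
Expert = Callable[[Sequence[float]], float]


def route_top1(router_scores: Sequence[float]) -> int:
    """Index of the expert with the highest router score for one region."""
    return max(range(len(router_scores)), key=lambda k: router_scores[k])


def count_image(regions: Sequence[Sequence[float]],
                router_scores: Sequence[Sequence[float]],
                experts: Sequence[Expert]) -> float:
    """Sum of regional counts, each predicted by the expert chosen for that region."""
    total = 0.0
    for region, scores in zip(regions, router_scores):
        expert = experts[route_top1(scores)]
        total += expert(region)
    return total


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute difference between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Two toy experts standing in for sparse-region and dense-region specialists.
    experts: List[Expert] = [lambda r: sum(r), lambda r: 2.0 * sum(r)]
    regions = [[1.0, 2.0], [3.0]]
    scores = [[0.9, 0.1], [0.2, 0.8]]
    print(count_image(regions, scores, experts))
    print(mean_absolute_error([9.0, 5.0], [10.0, 3.0]))
```

A region-level counting loss, as mentioned in the abstract, would apply an error of this kind per region rather than only to the image-level total.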

Abstract:
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects and maintain their identities referred by a language expression in a video. This intricate task involves the reasoning of linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work relies on loose feature fusion and neglects long-term information. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both encoding and decoding stages to fully exploit the advantages of Transformer architecture. Specifically, we incrementally perform cross-modal fusion layer-by-layer during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the desired objects. Moreover, we introduce a query update module that explicitly leverages temporal prior information of the tracked objects to enhance the consistency of their trajectories. In addition, we introduce a novel task called Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge compared to the typical single mask in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.

Abstract:
Semi-supervised medical image segmentation (SSMIS) mainly leverages valuable information from unlabeled data to complement the limited labeled guidance. Incorporating SAM, with its excellent generalization capabilities, enhances the learning process from unlabeled data, as demonstrated by existing methods. However, most current SAM-based methods in SSMIS focus on unique prompt design, while the prompts generated for unlabeled data through pseudo-labels unavoidably introduce noise, limiting the following decoding process. In this paper, we propose a Cross-Hierarchical Decoding (CHD) process for SAM, which removes explicit prompts (e.g., point or box) and thus mitigates the influence of inaccurate pseudo labels. Specifically, our CHD is a two-stage decoder. The first stage uses the original decoder in SAM to generate probability masks, which are combined with a learnable mask interaction module in the second stage to achieve more fine-grained segmentation. Meanwhile, to remove the restriction that the original SAM can only segment foreground-background categories, we design a cross-class correlation module in CHD to capture class-wise interrelationships between different classes, thus achieving multi-class segmentation. Extensive experiments show that CHD achieves new state-of-the-art performance for SSMIS, significantly improving different baselines.

Abstract:
Neural Radiance Field (NeRF) technology has made significant strides in creating novel viewpoints. However, its effectiveness is hampered when working with sparsely available views, often leading to performance dips due to overfitting. FreeNeRF attempts to overcome this limitation by integrating implicit geometry regularization, which incrementally improves both geometry and textures. Nonetheless, an initial low positional encoding bandwidth results in the exclusion of high-frequency elements. The quest for a holistic approach that simultaneously addresses overfitting and the preservation of high-frequency details remains ongoing. This study presents a novel feature-matching-based sparse geometry regularization module, enhanced by a spatially consistent geometry filtering mechanism and a frequency-guided geometric regularization strategy. This module excels at accurately identifying high-frequency keypoints, effectively preserving fine structural details. Through progressive refinement of geometry and textures across NeRF iterations, we unveil an effective few-shot neural rendering architecture, designated as SGCNeRF, for enhanced novel view synthesis. Our experiments demonstrate that SGCNeRF not only achieves superior geometry-consistent outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB in PSNR on LLFF and DTU.

Abstract:
Remote photoplethysmography (rPPG) is a critical technique for non-contact monitoring of human vital signs using facial video data. Most of the existing rPPG approaches, either supervised ones relying on ground-truth physiological signals or less constrained unsupervised ones, primarily address the problem of inaccurate physiological measurements under normal lighting conditions. However, few works focus on handling physiological measurements in extremely low-light scenarios. To this end, we propose an unsupervised geometric-physiological domain anchoring for low-light rPPG measurement (UDA-rPPG). Firstly, we develop a geometric anchoring video enhancement module (GAEM) that can enhance video brightness while preserving rPPG signals, achieving accurate geometric-domain face anchoring. Secondly, we introduce a low-light stable spatial-temporal network (LS-Phys), which focuses on high-frequency information to mitigate noise in low-light scenarios. Finally, a novel highest-peak priority learning strategy is presented to learn physiological-domain rPPG signal anchoring by emphasizing peak information, which enhances the robustness of rPPG measurements in low-light environments. Additionally, we construct a comprehensive low-light rPPG dataset (LRPD) that contains both visible and near-infrared videos under low-light scenarios. Extensive experiments demonstrate the superior performance of our approach over state-of-the-art unsupervised rPPG methods in different light conditions and verify the generalization of UDA-rPPG on cross-dataset testing. Our code and dataset are available at https://github.com/wwenmaositu/LS-rPPG-LRPD

Abstract:
Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: 1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and 2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M3ISAs to enable efficient vision-language alignment for REC. Empirical results reveal that M2IST achieves better performance-efficiency trade-off than full fine-tuning and other PETL methods, requiring only 2.11% tunable parameters, 39.61% GPU memory, and 63.46% training time while maintaining competitive performance. Our code is released at https://github.com/xuyang-liu16/M2IST.

Abstract:
Surgical phase recognition is a critical technology in assistive therapy, aiding physicians in enhancing surgical efficiency and postoperative assessment. Currently, deep learning approaches have been extensively applied to the task of surgical phase recognition. However, methods that rely solely on spatial features are prone to phase shaking. Notably, approaches based on Temporal Convolutional Network (TCN), Transformers or Mamba have demonstrated significant efficacy in handling temporal features. Nevertheless, existing methodologies face several challenges: 1) there is currently no method that integrates all three of the aforementioned approaches simultaneously to leverage their respective advantages; and 2) previous attention mechanisms primarily reduce computational costs through local attention, sparse attention, downsampling, which may inadvertently compromise the performance of attention. To address these gaps, we propose a TCN-Residual Mamba (RMamba)-Attention Network (CMANet), which comprises two key components: spatial feature extraction and prediction. Our main innovation lies in the prediction component, which is further divided into two prediction stages. Both prediction stages employ the same structure, organically integrating TCN, RMamba, and differential attention (DA) mechanisms. TCN and RMamba exhibit linear complexity in long sequence modeling and achieve better performance at a lower cost. In contrast, DA employs a differential mechanism to drastically minimize computational costs while demonstrating superior performance in long-sequence tasks requiring strong coping or contextual learning capabilities. The efficacy of the proposed method is validated on two benchmark datasets for surgical phase recognition, with experimental results demonstrating its effectiveness.

Abstract:
Recently, RGB-T tracking has received increasing attention due to its robustness. However, existing RGB-T trackers mainly use cross-attention for modal feature interaction, limiting the utilization of complementary information. In addition, these trackers employ fixed dominant-auxiliary paradigms for feature fusion, ignoring modal quality fluctuations. To address these issues, we propose FMTrack, an effective framework for fully capturing complementary information. FMTrack consists of two key components, a frequency-aware interaction network (FIN) and a multi-expert fusion module (MEFM). To emphasize the valuable information in each modality, FIN utilizes frequency masks to perform high-pass and low-pass filtering on RGB and TIR data. FIN explicitly establishes cross-modal interactions via frequency domain learning, which facilitates the sharing of complementary information. Besides, MEFM extracts diverse features via the differentiated expert network and then adjusts feature combinations according to modal reliability, achieving deep understanding and flexible fusion of multimodal data. With FIN and MEFM, FMTrack makes full use of the advantageous information of each modality to highlight target representations, thus improving performance in complex scenes. Extensive experiments on four popular RGBT tracking datasets (LasHeR, VTUAV, RGBT234, and RGBT210) show that our FMTrack achieves leading performance. The code is available at https://github.com/xyl-507/FMTrack

Abstract:
Image superpixel segmentation has greatly benefited from the excellent feature extraction capabilities of neural networks. However, most existing neural network-based superpixel segmentation methods require large amounts of labeled data for training, which limits their generalization and practical applications. To address this issue, this paper introduces a novel unsupervised network named UURS for robustly generating accurate superpixels from a single image without requiring an image collection (untrained) and any labeled data (unsupervised). First, an efficient feature extraction module based on Lipschitz-controlled convolutional layers and a channel attention mechanism is proposed to leverage deep image prior for capturing discriminative features. Furthermore, an image reconstruction loss combining structure- and pixel-level similarity is designed to improve feature accuracy. Subsequently, a classification module generates superpixels from these features. The network achieves optimal performance through an adaptive stopping criterion that prevents underfitting or redundant computation. We validate UURS through extensive experiments on standard image segmentation benchmarks. Additionally, synthetic experiments with diverse noise distributions in sRGB space and evaluations on real-world noisy image datasets demonstrate the network’s robustness. Results show that our method outperforms state-of-the-art approaches in boundary adherence and maintains robustness across both clean and noisy conditions.

Abstract:
Images captured in harsh environments often exhibit blurred details, reduced contrast, and color distortion, which hinder feature detection and matching, thereby affecting the accuracy and robustness of homography estimation. While visual enhancement can improve contrast and clarity, it may introduce visual-tolerant artifacts that obscure the structural integrity of images. Considering the resilience of semantic information against environmental interference, we propose a semantic-driven feature enhancement network for robust homography estimation, dubbed SeFENet. Concretely, in our homography estimation network, the Target Aware Homography Estimation Module (TAHEM), we first introduce an innovative hierarchical scale-aware module to expand the receptive field by aggregating multi-scale information, thereby effectively extracting the image’s structural features under diverse harsh conditions. Subsequently, we employ a Semantic Extraction Module to extract multi-scale semantic features from the input images. Combined with a high-level perceptual framework, this enables degradation-tolerant semantic feature extraction. Building upon this, the Semantic-Guide Meta Constraints module leverages a meta-learning training strategy to effectively fuse the semantic features with structural features. Through internal-external alternating optimization, the proposed network achieves implicit semantic-wise feature enhancement, thereby improving the robustness of homography estimation in adverse environments by strengthening local feature comprehension and context information extraction. Experimental results under both normal and harsh conditions demonstrate that SeFENet significantly outperforms SOTA methods, reducing point match error by at least 41% on the large-scale datasets.

Abstract:
This paper presents EveryBrain, a method to generate electroencephalographic (EEG) signals of visual stimuli using images. Given that individuals exhibit distinct EEG responses to the same visual stimulus, EveryBrain is capable of capturing these individual characteristics during signal generation. The framework operates in two stages. In Stage 1, by leveraging the temporal properties of EEG signals and the spatial features of images, EveryBrain presents a self-supervised framework that simultaneously reconstructs EEG signals and performs contrastive learning between image and EEG features. Furthermore, through additional training focused on individual EEG differences, Stage 2 injects an ID number (representing a specific person) into image features via a cross-modal projector. The resulting personalized EEG latent codes, supervised by the Stage 1 encoder, are then decoded into vivid, individualized EEG responses. Experiments validate the accuracy of EveryBrain in generating EEG signals for various individuals in response to visual stimuli. Overall, the proposed method tackles challenges in EEG generation from images, such as cross-modal alignment, individual variability, and waveform stability, yielding promising results. Additionally, the novel approach of joint learning between images and EEG demonstrates positive effects on decoding visual neural representations. Both quantitative and qualitative evaluations demonstrate the effectiveness of the method, marking a significant step toward portable and cost-effective “image-to-thought”.

Abstract:
In the advertising and media industries, image editing often involves multiple manipulation techniques to meet creative and technical requirements. Detecting tampered regions is crucial in scenarios like legal disputes or media integrity assessments. However, existing forensic methods often target single manipulation types or treat all manipulations as one, and many deep learning approaches lack flexibility in frequency and edge extraction, limiting their effectiveness. To address these challenges, this paper proposes an ALL-IN-ONE framework for comprehensive image forensic analysis, which adopts a divide-and-conquer strategy for multi-manipulation image classification and localization. Specifically, we introduce a Multi-Frequency Band Extraction Module (MBEM) to capture richer artifact information in the frequency domain. This is complemented by an Attention Window-based Fusion Module, which fuses same-frequency features across different scales and enhances the discriminative features more effectively. To improve the localization of copy-move manipulation, we design a Copy-Move Accurate Detection Module (CADM), which leverages the visual consistency between source and target regions. Furthermore, we propose a Precise Edge Generator (PEG) as part of the Edge-Guided Progressive Fine-Tune Module (EPFM), which generates more accurate edges to enhance edge localization. To address the issue of insufficient labeled data, we construct a publicly available dataset, the Multi-Manipulation Image Dataset (MMID), consisting of 2,000 multi-manipulation images, each containing at least two types of forgeries. Extensive experiments are conducted, comparing our method with state-of-the-art approaches on MMID, as well as on single-manipulation datasets such as CASIA, CoMoFoD, and NIST.
The results demonstrate that MMID is effective for training discriminative models and validate that our proposed method significantly outperforms existing approaches in terms of accuracy and robustness for simultaneous forgery localization and manipulation classification.

Abstract:
Image restoration under adverse weather conditions is critical for real-world applications. However, existing approaches mainly suffer from two fundamental limitations: i) the impractical requirement of prior degradation knowledge for task-specific model selection and ii) performance degradation when handling in-the-wild corruptions. To address the above issues, in this paper, we propose a novel Meta-prior Aided Transformer restoration framework, MePAT, to synergize dynamic feature modulation with optimal transport (OT) theory. Specifically, we first architect an efficient attention mechanism, rectified self-channel attention (RSCA), to capture long-range associations along the channel dimension. Then, to adaptively tackle different conditions, we design a task-shared prior learning network (TPLN) that generates content-adaptive weather embeddings, which serve as feature modulators to direct a more flexible and robust restoration process. In addition, to learn discriminative task features, we propose a weakly supervised OT-driven contrastive loss to measure the discrepancy between different weather corruptions. During the inference process, through the shared TPLN, we derive image-oriented vectors for unseen corruptions and then perform image restoration. The superior experimental results on three synthetic benchmarks demonstrate the effectiveness of MePAT. We also conduct experiments on real-world applications to verify the generalization ability and robustness. The code and pre-trained models will be made available.

Abstract:
Unsupervised person re-identification (Re-ID) performance enhancement hinges on extracting the most informative features from unlabeled person datasets. In recent approaches, proxy-based contrastive learning with awareness of camera labels has been adopted for model training, thereby achieving highly promising results. However, inappropriate selections of contrastive pairs can significantly degrade the performance of these models. To address this issue, we propose the Optimal Proxy Mining Contrastive Network (OPMCN), a novel framework designed to strategically optimize the selection of proxies for positive and negative pair formation, thus enhancing the efficacy of contrastive training. The OPMCN framework proposes two specific contrastive losses: Hardest Camera Proxy Mining (HCPM) and False Negative Proxies Mining (FNPM), each essential for enhancing model performance in unsupervised settings. The HCPM loss targets proxies from the most challenging cameras to maximize semantic differences between pairs while ensuring minimal background shifts. In contrast, the FNPM loss counters noise in pseudo labels by prioritizing similarity rankings over clustering results to effectively identify and correct false negatives among proxies. Moreover, we have developed the Pyramid Kernel Global Context (PKGC) block, which employs an attention mechanism that focuses on identity-invariant semantic cues in instances. This module utilizes optimally sized convolutional kernels to enhance identity recognition consistency across camera-based variations, thereby improving the precision of feature extraction. Experimental results on several popular datasets prove that our work surpasses existing unsupervised person Re-ID approaches to a remarkable extent.

Abstract:
In recent years, with advancements in generative models, an increasing number of garment design methods have been proposed. A generative model capable of generating garment images from text and sketches can provide designers with valuable visual references and creative inspiration to aid in the design process. Existing multimodal garment design methods face the challenge of lacking precise control over the generated results in relation to both sketches and text. In this paper, we propose Multimodal Enhancement and Fusion Network for Garment Design (MEF-GD). Our model inputs image conditions into Stable Diffusion based on ControlNet. On one hand, directly inputting image conditions can lead to feature forgetting, defined as the phenomenon in deep neural networks where previously learned feature representations are lost. To address this issue, we propose a multiple feature injection module to more effectively enhance image condition features. On the other hand, ControlNet fuses control features into Stable Diffusion through pointwise addition, which ignores the interaction between multimodal features and results in the fused features being biased towards the control features, overlooking Stable Diffusion features. To address this limitation, we introduce content-guided attention for more effective feature fusion and improve the expression of text features. Additionally, existing datasets often contain vague textual descriptions of garments. It is difficult to train the model on such a dataset to learn accurate alignment between generated image and the textual descriptions. To address this issue, we have designed a multimodal large model text optimization module to improve the quality and clarity of text generation. Compared to existing multimodal garment design methods, MEF-GD achieves more effective alignment with both textual and sketch-based inputs in generating garment images. 
Compared to MGD, MEF-GD achieves a decrease of 2.44 in FID and an increase of 0.83 in CLIP Score on Multi-VITON-HD dataset. The code will be available at https://github.com/fengyun691340/MEF-GD

Abstract:
Inaccurate detections remain a critical bottleneck in 3D multi-object tracking (MOT). Recent detection fusion-based methods incorporate camera detections as supplementary to reduce false detections and compensate for missing ones in LiDAR. However, their unidirectional camera-LiDAR correction lacks a feedback mechanism, precluding iterative mutual refinement between modalities for more robust LiDAR-based tracking. Inspired by the coarse-to-fine strategy in two-stage object detection, we introduce CrossTracker, a novel two-stage framework for online multi-modal 3D MOT. CrossTracker first constructs coarse camera and LiDAR trajectories independently, then performs trajectory fusion using both current and historical frames, without requiring future data. This ensures more robust mutual refinement between modalities. Specifically, CrossTracker comprises three core modules: i) the multi-modal modeling (M3) module, which fuses data from images, point clouds, and even planar geometry derived from images to establish a robust tracking constraint; ii) the coarse trajectory generation (C-TG) module, which independently generates coarse trajectories for both modalities using the M3 constraint; and iii) the trajectory fusion (TF) module, which applies mutual refinement between coarse LiDAR and camera trajectories through cross correction to ensure robust LiDAR trajectories. Extensive experiments show that CrossTracker outperforms 19 state-of-the-art methods, highlighting its effectiveness in leveraging the synergistic strengths of camera and LiDAR sensors for robust multi-modal 3D MOT. The code is available at https://github.com/lipeng-gu/CrossTracker.

Abstract:
Owing to its excellent capability to extract temporal features, the transformer has been widely used in monocular 3D human pose estimation. However, due to its global perspective, it performs inadequately in extracting spatial features, which hinders breakthroughs in performance. In this paper, we propose a local-global feature fusion method based on GCN and transformer for 3D human pose estimation. Our method integrates GCN with a multiscale transformer to extract local spatiotemporal features of poses. These are then integrated with the global spatiotemporal features extracted by a vanilla transformer to reconstruct 3D human poses accurately. In addition, we introduce a hierarchical feature fusion method to better capture the underlying 3D pose structure. It blends deep abstract features with shallow raw features. We evaluate our model on the Human3.6M and MPI-INF-3DHP datasets, and experimental results demonstrate that our approach outperforms existing state-of-the-art methods. We achieve advanced performance on both datasets with errors of 37.7mm and 16.4mm under MPJPE, respectively. The code and model are available at https://github.com/ygx7/LG3DPose
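
The MPJPE figures quoted above are mean per-joint position errors. As a minimal sketch of how that metric is usually computed (the function name and the toy joint coordinates are illustrative assumptions, and benchmark protocols typically root-align the poses first, which is omitted here):

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: the average Euclidean distance between
    predicted and ground-truth joints, in the units of the inputs (e.g. mm)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for p, t in zip(pred, target):
        total += math.dist(p, t)  # Euclidean distance for one joint
    return total / len(pred)

# Hypothetical two-joint pose in millimetres (not data from the paper).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, target))  # 2.5
```

For a sequence, the same average is simply taken over all joints of all frames.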

Abstract:
The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 dataset. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.

Abstract:
The fusion of multi-view medical images through deep neural networks is essential for boosting diagnostic precision in the field of medical image analysis. However, the reliability of these diagnostic results is often compromised by imperfections in image views, manifested as noise, artifacts, and data deficits arising from inconsistent diagnostic frequencies. These issues introduce a significant risk when merging medical views in a clinical setting. To address these problems, we introduce the Reliability-Enhanced Multi-view Network (REMNet), a novel framework designed to tackle two critical challenges: 1) reducing misclassification and uncertainty from imperfect view integration, and 2) improving the reliability and interpretability of multi-view medical image predictions. Specifically, REMNet merges information from multiple views into a coherent evidence framework and incorporates a Dirichlet prior within our predictive model to more accurately estimate confidence in predictions. Coupled with a robust fusion strategy and a precise confidence calibration process, REMNet consolidates the diverse strengths of various medical imaging views, reduces the impact of view imperfections, and enhances the reliability of medical imaging diagnostics. The superiority of REMNet is validated through comprehensive theoretical analysis and empirical experiments on multi-view medical image datasets across different modalities.
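
To make the role of the Dirichlet prior concrete, the sketch below uses the standard subjective-logic mapping from per-class evidence to belief and uncertainty masses. This is an assumption about the general technique, not REMNet's exact formulation, which the abstract does not spell out; the function name and evidence values are hypothetical.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence e_k to belief masses, an
    uncertainty mass, and expected class probabilities, using a Dirichlet
    distribution with parameters alpha_k = e_k + 1."""
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # b_k = e_k / S
    uncertainty = num_classes / strength       # u = K / S
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob

# Hypothetical evidence from a clean view and from a noisy view.
clean = dirichlet_opinion([18.0, 1.0, 1.0])
noisy = dirichlet_opinion([0.2, 0.1, 0.1])
print(clean[1], noisy[1])  # the noisy view carries far more uncertainty
```

Belief masses and the uncertainty mass always sum to one, so a view with little total evidence is automatically reported as unreliable.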

Abstract:
Prompt tuning has emerged as an effective alternative for adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks. In our experiments utilizing prompt tuning methods, we observed that modifying the prompt initialization led to inconsistencies in the model’s predictions, particularly with pronounced variability on specific datasets. Motivated by this observation, we examine the predictive performance of two ensemble methods: prompt fusion and logits fusion. Experimental results indicate that logits fusion results in considerable performance improvements, while prompt fusion does not yield any enhancements. However, a significant downside of logits fusion is the enormous rise in inference time. To investigate a practical approach for integrating knowledge derived from multiple prompts without incurring additional inference costs, we propose a straightforward Prompt Ensemble self-Distillation (PED) framework that considerably improves the generalization capacity of prompt tuning. Specifically, we initialize multiple groups of prompts, and during the training process, we integrate the prediction outputs from each group to facilitate the learning of the fused prompts. The proposed self-distillation approach offers dual benefits: enhancing the performance of both the fused prompts and the fused logits. We utilize fused prompts for prediction during the inference process, thereby achieving performance that is comparable to that of fused logits without incurring additional inference time. We evaluate the effectiveness of our methodology across four distinct tasks. Our PED consistently demonstrates superior performance in all assessments when contrasted with numerous state-of-the-art methods. Moreover, our method can be seamlessly integrated into existing prompt learning approaches and consistently improves their performance. Our code is publicly available at https://github.com/vim-wei/PED
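
The relationship between logits fusion and the self-distillation target can be sketched as follows. Averaging the group logits and using a KL divergence as the distillation loss are assumptions made for illustration; the abstract only states that the group predictions are integrated to guide the fused prompts. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(group_logits):
    """Logits fusion: average the class logits produced by each prompt group."""
    n = len(group_logits)
    return [sum(col) / n for col in zip(*group_logits)]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits from three prompt groups for one image, three classes.
groups = [[2.0, 0.5, 0.1], [1.0, 1.5, 0.2], [2.5, 0.3, 0.4]]
teacher = fuse_logits(groups)
student = [1.0, 1.0, 0.0]  # logits obtained with the fused prompts
print(distillation_loss(teacher, student))
```

At inference only the student path (the fused prompts) is evaluated, which is why no extra inference cost is incurred compared with a single prompt.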

Abstract:
In recent years, adversarial attacks in hyperspectral image (HSI) classification have garnered increasing attention. However, existing attack methods primarily manipulate individual pixel spectra to mislead deep neural networks (DNNs) into misclassification, overlooking the physical consistency of hyperspectral data. This oversight results in adversarial samples that lack physical interpretability and suffer from low attack efficiency. To alleviate these issues, this paper proposes a sparse unmixing guided adversarial attack framework (SUGAA) to efficiently generate hyperspectral adversarial samples that satisfy physical consistency. The proposed framework first employs sparse unmixing to extract the abundance matrix of HSI, introducing adversarial perturbations to the abundance matrix to generate physically consistent adversarial samples. Additionally, SUGAA leverages the compositional similarity of materials within intra-class HSI pixels to design a class-specific perturbation generation strategy, enhancing the applicability of adversarial perturbations across pixels of the same class. To further improve optimization effectiveness, SUGAA incorporates a class-specific perturbation optimization algorithm based on momentum iterative gradients to avoid local optima, ensuring stable and efficient perturbation generation. Experimental results on real HSI datasets demonstrate that SUGAA not only generates adversarial samples with high attack performance and physical consistency but also exhibits robustness to common preprocessing transformations.
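
Perturbing abundances rather than raw spectra can be illustrated with the linear mixing model. The sketch below assumes non-negativity and sum-to-one constraints on the abundances and enforces them with a simple clip-and-renormalize step; the abstract does not say which constraints or projection SUGAA uses, and the endmembers and perturbation here are made-up values.

```python
def mix(endmembers, abundances):
    """Linear mixing model: pixel spectrum = sum_j abundance_j * endmember_j."""
    bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers))
            for b in range(bands)]

def perturb_abundances(abundances, delta):
    """Add a perturbation, clip to non-negative values, and renormalize to sum
    to one. This is a heuristic, not an exact projection onto the simplex."""
    shifted = [max(a + d, 0.0) for a, d in zip(abundances, delta)]
    total = sum(shifted)
    if total == 0.0:
        raise ValueError("perturbation removed all abundance mass")
    return [s / total for s in shifted]

# Two hypothetical materials observed in three spectral bands.
endmembers = [[0.1, 0.4, 0.8], [0.6, 0.5, 0.2]]
abundances = [0.7, 0.3]
adv_abundances = perturb_abundances(abundances, [-0.1, 0.1])
adv_pixel = mix(endmembers, adv_abundances)
print(adv_abundances, adv_pixel)
```

Because the adversarial pixel is still a mixture of the same endmembers, it stays within the span of physically plausible spectra, which is the consistency property the abstract emphasizes.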

Abstract:
In recent years, few-shot image classification has achieved substantial progress. Although existing methods have achieved promising performance, the limited availability of training data often leads to the problem of model overfitting. Model overfitting affects generalization and restricts the effective transfer of knowledge to unseen classes. Moreover, existing methods maintain independence between the image and text modalities during the encoding process, lacking mutual collaboration. This limitation restricts their ability to fully exploit task-specific semantic relationships between visual concepts and textual descriptions. To address this challenge, we propose a text-driven cross-modal feature fusion adapter (TCFF-Adapter) for few-shot image classification. TCFF-Adapter introduces two core components: a cross-modal feature fusion module that constructs joint representations by aligning image and text semantics, and a text-driven adapter that optimizes fused features and dynamically adjusts feature weights in a meta-learning paradigm. By integrating multimodal knowledge with parameter-efficient tuning, our method achieves robust generalization to unseen data without requiring additional fine-tuning. Extensive experiments on eight benchmark datasets demonstrate that the proposed TCFF-Adapter significantly outperforms various state-of-the-art few-shot image classification methods.

Abstract:
Transformer-based trackers have demonstrated remarkable advancements in real-time tracking tasks on edge devices. Since lightweight backbone networks are typically designed for general-purpose tasks, our analysis reveals that, when applied to target tracking, they often contain structurally redundant layers, which limits the model’s efficiency. To address this issue, we propose a novel tracking framework that integrates backbone pruning with Hybrid Knowledge Distillation (HKD), effectively reducing model parameters and FLOPs while preserving high tracking accuracy. Inspired by the success of MiniLM and Focal and Global Distillation (FGD), we design an HKD framework tailored for tracking tasks. Our HKD introduces a multi-level and complementary distillation scheme, consisting of Token Distillation, Local Distillation, and Global Distillation. In Token Distillation, unlike MiniLM, which distills attention via QK dot-products and V, we disentangle and separately distill Q, K, and V representations to enhance structural attention alignment for tracking. For Local Distillation, we adopt the FGD concept by incorporating spatial foreground-background masks to capture region-specific discriminative cues more effectively. In Global Distillation, we use a Vision Mamba module to model long-range dependencies and enhance semantic-level feature alignment. Our tracker HKDT achieves state-of-the-art (SOTA) performance across multiple datasets. On the GOT-10k benchmark, it demonstrates a groundbreaking 67.6% Average Overlap (AO), outperforming the current SOTA real-time tracker HiT-Base by 3.6% in accuracy while reducing computational costs by 64% and achieving 115% faster tracking speed on CPU platforms. The code and model will be available soon.

Abstract:
Single-domain generalized object detection aims to enhance a model’s generalization to multiple unseen target domains using only data from a single source domain during training. This is a practical yet challenging scenario, as it requires the model to address domain shift without incorporating target domain data into the training process. In this paper, we propose a novel phrase-grounding-based style transfer (PGST) approach for the task. Specifically, we first define textual prompts to describe objects for potential unseen target domains. Then, we leverage the grounded language-image pre-training (GLIP) model to capture the styles of these target domains and perform style transfer from the source to the target domains. The style-transferred visual features from the source domain are semantically rich and closely approximate those of their hypothetical counterparts in the target domain. Finally, we employ these style-transferred visual features to fine-tune GLIP. By introducing these imaginary counterparts, the detector can be effectively generalized to unseen target domains using only a single source domain during training. Our method significantly improves mean average precision (mAP), with an average increase of 8.8% across five diverse weather-driving benchmarks. Notably, our approach outperforms or matches the performance of domain-adaptive object detection methods, which require target domain data for training, in several challenging scenarios.

Abstract:
Hyperspectral anomaly detection (HAD) is widely used in Earth observation and deep space exploration. A major challenge for HAD is the complex background of the input hyperspectral images (HSIs), which causes anomalies to be confused with the background. In addition, most existing HAD methods require training a separate model for each HSI, resulting in poor generalization in practical applications. This paper makes the first attempt to study a new and generalizable background learning problem without labeled samples. We present a novel solution, BSDM (background suppression diffusion model), for HAD, which can simultaneously learn latent background distributions and generalize to different datasets for suppressing complex backgrounds. It has three distinguishing features: 1) For the complex background of HSIs, we design pseudo-background noise and learn the potential background distribution in it with a diffusion model (DM). 2) For the generalizability problem, we apply a statistical offset module so that the BSDM adapts to datasets of different domains without labeled samples. 3) To achieve background suppression, we improve the inference process of the DM by feeding the original HSIs into the denoising network, which removes the background as noise. Our work opens a new route to background suppression for HAD that can improve HAD performance without requiring manually labeled data. Assessments and generalization experiments of four HAD methods on several real HSI datasets demonstrate the above three unique properties of the proposed method. Our project is available at https://github.com/majitao-xd/BSDM-HAD

Abstract:
The inherent diversity of visual scenes poses a fundamental challenge in blind image quality assessment (BIQA), which has become a major obstacle to model generalization. In this study, we found that human annotations for images with different visual scenes exhibit distinct quality distribution discrepancies. Existing BIQA models tend to overfit to such diversified distributions, which in turn leads to compromised model generalizability, especially when dealing with unseen scenes in real-world scenarios. Motivated by the above facts, this paper presents a generalizable BIQA model by learning Scene-INvariant Distribution, named SIND. Specifically, we propose a distribution alignment framework to alleviate the distribution discrepancy for quality regression models, which is achieved by automatically scaling and shifting the cross-scene distributions into a unified distribution. Then, the aligned unified distribution is leveraged to supervise the model training, achieving scene-invariant and quality-aware feature representation. In addition, a token-complementary patch reasoning network is designed to extract comprehensive quality-aware features from both the image overview and detail, achieving more accurate quality prediction. Extensive experiments for both image technical- and aesthetic-quality assessment tasks show the superiority of the proposed SIND model over state-of-the-art methods. Moreover, the proposed framework is model-agnostic and can enhance model generalizability without incurring extra inference costs. The proposed method won the championship in the NTIRE 2024 Portrait Quality Assessment Challenge. Code will be available at https://github.com/ZachL1/SIND
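
The idea of scaling and shifting per-scene score distributions into one unified distribution can be sketched with fixed per-scene statistics. This is a simplification for illustration only: the abstract describes the alignment as automatic, and the actual SIND procedure may learn the scale and shift rather than compute them from the mean and standard deviation. Scene names and scores below are hypothetical.

```python
import statistics

def align_scores(scores_by_scene, target_mean=0.0, target_std=1.0):
    """Scale and shift each scene's quality scores so every scene shares the
    same mean and standard deviation (the unified target distribution)."""
    aligned = {}
    for scene, scores in scores_by_scene.items():
        mu = statistics.fmean(scores)
        sigma = statistics.pstdev(scores)
        if sigma == 0.0:
            raise ValueError(f"scene {scene!r} has constant scores")
        aligned[scene] = [target_mean + target_std * (s - mu) / sigma
                          for s in scores]
    return aligned

# Hypothetical mean opinion scores from two scenes with different ranges.
raw = {"indoor": [3.0, 3.5, 4.0, 4.5], "night": [1.0, 1.2, 1.9, 2.3]}
unified = align_scores(raw)
for scene, scores in unified.items():
    print(scene, [round(s, 3) for s in scores])
```

After alignment the within-scene ranking of images is unchanged, while the scene-dependent offset and spread that a regressor could otherwise overfit to are removed.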

Abstract:
Few-shot class-incremental learning (FSCIL) requires a model to learn the knowledge of new categories incrementally, using only a few samples, after being trained on a base session with ample categories and sample sizes. This task presents two major challenges: catastrophic forgetting and overfitting. Current approaches primarily enhance the model’s ability to extract knowledge during the base stage to improve adaptability to new tasks. Large-scale pre-trained models, known for their high robustness and zero-shot transfer capabilities, have demonstrated promising performance in FSCIL. The key to solving FSCIL lies in effectively fine-tuning such large models to balance the learning of new knowledge and the retention of old knowledge. Inspired by human-like knowledge retrieval mechanisms, we propose Class-specific Knowledge-Guided Prompt Tuning (CKGPT), which leverages class-specific prompts to guide the model in learning targeted knowledge reuse and integration effectively. When faced with novel tasks, the model selectively activates previously learned knowledge that is the most relevant, improving performance on new tasks while minimizing updates to irrelevant knowledge to reduce forgetting. By incorporating mechanisms that balance knowledge retention and transfer, CKGPT ensures a more robust adaptation to sequential tasks. Extensive experiments on multiple benchmarks validate the effectiveness of our method in achieving superior performance.

Abstract:
Hyperspectral image (HSI) reconstruction refers to the process of recovering the high-dimensional HSI signal from the measurements captured by various imaging systems. In the case of the coded aperture snapshot spectral imaging (CASSI) system, this involves recovering the HSI signal from snapshot measurements obtained using a coded aperture and disperser. However, previous methods for HSI reconstruction have been limited by the challenges of reconstructing complex high-dimensional data and the inevitable noise present in the measurements. To better capture the complex distribution of high-dimensional data and mitigate the impact of noise on reconstruction performance, this paper introduces an end-to-end approach leveraging a diffusion model, termed the hierarchically conditional diffusion model for HSI reconstruction (HDiff-HIR). HDiff-HIR achieves high-quality reconstruction by initializing with pure Gaussian noise and using a network to iteratively refine it. Additionally, we design a condition generation module, called the mask-integrated condition generation module (MCGM), which integrates 2D measurements with the coded aperture of the imaging system as conditions and hierarchically embeds them into the denoising network. Furthermore, within the network, we introduce a novel self-attention mechanism, named local-global spectral-enhanced multi-head self-attention (LGS-MSA), to efficiently capture long-range spatial dependencies in HSIs at relatively modest computational costs while incorporating fine-grained spectral features as complementary information. In LGS-MSA, we incorporate time embeddings to make it time-dependent, enabling it to capture both long-range spatial and temporal dependencies simultaneously. Through comprehensive experiments on both simulated and real datasets, we demonstrate that HDiff-HIR not only outperforms other advanced methods but also exhibits strong generalization capability. 
The code of HDiff-HIR is accessible: https://github.com/chenx2000/HDiff-HIR
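
For readers unfamiliar with CASSI, the snapshot measurement that the reconstruction starts from can be simulated with the commonly used single-disperser forward model: each band is modulated by the coded aperture, shifted along one axis by the disperser, and summed on the detector. The shift direction, the one-pixel step, and the noise-free setting are assumptions of this sketch rather than details taken from the paper.

```python
def cassi_measurement(cube, mask, step=1):
    """Simulate a noise-free CASSI snapshot.

    cube[k][i][j] is the intensity of band k at row i, column j, and
    mask[i][j] is the coded aperture. Band k is shifted by step * k columns
    before all bands are summed on the 2D detector."""
    bands, rows, cols = len(cube), len(mask), len(mask[0])
    width = cols + step * (bands - 1)
    measurement = [[0.0] * width for _ in range(rows)]
    for k in range(bands):
        for i in range(rows):
            for j in range(cols):
                measurement[i][j + step * k] += mask[i][j] * cube[k][i][j]
    return measurement

# Hypothetical 2-band, 2x2 scene and a binary coded aperture.
cube = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
mask = [[1, 0], [0, 1]]
print(cassi_measurement(cube, mask))  # [[1.0, 2.0, 0.0], [0.0, 1.0, 2.0]]
```

Reconstruction is the inverse problem: recovering all bands of `cube` from this single 2D array and the known `mask`, which is why the mask is a natural conditioning signal.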

Abstract:
Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose

Affiliations: School of Instrument Science and Engineering, Southeast University, Nanjing, China; College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China; Laboratoire Traitement du Signal et de l’Image, Universite de Rennes, Rennes, France; Department of Radiology, Center of Interventional Radiology and Vascular Surgery, Zhongda Hospital, Medical School, Southeast University, Nanjing, China; School of Medicine, Case Western Reserve University, Cleveland, OH, USA

Abstract:
The Segment Anything Model (SAM) has demonstrated state-of-the-art performance in most segmentation tasks. However, due to insufficient training in the medical domain, SAM’s ability to generalize to medical images is limited. Although preliminary efforts have fine-tuned SAM for the medical domain, the fine-tuned model still struggles with variability in medical tasks. Some recent studies have explored weakly supervised learning to mitigate SAM’s performance degradation in the medical domain. However, the effectiveness of weakly supervised learning is heavily dependent on the quality of weakly supervised information, with performance significantly dropping as the quality declines. Doctors’ attention is closely related to the target area during diagnosis. Integrating gaze information into SAM’s adaptation process for medical image segmentation enhances efficiency and significantly improves performance in medical tasks. In this paper, we propose a Gaze-assisted medical segment Anything Model (GAM), which utilizes gaze information to enable the adaptation of SAM to medical images following the doctor’s attention. It has two innovations: 1) Feature-level adaptation: Gaze Alignment (GA) learning makes the feature-level adaptation follow the doctor’s attention; it mines human guidance from gaze heatmaps and guides the model to extract general features for downstream tasks. 2) Output-level adaptation: Gaze-Balance (GB) learning makes the output-level adaptation follow the doctor’s attention; it utilizes gaze heatmaps to enhance the human-focused area and address over- and under-segmentation at the output level. Our promising results on 7 tasks with 12 targets have demonstrated the powerful adaptation ability of our GAM in the medical domain. Our GAM demonstrates significant potential for low-cost clinical assistance in medical diagnosis, enabling SAM to adapt to the medical image domain without disrupting clinical workflows.
We have released the full source code on https://github.com/Ruiz1026/GAM

Abstract:
Compared to conventional RGB images, hyperspectral images offer a more comprehensive range of spectral information, encompassing both visible and infrared bands. This enhanced spectral information facilitates trackers in effectively differentiating the target object from background clutter, realizing more robust recognition. However, hyperspectral video cameras exhibit variability in terms of spectral ranges and band numbers, producing distinct modalities. Developing specialized networks for each modality proves to be inefficient and time-consuming, while networks trained for one modality struggle to perform well on others, generating unsatisfactory outcomes. To confront these challenges encountered in various tracking scenarios, HyperTrack is proposed as a unified object tracking network tailored for hyperspectral videos. The proposed HyperTrack can be employed individually for single object tracking across three different modalities of hyperspectral videos: near infrared (NIR), visible (VIS), and red-to-near infrared (RedNIR). Specifically, since hyperspectral data have multiple bands, a band gate module is introduced into the network to enable it to select bands from hyperspectral images, thereby reducing the dimensionality for hyperspectral images. Furthermore, in order to effectively utilize the variability amongst the three different modal data, a computationally simple band embedding module is introduced to improve the object tracking performance of different modal hyperspectral videos. Additionally, a hybrid attention module is devised to efficiently extract and interact features between the template and search at each stage. As a unified network, HyperTrack achieves state-of-the-art comprehensive results across three different types of hyperspectral videos, particularly excelling with VIS and NIR type data. The code and models are publicly available at https://github.com/supertyd/HyperTrack

Abstract:
In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization (MTTF) framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed multi-granularity feature factorization strategy implicitly characterizes the high-dimensional visual signal as compact motion vectors for representation compactness and further transforms these vectors into fine-grained fields for motion expressibility. As such, the coded bit-stream can carry sufficient visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module is developed with enhanced background stability, such that the proposed framework can be optimized towards higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that the proposed method outperforms the latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face videos and moving-body videos in terms of both objective and subjective quality. The project page can be found at https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF

Abstract:
Radio frequency (RF) signals have gained widespread adoption in intelligent perception systems due to their unique advantages, including non-line-of-sight propagation capability, robustness in low-light environments, and inherent privacy preservation. However, their substantial data volumes, generated by the dual-polarization direction characteristic, result in significant challenges to data storage and transmission. To address this, we propose the first end-to-end deep dynamic RF signal compression (DRFC) framework, which primarily focuses on exploiting cross-directional correlation in dynamic RF signals. The proposed framework incorporates four key innovations: (1) a mask-guided RF motion estimation module that leverages Doppler shifts and electromagnetic noise characteristics to identify regions of significant motion using a threshold-based mask, significantly improving motion estimation accuracy; (2) a cross-directional RF motion entropy model that utilizes cross-directional RF motion latent priors to refine the probability distribution for motion entropy coding; (3) a cross-directional RF context mining module that predicts RF contexts from temporal and cross-directional reference signals, adaptively fusing these contexts with confidence maps to maximize complementary information utilization; and (4) a cross-directional RF contextual entropy model that incorporates cross-directional RF contextual latent priors to optimize contextual entropy modeling. Experimental results demonstrate the superiority of our framework over existing codecs. Our DRFC framework achieves significant bitrate savings on benchmark datasets, establishing a strong baseline for future research in this field.

Abstract:
Although lambda-domain-based rate control is widely used in video encoders, developing an efficient rate control scheme for Coding Tree Units (CTUs) under the rate-distortion (R-D) principle remains a significant challenge. In this paper, we propose a spatial-temporal correlation information-based rate control scheme for Versatile Video Coding (VVC), aiming to improve coding performance. We introduce a weight estimation network to establish a CTU-level bit allocation strategy that fully exploits spatial-temporal contextual information. Moreover, the CTU-level coding parameter \lambda is adaptively optimized based on a dependency factor derived from distortion dependency information in both the spatial and temporal domains. Experimental results demonstrate that, compared to the default VVC rate control, the proposed scheme achieves BD-Rate savings of 6.48%, 17.33% and 13.75% in terms of the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM) and the Video Multimethod Assessment Fusion (VMAF), respectively, under the Low Delay_P (LDP) configuration in the VVC Test Model (VTM) 19.0. Furthermore, the proposed method outperforms other state-of-the-art rate control schemes.
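
As a minimal sketch of what weight-driven CTU-level bit allocation can look like, the snippet below splits a frame-level bit budget across CTUs in proportion to per-CTU weights. The function name `allocate_ctu_bits`, the proportional rule, and the sample weights are assumptions for illustration only; the abstract does not state how the outputs of the weight estimation network are converted into bit targets.

```python
def allocate_ctu_bits(frame_bits, weights):
    """Split a frame-level bit budget across CTUs in proportion to their weights.

    `weights` is a hypothetical stand-in for the per-CTU weights that a
    weight estimation network might produce.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [frame_bits * w / total for w in weights]


# Three CTUs sharing a 12000-bit frame budget.
ctu_bits = allocate_ctu_bits(12000, [1.0, 2.0, 3.0])
```

Under this rule, a CTU with a larger weight receives a proportionally larger share, and the per-CTU targets sum to the frame budget.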

Abstract:
Point cloud attribute compression is challenged by fitting the attribute signals living on irregular geometric structures. Existing methods cannot achieve compact multiscale representation for high-fidelity reconstruction using handcrafted transforms or deep learning-based techniques. In this paper, we propose a novel geometry-aware lifting-based multiscale network via a spatial-channel lifting scheme for point cloud attribute compression. The proposed network cascades geometry-aware spatial lifting, which reduces spatial redundancy by adaptively capturing irregular geometric structures, and progressive channel lifting, which progressively reduces channel-wise redundancy in the multiscale representation. Furthermore, we design the split, predict, and update operations for geometry-aware spatial lifting to fully exploit the geometry information representing irregular structures. We develop geometry-aware adaptive split to equally split input points with significance scores indicating their dependencies, and propose geometry-aware cross-attention filtering for the predict and update operations for decorrelation based on geometry information. To the best of our knowledge, this paper achieves the first lifting-based learned transform for point cloud compression that enjoys reversibility guarantees of multiscale representation to enhance rate-distortion performance. Experimental results show that the proposed framework achieves state-of-the-art performance on extensive point cloud datasets, and outperforms the latest MPEG G-PCC standard and the most recent deep learning-based methods.

Abstract:
Video data is growing exponentially daily due to the popularity of video-sharing platforms and the proliferation of video capture devices. The video summarization task has been proposed to remove redundancy while maintaining as many critical parts of the video as possible so that users can browse and process videos more effectively, which has received increasing attention from researchers. The existing research addresses the challenges faced by video summarization methods from various perspectives, such as temporal dependency, data scarcity, user preference, and high precision. This paper reviews representative and state-of-the-art methods, analyzes recent research advances, datasets, and performance evaluations, and discusses future directions. We hope this survey can help future research explore the potential directions of video summarization methods.

Abstract:
Document images are vulnerable to tampering attacks from image editing tools and deep models. Therefore, the Document Tampering Localization (DTL) task has received increasing attention in recent years. However, given the wide variety of document types (e.g., contracts, certificates, ID cards), our analysis shows that existing DTL methods struggle with document images containing diverse background colors and varying semantic contents. Further analysis and experiments verify that the varying background color and semantic contents interfere with the forensic feature extraction process in the existing DTL methods. To address this issue, we propose two disentanglement modules to mitigate such interference and improve the ability of forgery trace detection. First, we design a Color Disentanglement (CD) module that applies disentangled learning representation to forensic features. The CD module, grounded in real-world prior knowledge, effectively decouples color information from forensic features, thereby improving robustness against varying background colors. Second, we propose the Semantic Disentanglement (SD) module, which performs image-level clustering on the tampering probability map during the inference process. The SD module focuses on tampering probabilities for each pixel, while discarding local semantic information (e.g., font, location, and shape). It leads to strong robustness against variations in document content. The evaluations demonstrate that our CD-SD method outperforms existing methods by 45.12% or 0.162 on the F1 metric in cross-dataset tests. Ablation studies show that the CD and SD modules improve the F1 score by 7.98% and 13.38%, respectively, across different backbones. Our method delivers consistent and stable improvements across various experimental protocols. Moreover, it is compatible with many DTL methods in a plug-and-play fashion.
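
For reference, the F1 metric quoted above can be computed from binary prediction and ground-truth masks as sketched below. This is the standard definition of F1; the helper name `f1_score` and the flattened-mask input format are assumptions, and the paper's exact evaluation protocol (thresholding, per-image versus dataset-level averaging) is not given in the abstract.

```python
def f1_score(pred, target):
    """F1 = 2*TP / (2*TP + FP + FN) for two equal-length binary sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = 2 * tp + fp + fn
    # With no positives in either mask, F1 is undefined; return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Flattened 2x2 masks: one true positive, one false positive, one false negative.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```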

Abstract:
Reliable localization and mapping in large-scale outdoor environments remain a critical requirement for autonomous driving and intelligent robotics. While LiDAR-Inertial SLAM systems provide robust performance in many cases, their reliance on purely geometric features often leads to drift or loop closure failure in semantically repetitive or feature-degraded scenes. To address these limitations, we propose a LiDAR-IMU-semantic fusion SLAM framework that tightly integrates semantic perception with geometric and inertial constraints. At the core of our system is a Semantic-enhanced Spatial Triangular Descriptor (S-STD), which jointly models geometric structures, semantic categories, and label confidence to achieve discriminative and robust representation. This descriptor is embedded into a semantic-aware ICP registration model coupled with IMU pre-integration for accurate and stable odometry, and a semantic factor graph optimization framework with a two-stage loop closure detection strategy that combines global semantic vector retrieval and S-STD matching. Extensive evaluations on four public datasets, including KITTI, NCLT, SemanticPoss, and MCD-ViRAL, demonstrate that our approach significantly improves registration accuracy, loop closure recall, and trajectory estimation compared with state-of-the-art methods, while maintaining real-time performance. These results highlight the potential of the proposed framework for robust perception and consistent map construction in autonomous driving and long-term robotic navigation.

Abstract:
In-depth understanding of 3D environments not only involves locating and recognizing individual objects but also requires inferring the relationships and interactions among them. However, most existing methods heavily rely on scene-specific contents, which leads to poor performance due to the noisy, cluttered, and partial nature of real-world 3D scenes. In this work, we find that the inherently hierarchical structures of 3D environments, derived from support relationships, aid in the automatic association of semantic and spatial arrangements of objects and provide rich geometric and topological information independent of specific scenarios. To this end, we propose a 3D scene graph generation model that leverages the hierarchical structures of 3D environments as spatial multimodal knowledge to enhance 3D scene graph generation. Specifically, we first devise a cross-modal tuning approach, where a visually-prompted vision language model is learned to infer the support relationships between objects in a low-resource way. Subsequently, we build a hierarchical visual graph and hierarchical symbolic knowledge graph using the fine-tuned vision language model to extract contextualized visual contents and relevant textual facts, respectively. Finally, we progressively accumulate 3D spatial multimodal knowledge about the hierarchical structures by correlating contextualized visual contents and textual facts using a novel graph reasoning network. In addition, to better evaluate the performance of 3D scene graph generation models, we propose a new benchmark 3DSSG-M by reorganizing the widely-used 3D scene graph generation dataset 3DSSG. This reorganization balances the predicate distribution of 3DSSG and reduces the influence of frequency bias. Extensive results and ablations attest to the effectiveness of the hierarchical structures in 3D environments and demonstrate the superiority of our proposed method over current state-of-the-art competitors.

Abstract:
Feature-based knowledge distillation has attracted significant attention in remote sensing object detection. The main challenge in this method is that feature distillation may misguide the detection of tiny remote sensing objects due to the lack of local background priors. To address this issue, this paper proposes the Localized Background-aware Generative Distillation (LBGD) method, which incorporates two key components: the lightweight diffusion reconstructor (LDR) and the patch-wise channel distillation (PCD) loss. LDR dynamically adjusts the receptive field to effectively capture the local background information surrounding the target. Meanwhile, PCD emphasizes the most salient patch regions in each channel, reducing the impact of global background information. To the best of our knowledge, localized background-aware generative distillation mechanisms have not been previously explored in remote sensing object detection. Extensive experimental results demonstrate that LBGD brings significant performance improvements, for example, on SODA-A (+1.9% mAP) and DIOR (+2.8% mAP). The dataset and code are available at: https://github.com/wchao0601/LBGD

Abstract:
Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, in the process of high-quality data curation, filter-based paradigms often discard a substantial portion of high-quality images due to inadequate semantic alignment between images and texts, leading to inefficiency in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately generated low-quality samples during training. Unlike prior approaches that significantly alter text distributions, our method minimally adjusts text to preserve data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmarks, effectively leveraging raw data and scaling with increasing data volumes. Codes and model are available at https://github.com/hanhuang22/AITQE

Abstract:
3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a Lightweight 4DGS framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight, storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS, which is a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: 1) a spatio-temporal significance pruning strategy that eliminates over 64% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and 2) a deep context model that integrates intra- and inter-prediction with hyperprior into a coarse-to-fine context structure, enabling efficient multiscale latent embedding compression. Our approach achieves over 12× compression and increases rendering FPS by up to 20% compared to the baseline 4DGS, and is also superior to frame-wise state-of-the-art 3DGS compression methods. Experimental results show the effectiveness of our Light4GS in terms of both intra- and inter-prediction methods without sacrificing rendering quality. The code is available at https://github.com/Evan-sudo/Light4GS

Abstract:
Multi-modal image matching is a fundamental task in computer vision that has made significant progress. Due to modality changes and geometric distortions, the distinctiveness of adjacent descriptors and the precision of keypoint positions may not be sufficient to minimize the position errors between matched feature points. Previous works refine matches by either adjusting positions using an initial transformation matrix with a hard elimination threshold or regressing local sub-pixel coordinates supervised by the symmetric epipolar distance function. However, their performance heavily relies on the accuracy of the initial transformation matrix or requires additional camera intrinsic parameter information for supervision. In this paper, we propose a sub-pixel position error estimation network (SPEN) for multi-modal image matching. The proposed method includes three modules: multi-scale feature extraction, pixel-level reliable feature matching, and sub-pixel position error estimation. The multi-scale feature extraction module integrates an adaptive encoder for modality changes and a multi-scale feature fusion block for geometric distortions to extract robust descriptors. The pixel-level reliable feature matching module designs a detection-description coupling detector, which detects reliable keypoints by emphasizing both keypoint repeatability and matchability, thus improving the overall matching performance. The sub-pixel position error estimation module utilizes the reprojection error function with sub-pixel accuracy to supervise the regression of position errors, improving the alignment between point pairs without depending on the accuracy of the initial transformation matrix. Furthermore, this module only requires the transformation matrix generated by random affine adaptation and data augmentation to calculate the reprojection error rather than additional camera intrinsics, making it more versatile and suitable for a broader range of applications.
Experimental results demonstrate the superiority of the proposed method over the state-of-the-art methods on three multi-modal image datasets. Additionally, the ablation study highlights the effectiveness of the proposed components. Our implementation will be available at https://github.com/huhulike/SPEN

Abstract:
As incompleteness is common in real-world data, incomplete multi-view clustering is of great significance in the unsupervised learning field because it allows the partitioning of multi-view data with missing information into distinct groups. In this paper, we propose a novel generalized framework for incomplete multi-view clustering based on robust representation learning and tensor-based co-regularization (RRLTCR). Specifically, a robust principal component analysis is first used to learn a robust representation for each view. To explore high-order relationships among views, the view-specific spectral embeddings are stacked into a third-order tensor with a Schatten p-norm constraint. By spreading the complementary information of the high-quality available data from each view on a global scale, our model is able to alleviate the adverse effects of data noise and uncover the underlying common cluster structure. An effective iterative optimization strategy is developed to efficiently solve our model. According to the experimental results on seven datasets, our proposed framework has the potential to improve the clustering performance for a variety of incomplete multi-view clustering problems. Our research work brings a generalized framework for incomplete multi-view clustering, which can also assist in exploring the large cohort of existing incomplete multimodality datasets for other downstream tasks.
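
To make the Schatten p-norm concrete, the sketch below computes it for an ordinary matrix as the l_p norm of the singular values. This covers the matrix case only and is an assumption about the underlying building block; how the paper extends the constraint to the stacked third-order tensor is not specified in the abstract and is not shown here.

```python
import numpy as np


def schatten_p_norm(matrix, p):
    """Schatten p-norm of a matrix: (sum_i sigma_i ** p) ** (1 / p).

    For 0 < p < 1 this is a quasi-norm; p = 1 gives the nuclear norm and
    p = 2 gives the Frobenius norm.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    singular_values = np.linalg.svd(np.asarray(matrix, dtype=np.float64),
                                    compute_uv=False)
    return float(np.sum(singular_values ** p) ** (1.0 / p))


# A diagonal matrix whose singular values are 4 and 3.
example = np.diag([3.0, 4.0])
nuclear = schatten_p_norm(example, 1)    # sum of singular values
frobenius = schatten_p_norm(example, 2)  # square root of sum of squares
```

Smaller values of p penalize small singular values relatively more, which is why such norms are commonly used as low-rank surrogates.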

Abstract:
Deep neural networks achieve outstanding performance on specific tasks after training. However, directly tuning these models to learn new tasks often leads to the forgetting of previous knowledge, a phenomenon known as catastrophic forgetting. This paper focuses on the Class Incremental Semantic Segmentation (CISS) task, which aims to mitigate forgetting in segmentation models. Despite the significant progress of recent methods, effective knowledge transfer across sequential tasks remains underexplored. Moreover, these methods still struggle with the semantic shift issue. Based on these observations, we introduce a novel transformer-based framework for the CISS task, designed to acquire more task-general knowledge by leveraging the well-aligned text-image feature space of CLIP. Specifically, segmentation is performed by exploiting the matching process between patch-level image features and text features, which facilitates knowledge sharing and transfer across tasks. To address semantic shift, a Class-Agnostic Confidence Prediction (CACP) head is proposed and integrated into the framework, which verifies the existence of different classes independently. This prevents the semantics of a foreground class from being interfered with by the ever-changing ‘background’ class. Additionally, to maintain the ability to segment previous classes while generalizing to future ones, we incorporate a Generalization-Preserving Distillation (GPD) loss and a Query-based Distillation (QD) loss into our framework. We evaluate the proposed framework’s effectiveness using the Pascal VOC and ADE20K datasets, demonstrating superior performance compared to previous state-of-the-art methods.

Abstract:
In the research field of RGB-Thermal salient object detection (RGB-T SOD), the effective exploitation of the complementary characteristics of the two modalities represents a major challenge for enhancing detection performance. Current fusion methodologies can be roughly classified into early fusion and middle fusion strategies, with prevalent techniques primarily encompassing concatenation, summation, and multiplication of the two modalities. To assess the efficacy of these fusion strategies in depth, we conducted an empirical investigation. Our findings demonstrate that the concatenation of middle features constitutes a more advantageous fusion strategy, yielding superior performance and demonstrating enhanced stability. Furthermore, observing the unique properties of thermal (T) images, we introduced gamma correction as a novel data augmentation methodology to RGB-T SOD. We subsequently evaluated the responses across varying correction parameter ranges, revealing that while the response to this data augmentation technique differs across models, the augmentation is effective in general. Building upon these findings, we proposed the Gamma Correction Network (GaCNet). Specifically, we also integrated an image pyramid mechanism in a lightweight manner, which facilitates a more effective recovery of fine-grained image details. Significant improvement was achieved on commonly used RGB-T testing datasets, especially on the VT821 dataset, demonstrating the effectiveness of our method.
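
Gamma correction as a data augmentation can be sketched as follows: the image is normalized to [0, 1] and raised to a randomly drawn power. The function names, the sampling range `[0.5, 2.0]`, and the uniform sampling are assumptions chosen for illustration; the abstract reports that several parameter ranges were evaluated but does not list them.

```python
import numpy as np


def gamma_correct(image, gamma):
    """Apply out = in ** gamma to an image with values in [0, 1]."""
    image = np.clip(np.asarray(image, dtype=np.float64), 0.0, 1.0)
    return image ** gamma


def random_gamma_augment(image, low=0.5, high=2.0, rng=None):
    """Draw gamma uniformly from [low, high] (assumed range) and apply it."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.uniform(low, high)
    return gamma_correct(image, gamma)


# A tiny single-channel "thermal" patch with intensities in [0, 1].
thermal = np.array([[0.0, 0.25], [0.5, 1.0]])
brightened = gamma_correct(thermal, 0.5)   # gamma < 1 lifts mid-tones
augmented = random_gamma_augment(thermal, rng=np.random.default_rng(0))
```

Values of gamma below 1 brighten mid-range intensities and values above 1 darken them, while 0 and 1 are left unchanged.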

Abstract:
Inverse synthetic aperture radar (ISAR) and optical image fusion aims to generate a composite image that simultaneously emphasizes the prominent contours of spacecraft from optical images and preserves the rich texture information inherent in ISAR images. However, the limited receptive fields of spatial-domain methods restrict their ability to capture global contextual dependencies among strong scattering points in ISAR images and to effectively integrate complementary optical features. To tackle this challenge, we propose a phase-guided cross-frequency integration module (PGCFIM), which exploits the intrinsic global modeling capability of the frequency domain and the semantic expressiveness of the phase spectrum. Specifically, a deep Fourier transform is employed to establish an image-wide receptive field for intra-domain global modeling. Subsequently, phase components are explicitly aggregated, and a gating mechanism is introduced to guide the integration of inter-domain long-range dependencies, enabling effective learning of complementary cross-modal representations. To eliminate reliance on hand-crafted fusion strategies, we design an end-to-end network, named PGCFINet. By jointly enhancing cross-domain interaction, frequency-domain global awareness, and explicit complementary feature integration, PGCFINet significantly strengthens cross-domain and cross-modal information interaction representation. Furthermore, to mitigate the current lack of ISAR and optical image datasets, we construct a new dataset comprising various spacecraft models, offering an alternative benchmark for evaluation. Extensive experiments show that PGCFINet achieves superior performance to state-of-the-art methods in both qualitative and quantitative assessments. Moreover, PGCFINet is extended to infrared and visible image fusion, and the favorable results further validate its robust generalization ability.
The codes of our fusion method and the dataset are forthcoming at https://github.com/WangZe0622/PGCFINet
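
The frequency-domain view used above rests on splitting an image's 2D Fourier spectrum into amplitude and phase. The sketch below shows only that decomposition and its inverse with NumPy; the helper names are hypothetical, and the learned aggregation and gating in PGCFIM are not reproduced here.

```python
import numpy as np


def amplitude_phase(image):
    """Return the amplitude and phase spectra of a 2D image."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


def recombine(amplitude, phase):
    """Rebuild a real-valued image from amplitude and phase spectra."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(spectrum))


# Round trip on a small synthetic image.
image = np.arange(16.0).reshape(4, 4)
amp, pha = amplitude_phase(image)
restored = recombine(amp, pha)
```

Every frequency coefficient depends on all pixels, which is the sense in which a Fourier-domain operation has an image-wide receptive field.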

Abstract:
Real-world multi-modal image fusion is hindered by mixed and unknown degradations such as low-light noise, blur, and exposure shifts. Prior fusion methods seldom estimate degradation explicitly at the modality level, which limits conditional fusion when the underlying degradation distribution shifts across scenes, sensors, and tasks. We introduce DMFusion, a two-stage, degradation-aware framework that couples degradation inference with conditional expert routing and pixel-level integration. A CLIP-LoRA discriminator estimates modality-specific degradation vectors, which condition a Degradation-Customized Mixture of Experts to select specialized fusion and restoration pathways. Guided by these selections, a FusionGate decoder performs pixel-level integration and reconstructs both a fused image and high-quality restored source images. Evaluated on LLVIP, RoadScene, and MSRS with 1,491 training pairs and 389 test pairs, DMFusion delivers state-of-the-art performance across standard fusion metrics and remains robust under severe degradations. On the M3FD detection task, the fused images reach mAP at IoU 0.50 of 0.879, while a fusion-only mode attains 125 milliseconds per image, and the multi-output mode remains efficient. These results show that explicit degradation inference, realized through learned conditioning of sparse experts and gated decoding, yields reliable fusion quality and practical benefits for downstream vision systems. The code is publicly available at https://github.com/CrisT777-JN/DMFusion

Affiliations: School of Information and Engineering, Nanjing Xiaozhuang University (NJXZC), Nanjing, China; Singapore University of Technology and Design (SUTD), Tampines, Singapore; National University of Singapore (NUS), Queenstown, Singapore; PCA Laboratory, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Laboratory of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Abstract:
Real-world image dehazing is a challenging task due to the difficulty of collecting aligned hazy/clear image pairs under unpredictable and complex environments. To address this limitation, we propose a Physical-Guided Posterior Sampling (PGPS) method that designs a dehazing reconstruction posterior to sample an RGB image and a depth map from a pre-trained unconditional diffusion generation process. First, we introduce a Hybrid Degradation Atmospheric Scattering Model (HD-ASM) to adapt the diffusion model, enabling the generation of high-fidelity dehazed images from posterior samples without relying on the aligned hazy/clear image pairs. Second, we propose a two-stage sampling strategy with piecewise loss to improve sampling quality and stability, along with a post-processing technique to remove JPEG compression artifacts amplified by dehazing. Extensive experiments show that our method outperforms state-of-the-art techniques in image dehazing, including in the RTTS dataset’s complex human-vehicle environment. Additionally, our approach surpasses other benchmarks in object detection, exhibiting superior generalization performance.
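
For context, the standard atmospheric scattering model relates a clear image J and a depth map d to a hazy observation via I = J * t + A * (1 - t), with transmission t = exp(-beta * d). The sketch below implements only this textbook model; the hybrid degradation terms of HD-ASM are not described in the abstract and are not modeled, and the function name and default parameter values are assumptions.

```python
import numpy as np


def synthesize_haze(clear, depth, beta=1.0, airlight=1.0):
    """Standard atmospheric scattering model: I = J * t + A * (1 - t).

    clear:    H x W or H x W x C image with values in [0, 1]
    depth:    H x W depth map (non-negative)
    beta:     scattering coefficient (assumed value)
    airlight: global atmospheric light A (assumed value)
    """
    clear = np.asarray(clear, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    transmission = np.exp(-beta * depth)
    if clear.ndim == depth.ndim + 1:
        # Broadcast the per-pixel transmission over the color channels.
        transmission = transmission[..., None]
    return clear * transmission + airlight * (1.0 - transmission)


clear = np.full((2, 2, 3), 0.2)
near = synthesize_haze(clear, np.zeros((2, 2)))      # zero depth: no haze
far = synthesize_haze(clear, np.full((2, 2), 50.0))  # large depth: airlight dominates
```

Because the hazy observation depends on both the clear image and the depth, a forward model of this kind needs an RGB estimate and a depth estimate together.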

Abstract:
Light field (LF) full-view depth estimation aims to recover dense and coherent depth maps for all sub-aperture views, which is crucial for applications such as 3D reconstruction, LF editing and virtual reality. However, directly extending center-view volume-based methods to the full-view is computationally infeasible, as it requires constructing a separate cost volume for each view. Besides, existing full-view propagation-based approaches, while more efficient, frequently suffer from edge fattening and cross-view inconsistencies in the presence of occlusions. In this paper, we propose a gradient-guided density redistribution network (GDRNet), a novel end-to-end framework that efficiently generates full-view depth maps by constructing a single plane-density volume and a multi-plane depth image, which are then propagated to all angular views. To resolve ambiguous estimates at occlusion edges, we perform a direction-aware gradient-guided density redistribution only inside a dilated edge narrow band. For each center pixel in edge regions, a guidance gradient is derived from the initial depth map to determine the normal and tangent directions. Then, density in edge fattening regions can be redistributed via sampling along the normal direction, while similarity along the tangent direction can fill bad pixels with inconsistencies. Furthermore, an adaptive edge extraction module with four directional learnable Sobel kernels is designed to jointly exploit spatial and angular gradients, enabling robust detection and localizing the refinement band. Extensive experiments on synthetic and real-world LF datasets demonstrate that GDRNet achieves state-of-the-art accuracy and edge sharpness in both quantitative and qualitative evaluations, while maintaining computational efficiency compared to full-view methods.

Abstract:
Dark optical flow estimation aims to predict pixel-wise displacement between consecutive noisy dark frames. Existing methods primarily focus on enhancing feature-specific representations before cross-image matching, with little attention devoted to the inherent dark degradation during flow decoding for achieving holistic motion understanding of a given dark scene. In this paper, we introduce the Position-surpassing Flow Estimator (PsFE), which integrates a global graph method into flow decoders to accentuate holistic motion discrimination and robustness. In detail, we incorporate a graph-based motion reconstruction into the decoding paradigm to adaptively aggregate motion-rich feature channels and suppress degraded ones from a more global view. This characteristic suppression retains the graph structure, which is a robust characteristic in the dark. To accurately encode long-range pixel connections, PsFE employs a novel masked global encoder to capture the top-k important features by using a sparse masking strategy and dynamic inductive modulation that suppresses noise and interference that only exist under dark conditions. Experiments on the challenging FCDN and VBOF benchmarks demonstrate the effectiveness of our PsFE with superior performance over advanced methods.
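
The idea of keeping only the top-k entries through a sparse mask can be illustrated with a generic sketch. It shows plain top-k masking of a score vector followed by a softmax, which is one common way to sparsify attention. It is not the masked global encoder of PsFE, and all function names here are hypothetical.

```python
import math


def topk_mask(scores, k):
    """Keep the k largest scores and set the rest to -inf."""
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else -math.inf for i, s in enumerate(scores)]


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention_weights(scores, k):
    """Attention weights restricted to the k highest-scoring positions."""
    return softmax(topk_mask(scores, k))
```

Masked positions contribute exactly zero weight, so low-scoring (for example, noise-dominated) entries cannot leak into the aggregated feature.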

Abstract:
Continual Semantic Segmentation (CSS) suffers from catastrophic forgetting, particularly challenging for traditional per-pixel methods. Our prior work, CoMasTRe (CVPR 2024), introduced a query-based approach leveraging objectness by disentangling CSS into objectness learning and class recognition stages. While effective, CoMasTRe exhibited performance limitations due to feature forgetting within its pixel decoder. This paper presents CoMasTRe+, an enhanced framework specifically designed to overcome this limitation. The core contribution is a novel plugin, the Mixture of Continual Adapters (MoCA), integrated into the pixel decoder. MoCA is a dynamic architecture that mitigates feature forgetting by learning task-specific expert adapters. Crucially, MoCA employs a task-aware routing strategy and a novel adaptive routing distillation objective, tailored for continual learning, to preserve specialized feature representations across sequential tasks. CoMasTRe+ further enhances the class decoder using MoCA for improved recognition and simplicity. We extensively evaluate CoMasTRe+ on PASCAL VOC and ADE20K for continual semantic and panoptic segmentation. Experiments demonstrate that CoMasTRe+ effectively addresses the identified feature forgetting issue, significantly outperforms the original CoMasTRe, and achieves state-of-the-art results compared to both per-pixel and query-based baselines.

Affiliations: Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, and Anhui Provincial Key Laboratory of Security Artificial Intelligence, School of Artificial Intelligence, State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Anhui University, Hefei, China; Anhui Provincial Key Laboratory of Intelligent Detection and Diagnosis for Traffic Infrastructure, Anhui Jiaojian Traffic Development and Research Center Company Ltd., Hefei, China

Abstract:
Referring image segmentation aims at segmenting the target object referred to by a natural language expression, which requires semantic-level object understanding and pixel-level contour segmentation. Existing methods are limited to the spatial domain, thus ignoring potential discriminability from the frequency domain and missing the mutual boost between the spatial and frequency domains, while also facing heavy high-frequency degeneration issues. In this paper, we revisit the frequency domain and propose a novel lightweight spatial-frequency joint tuning (SFJT) plugin for referring image segmentation. We decompose 4 mainstream methodology paradigms into two general stages consisting of rough semantic-level object understanding and precise pixel-level contour segmentation, then respectively enhance them with parameter-efficient spatial-frequency tuning strategies. Specifically, we develop a spatial-frequency joint prompting technique (SFJ-Prompt) for the early object understanding stage, which mines bidirectional spatial-frequency information to realize mutual spatial-frequency boosting, facilitating more comprehensive object understanding. Besides, we introduce a LoRA-based high-frequency auxiliary branch (HF-LoRA) for the later contour segmentation stage, which compensates for heavy high-frequency degeneration issues in spatial neural networks, facilitating more precise contour segmentation. Finally, extensive experiments with 4 mainstream methodology paradigms for referring image segmentation on 4 large-scale datasets demonstrate the effectiveness and superiority of the proposed method.

Abstract:
While conventional lossy compression methods predominantly depend on autoencoders to map point clouds into latent representations, they often neglect the intrinsic redundancy within these latent points. To address this limitation, this paper presents a diffusion-based architecture steered by sparse priors, designed to minimize latent redundancy while securing superior reconstruction fidelity, particularly in low-bitrate scenarios. A key feature of the framework is an efficient dual-density data flow that alleviates the stringent size constraints imposed on latent points. By integrating a Probabilistic Attention-based Conditional Denoiser (PACD), the method effectively encapsulates critical reconstruction details within sparse priors, which are hierarchically decoupled into intra- and inter-point components. Specifically, separate encoders are utilized to transform the source point cloud into latent points and decoupled sparse priors, respectively. To dynamically exploit geometric and semantic information, an attention-driven latent denoiser, conditioned on these decoupled priors, is applied across the encoding and decoding layers. Furthermore, inter-point distributions are incorporated into the arithmetic codec to refine local context modeling for sparse points, with the final point cloud recovered via a point decoder. Comprehensive experiments conducted on ShapeNet and standard MPEG PCC datasets demonstrate that the proposed method outperforms state-of-the-art techniques, achieving a superior rate-distortion trade-off.

Abstract:
We introduce ST-ObjGS, a method using Space-time Gaussian surfels for accurate object segmentation within 4D representations. Our approach addresses the limitations of current Gaussian-based methods, which primarily focus on static 3D scene understanding and struggle with geometrically accurate object segmentation in complex dynamic scenes. To ensure robust object-level segmentation, we first integrate Grounded SAM 2, which enables text prompt-based object selection and tracking. We then learn a set of Gaussian surfels for object geometry representation and employ a marginal 1D Gaussian for dynamic modeling at each timestamp. To improve geometric quality when modeling surfaces, we use depth and surface normal for geometric regularization. Furthermore, to address continuity and flickering issues in complex scenes, we implement dynamic-aware regularization to maintain temporal consistency. This approach allows us to capture object motion and morphing over time while maintaining spatial coherence. To the best of our knowledge, ST-ObjGS is the first self-supervised approach using Space-time Gaussian surfels for consistent segmentation of dynamic 3D objects in real-world scenes. Extensive experiments on standard benchmarks including PKU-DyMVHumans, Plenoptic Video, Google Immersive, and CMU Panoptic datasets demonstrate that ST-ObjGS produces more precise object masks than its Gaussian-based counterparts and significantly outperforms supervised single-view baselines.

Abstract:
Multi-modal 3D object detection with bird’s eye view (BEV) has achieved notable advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption such as sensor configurations for LiDAR and scene conditions for camera. One design bottleneck of previous models resides in the tight coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one modality or both are corrupted. To mitigate this, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways. These invariant features can be recovered across modalities for robust fusion under data corruption. To this end, we explicitly decouple Camera/LiDAR BEV features into modality-invariant and modality-specific parts. This allows invariant features to compensate for each other while mitigating the negative impact of a corrupted modality on the other. We then recouple these features into three experts to handle different types of data corruption, respectively, i.e., LiDAR, camera, and both. For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement. Finally, we adaptively fuse the three experts to extract robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.

Abstract:
Although Multimodal Large Language Models (MLLMs) have shown remarkable generalization across diverse vision-language tasks, recent studies reveal their limitations in visual discrimination. These challenges arise not from insufficient model capacity, but from existing training paradigms that favor linguistic priors over detailed visual analysis. While existing approaches address this limitation through external interventions such as feature integration or knowledge augmentation, we propose a Group-Relative Visual Discrimination Enhancement framework that unlocks the intrinsic capability of MLLMs and requires no external resources. Our method introduces a Group-Relative Reinforcement Learning paradigm equipped with a lightweight Visual Patch Selection Plugin to dynamically select discriminative visual tokens. The framework establishes a self-feedback loop between the visual encoder and the language decoder, leveraging the dual reward-penalty signals derived from the model’s internal language feedback to optimize the visual focus, thereby enhancing the model’s visual discrimination capabilities. Extensive experimental results across six visual recognition benchmarks and two VQA benchmarks demonstrate the effectiveness of our method. Code is available at https://github.com/FannierPeng/GROVE

Abstract:
Change detection (CD) is a critical task in remote sensing (RS) image analysis. Recent deep learning networks for CD focus on identifying changes after mining the features of bi-temporal images separately. However, lighting differences between bi-temporal images lead the networks to extract different features from identical objects, which may cause pseudo-changes. From the Fourier transform perspective, an image can be decomposed into amplitude and phase, where the amplitude contains most of the lighting information and the phase is relevant to structure information. Therefore, amplitude-invariant features of identical objects under different lighting conditions are roughly the same, which is pivotal for distinguishing real changes from fake ones between bi-temporal images. In this article, we propose a capturing amplitude-invariant features network (CAIFNet), which reduces dependence on amplitude and captures diverse amplitude-invariant features. Firstly, we build an amplitude pre-processing module (APM) to provide diverse processed images by randomly mixing the amplitudes of the input images with the amplitudes of the reference images while keeping the phases of the input images constant. Secondly, a quadruple-stream encoder is proposed to capture amplitude-invariant features. Specifically, it is forced to learn and capture amplitude-invariant local details and amplitude-invariant contextual semantics based on the diverse processed images under a CD task-oriented constraint; the two reinforce each other and become more accurate through a local attention guide strategy (LGS). Moreover, a difference enhancement module (DEM) is designed in the quadruple-stream encoder to enhance the difference features. Thirdly, a bi-stream decoder decodes the captured amplitude-invariant features from the main and boundary difference perspectives, enhancing the main body and boundary details of the objects in the change maps, respectively. Finally, a spatial embedded module (SEM) allows the main and boundary difference features to be embedded into each other, obtaining more complete change maps. On three remote sensing change detection (RSCD) datasets, CAIFNet achieves better transferability and results compared to state-of-the-art methods. The source code is available at https://github.com/yihui1230/CAIFNet.
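
The amplitude-mixing idea (blend the amplitude spectra of an input and a reference while keeping the input's phase) can be sketched on a 1D signal as follows. This is a toy illustration under stated assumptions, not the APM implementation: it uses a naive discrete Fourier transform on a list instead of a 2D FFT on images, and the mixing weight `lam` and all function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def mix_amplitude(inp, ref, lam):
    """Blend amplitude spectra with weight lam in [0, 1]; keep the phase of inp."""
    mixed = []
    for a, b in zip(dft(inp), dft(ref)):
        amplitude = (1.0 - lam) * abs(a) + lam * abs(b)
        mixed.append(cmath.rect(amplitude, cmath.phase(a)))
    # For real inputs the mixed spectrum stays conjugate-symmetric,
    # so the imaginary part of the inverse transform is only rounding error.
    return [value.real for value in idft(mixed)]
```

With `lam = 0` the input is returned unchanged. Larger values move the amplitude (which, per the abstract, carries most of the lighting information) toward the reference while the phase-encoded structure stays that of the input.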

Abstract:
Graph transformer networks have received increasing attention in hyperspectral image (HSI) classification. However, they overlook the influence of graph connectivity strength in positional encoding and distribution. To address these deficiencies, we propose a novel graph transformer with structural embedding and training (GTSET) for HSI classification. Specifically, the structural embedding module first extracts local and non-local feature information effectively via patch-based distance encoding and centrality correlation coefficients based on graph connectivity strength, alleviating spectral variability. Second, the structural training module addresses the imbalanced structural position distribution of labeled samples by leveraging the topological graph connectivity to determine their structural position distribution and reweighting the influence of labeled samples in the graph transformer training stage, exploring the guiding role of labeled samples in HSI with low spatial resolution. Next, we further refine the training weights based on the spectral feature smoothness of labeled samples. Finally, comprehensive experiments on three real-world HSI datasets demonstrate that GTSET achieves superior performance in HSI classification with limited labeled samples, compared to other popular classification methods. An implementation of GTSET, along with examples, can be found in the GitHub repository: https://github.com/xuchengchao0/GTSET

Abstract:
Large and diverse image datasets have facilitated the recent advances in deep-learning-based computer vision applications. Whereas datasets with images depicting normal-weather scenes are plentiful, datasets with images depicting inclement weather conditions, such as haze, remain scarce due to collection difficulties. In response to this problem, we present a novel domain flow adaptation network (DFA-Net) that can control the haze density and facilitate the generation of realistic and diverse hazy images. DFA-Net employs a density variable to direct the network to learn and yield the desired images and is composed of four modules: a semantic extraction (SE) module, a haze extraction (HE) module, an image production (IP) module, and an image assessment (IS) module. The SE and HE modules are used to capture the semantic structure and style representation of clear and hazy images, respectively, and provide them to the IP module for refining the output images. The IP module is adopted to yield hazy images in a coarse-to-fine fashion, while the IS module is responsible for examining the realism of the synthesized results. Experiments on multiple benchmark datasets confirm the effectiveness of the proposed DFA-Net, which outperforms competing approaches by achieving improvements of up to 147% in quality, 237% in fidelity, and 354% in the diversity of generated images.

Abstract:
Natural disasters pose a threat to the safety of human life and buildings. Rapid and accurate building damage assessment (BDA) on remote sensing images is crucial for disaster response and recovery. However, most methods are built on ideal co-registered bitemporal remote sensing images, neglecting the misalignment that occurs in practice. In this paper, we propose a novel building damage assessment method, termed FlowMamba, which can effectively handle the offset between the pre- and post-disaster images in the BDA task. Specifically, a vision mamba backbone with four stages is utilized to extract multi-scale features from the pre- and post-disaster images. Then, a differential optical flow alignment module is designed to estimate a shift matrix to align pre- and post-disaster features. Furthermore, a category distance-aware loss function is tailored for the BDA task, which replaces the fixed binary values of the penalty factors with soft values of inter-class distance. Extensive experiments on the xBD dataset, the BRIGHT dataset and four out-of-distribution disaster scenarios validate the robustness and effectiveness of the proposed FlowMamba. Our code is available at https://github.com/flying318/FlowMamba

Abstract:
Multi-modal learning, which fuses complementary information from different modalities, has significantly improved the accuracy of land cover classification, especially under adverse conditions like cloudy or rainy weather. Recent advancements in multi-modal remote sensing land cover classification (MMRLC) have witnessed the efficacy of approaches based on CNN and Transformer. However, CNN exhibits limitations in capturing long-range dependencies, whereas Transformer suffers from high computational complexity. Recently, Mamba has garnered widespread attention due to its superior long-range modeling capabilities with linear complexity. Nevertheless, Mamba demonstrates notable limitations when directly applied to MMRLC, including limited local contextual modeling capacity, suboptimal multi-modal feature fusion and the lack of a task-specific spatial continuity scanning strategy. Hence, to fully explore the potential of Mamba in multi-modal land cover classification, we propose LSFMamba, which comprises multiple hierarchically connected local-enhanced fusion Mamba (LFM) modules. Within each LFM module, a local-enhanced visual state space (LVSS) block is designed to extract features from different modalities, while a cross-modal interaction state space (CISS) block is created to fuse these multi-modal features. In the LVSS block, we integrate a multi-kernel CNN block into the gating branch in Mamba to enhance its local modeling capabilities. In the CISS block, features from different modalities are interleaved, facilitating cross-modal feature interaction through the state space model. Furthermore, we introduce a novel spiral scanning strategy to reassess the significance of central pixels, a design driven by the unique characteristics of the pixel-wise classification task. Extensive experimental results on three multi-modal remote sensing datasets demonstrate that the proposed LSFMamba achieves state-of-the-art performance with lower complexity. The code will be released at https://github.com/hhchhang78/LSFMamba

Abstract:
This paper introduces a novel lag-aware dual-stream (LADS) framework and a carefully curated dual-view video dataset for automatic Autism Spectrum Disorder (ASD) classification through imitation tasks. In contrast to prior single-view approaches that overlook the interactive dynamics of imitation, our dataset is the first to capture synchronized experimenter-child interactions with rich pose and motion features. Building on this data, the LADS framework explicitly learns the temporal alignment between the experimenter’s demonstration and the child’s imitative response. A Lag-Aware Alignment module uses constrained cross-attention to compute an adaptive time warping and extract per-frame lag features, revealing delays in the child’s imitation. Additionally, a lightweight diffusion-based regularizer enforces representation consistency by denoising perturbed child features conditioned on the experimenter’s motion, improving generalization. We then classify ASD versus Typical Development (TD) behavior by integrating the aligned dual-stream features, imitation lag, and action discrepancy within an attention-pooling classifier. Experiments on our dual-view imitation dataset show that LADS significantly outperforms conventional single-stream models and a recent dyadic transformer baseline, achieving state-of-the-art classification accuracy. The results demonstrate the importance of modeling interpersonal timing in social behavior analysis. Our work provides a new, public dataset and a computational tool for interdisciplinary research, bridging computer vision and psychological studies of autism. Both dataset and code will be made publicly available.

Abstract:
Surgical scene understanding is a vital intelligent technique in robot-assisted surgery, including surgical instrument detection, segmentation, and instrument–tissue interaction detection. Existing methods typically address these tasks in isolation, neglecting the intrinsic correlations among them. In this work, we innovatively propose a unified multitask framework named UniSurg, being the first to jointly address these three critical aspects of surgical scene understanding, thereby providing the robot with multidimensional perceptual capabilities. By exploring the inter-task correlations and reusing shared features, UniSurg has been demonstrated to significantly enhance the scene analysis performance. To address pose variability of the instruments under the constrained field of view in laparoscopic surgery, we design an Attention Enhanced Conditional Convolution (AEC-Conv) that dynamically adjusts kernels based on pose-specific features for improved adaptability. To further enhance interaction detection, we propose the Temporal Difference Enhancement module (TDE), which captures motion cues by amplifying inter-frame differences, and the Pyramid Global Feature Enhancement module (PGFE), which leverages graph-based hierarchical context to model global relational dependencies. Experiments on the Endovis2018 dataset and a clinical multitask dataset MILVis demonstrate the superior multitask performance of UniSurg.

Abstract:
With the great success of diffusion models in image generation, diffusion-based image compression is attracting increasing interest. However, due to the random noise introduced in diffusion learning, these methods usually produce reconstructions that deviate from the original images, leading to suboptimal compression results. To address this problem, in this paper, we propose a Noise Constrained Diffusion (NC-Diffusion) framework for high fidelity image compression. Unlike existing diffusion-based compression methods that add random Gaussian noise and direct the noise into the image space, the proposed NC-Diffusion formulates the quantization noise originally added in learned image compression as the noise in the forward process of diffusion. Then a noise constrained diffusion process is constructed from the ground-truth image to the initial compression result generated with quantization noise. NC-Diffusion overcomes the problem of noise mismatch between compression and diffusion, significantly improving the inference efficiency. In addition, an adaptive frequency-domain filtering module is developed to enhance the skip connections in the U-Net based diffusion architecture, in order to enhance high-frequency details. Moreover, a zero-shot sample-guided enhancement method is designed to further improve the fidelity of the image. Experiments on multiple benchmark datasets demonstrate that our method can achieve the best performance compared with existing methods.
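
As background on the "quantization noise" referred to above: learned image compression commonly rounds latent values at inference time and approximates that rounding with additive uniform noise during training. The sketch below illustrates only this generic bounded-noise property; it is not the NC-Diffusion forward process, and the function names are hypothetical.

```python
import random


def quantize(latent):
    """Hard quantization: round every latent value to the nearest integer."""
    return [float(round(v)) for v in latent]


def quantization_noise(latent):
    """Noise introduced by rounding; each entry lies in [-0.5, 0.5]."""
    return [q - v for q, v in zip(quantize(latent), latent)]


def noisy_proxy(latent, rng):
    """Training-time stand-in for rounding: add uniform noise in [-0.5, 0.5]."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]
```

Unlike Gaussian noise, this perturbation is bounded, which gives some intuition for why a diffusion process driven by Gaussian noise does not match the noise that compression actually introduces.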

Affiliations: School of Artificial Intelligence and Information Engineering, Zhejiang University of Science and Technology, Hangzhou, China; School of Information and Electronic Engineering and Technology, Zhejiang University of Science and Technology, Hangzhou, China; School of Information and Electronic Engineering, Liaoning University of Technology, Anshan, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; Zhejiang Key Laboratory of Artificial Intelligence of Things (AIoT) Network and Data Security, Hangzhou, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract:
Current adversarial attacks pose a serious threat to the robustness of visual-language models (VLMs), including vision-language pre-trained models (VLPMs) and multimodal large language models (MLLMs). Traditional adversarial attacks are example-specific and rely on specific datasets. This practice suffers from low transferability and additional computation cost, while universal adversarial perturbations (UAPs) offer example-agnostic solutions by generalizing across inputs. However, current UAP methods mainly target VLPMs, demonstrating limited transferability and effectiveness on MLLMs. To bridge this gap, we propose the Recursive Perturbation Attack (RPA), a novel black-box UAP method for both VLPMs and MLLMs. RPA employs a recursive perturbation strategy, utilizing token filtering and polynomial sampling methods to generate perturbations, thereby achieving incremental disruption and enhancing the transferability of the attack. To further enhance the effectiveness of the attack, RPA integrates a three-tier modality decoupling strategy, disentangling intra-modal, cross-modal, and fusion-modal features to effectively disrupt feature alignment and interactions. Extensive experiments validate that RPA achieves superior attack performance compared to existing UAP approaches. This work highlights new security concerns in multimodal AI systems and provides insights into the design of more robust models. Code is available at https://github.com/chilljudaoren/RPAttack

Abstract:
Modeling visual perception in a manner consistent with human subjective evaluation has become a central direction in both video quality assessment (VQA) and broader visual understanding tasks. While free-energy-guided self-repair mechanisms—reflecting human observational experience—have proven effective in image quality assessment, extending them to VQA remains non-trivial. In addition, biologically inspired paradigms such as holistic perception, local analysis, and gaze-driven scanning have achieved notable success in high-level vision tasks, yet their potential within the VQA context remains largely underexplored. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs—resized full-frame images and patch-based fragments—to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design. Our code will be publicly available at https://github.com/handsomewzy/EyeSim-VQA

Abstract:
Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. While existing methods often rely on task-specific physical priors, their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose WeatherCycle, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle employs a lumina-chroma decomposition strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a Lumina Degradation Guidance Module (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. Additionally, we incorporate a Difficulty-Aware Contrastive Regularization (DACR) module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across several multi-weather datasets demonstrate that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.

Abstract:
Accurate estimation of reflectance and illumination maps in the Retinex framework remains a significant challenge for low-light image enhancement due to inherent decomposition ambiguity. To address this, we propose DualPrior-Retinex, a novel framework that, inspired by Retinex theory, leverages the practical advantages of the YUV color space for robust enhancement. Our framework introduces a dual-prior architecture that effectively decouples the restoration process. It combines a diffusion-based global prior, responsible for ensuring low-frequency content consistency, with a YUV-based local prior designed to preserve high-frequency structural details. These complementary components are integrated by our Hierarchical Prior Fusion Module (HPFM), which balances perceptual quality with pixel-level fidelity in complex low-light scenarios. Extensive evaluations on multiple benchmarks demonstrate that our method achieves state-of-the-art performance across diverse metrics and visual qualities. Codes will be released at https://github.com/I2-Multimedia-Lab/DP-Retinex

Abstract:
While Tri-plane representation has greatly advanced the development of 3D generative models, problems rooted in its inherent structure, such as multi-face artifacts caused by sharing the same features in symmetric regions, limit its ability to generate complete 360° views. In this paper, we propose CylinderPlane, a novel representation based on the cylindrical coordinate system, to achieve high-quality, artifact-free panoramic image synthesis. Unlike the inevitable feature entanglement in the Cartesian coordinate-based representation, the cylindrical coordinate system explicitly disentangles features at different angles. Consequently, our representation effectively eliminates feature ambiguity and ensures multi-view consistency across full 360°. We further develop a nested cylinder representation that combines cylinder planes of varying radii to achieve multi-scale feature fusion. This design not only addresses the limitations of Tri-plane in modeling complex geometries and varying resolutions, but also mitigates the polar discontinuity inherent in a single cylinder plane. Moreover, our versatile representation can be seamlessly integrated into various generative frameworks and rendering pipelines. Extensive experiments on both synthetic datasets and unstructured in-the-wild images demonstrate that our representation outperforms the existing methods.
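
The coordinate change underlying a cylindrical representation can be sketched as follows. This is only the standard Cartesian-to-cylindrical conversion, not the CylinderPlane feature lookup; taking `z` as the cylinder axis is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def cartesian_to_cylindrical(x, y, z):
    """Map (x, y, z) to (radius, theta, z), with theta in (-pi, pi]."""
    radius = math.hypot(x, y)
    theta = math.atan2(y, x)
    return radius, theta, z


def cylindrical_to_cartesian(radius, theta, z):
    """Inverse mapping back to Cartesian coordinates."""
    return radius * math.cos(theta), radius * math.sin(theta), z
```

Two points on opposite sides of the axis, such as (1, 0, 0) and (-1, 0, 0), share the same radius and height but receive angles 0 and pi, so features indexed by the angle are kept apart rather than shared. The wrap-around of `theta` at ±pi is the kind of discontinuity that a single cylinder plane has to handle.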

Abstract:
Large-scale images play a crucial role in geospatial surveying, as they cover an extensively broad view and diverse objects. Due to computational limitations, existing methods rely on generating large-scale images in patches. However, the lack of global guidance in these methods often leads to significant logical errors among different patches. To address this issue, we propose a Global Consistency Diffusion model (Glob-Diffusion) for large-scale image generation. The core idea is to utilize the global consistency of small-scale images to guide the generation of large-scale images. Specifically, we introduce a Hierarchical Distributed Guidance (HDG) module that extracts patch prompts with different semantic hierarchies from small-scale images, distributedly embedding them into the generation of large-scale images to maintain global consistency across various regions. In addition, we further design a Region Guided Adapter (RGA) that dynamically optimizes the guidance strength of patch prompts by comparing differences across generated regions, effectively improving the realism of large-scale images. Our method demonstrates remarkable visual synthesis results across various natural scenes, effectively preserving global consistency in large-scale images, and also significantly enhancing the generation quality of large-scale remote sensing images. Code will be available at https://github.com/kyh433/Glob-Diffusion

Abstract:
Visible-infrared person re-identification (VI-ReID) retrieves cross-modal identity matches between visible and infrared images, offering significant value for round-the-clock surveillance. Despite recent advances, challenges remain: the task relies heavily on high-quality annotations, and factors such as occlusion, viewpoint variations, and the inherent difficulty of labeling infrared images inevitably introduce noisy annotations (NA) into the dataset during large-scale dataset construction. Moreover, coupled noisy labels in two modalities lead to noisy correspondence (NC), further complicating the learning process. Although prior research has achieved relatively stable results in addressing the NA and NC problem for VI-ReID through noise detection and robust loss functions, they still exhibit certain limitations: 1) Underutilization of training data. Existing methods often discard noisy samples to mitigate their negative impact, overlooking their potential value. 2) Lack of historical relevance. Unstable learning dynamics under noisy labels lead to inconsistent outputs, yet current approaches ignore the valuable historical information embedded in these fluctuations. Focusing on these challenges in VI-ReID, we propose Self-Rectification Historical Consistency Learning (SRHCL) for VI-ReID, which consists of noise detection, self-refined label rectification, and historical consistency learning modules. Firstly, the noise detection module calculates confidence weights for each sample by modeling the model’s loss response, thereby mitigating the adverse impact of noisy samples in subsequent training phases. Secondly, we propose a self-refined label rectification module to rectify noisy labels by reliable historical predictions, progressively collating the training data at fixed intervals. Finally, we introduce cross-modal contrastive learning and early learning regularization based on momentum-updated memories to facilitate historical consistency learning. 
Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate the robustness and effectiveness of our method across varying noise ratios.

Abstract:
Although multi-view unsupervised feature selection is a promising technique for dimensionality reduction on unlabeled multi-view data, existing methods cannot directly address incomplete data, where certain samples are missing in specific views. These methods typically begin by imputing missing data using predetermined values, followed by performing feature selection on the completed dataset. However, the separation of imputation and feature selection processes fails to exploit their inherent synergy, as local structural information obtained from feature selection could guide the imputation process and, in turn, improve the overall effectiveness of feature selection. In addition, previous methods rely on similarity graphs based on Euclidean distance to preserve the local manifold structure but overlook the topological relationships within the data, thereby hindering accurate capture of intrinsic structures. In this paper, we propose an adaptive topological similarity learning for incomplete multi-view unsupervised feature selection method (ATSL-IMUFS) to address the aforementioned issues. ATSL-IMUFS first integrates multi-view feature selection and missing data imputation into a unified learning framework. Then, it adaptively learns similarity graphs for each view while simultaneously capturing the consensus topological relationship across views, effectively characterizing the local manifold structure. Extensive experiments conducted on real-world datasets demonstrate the superior performance of ATSL-IMUFS compared to competing methods.

Abstract:
Byzantine-robust federated learning aims to maintain resilient performance in the presence of malicious attacks that can impede the convergence of learning algorithms. Although numerous robust aggregators have been developed to merge the collected gradient information at the server, they either require data homogeneity and are suboptimal for heterogeneous data, or their breakdown points (the smallest proportion of outliers that can make the aggregators fail) are either not theoretically analyzed or less than 0.5. In contrast to existing aggregators, this paper formulates the aggregation process as a low-rank plus sparse decomposition model, where the low-rank component, with a rank of one, facilitates accurate gradient computation, while the sparse component, penalized by the $\ell_{2,0}$-norm, mitigates the impact of outliers. We prove that the devised rule achieves the maximum breakdown point of 0.5. Besides, we apply our aggregation rule to Byzantine-robust federated learning and employ Polyak’s momentum to reduce gradient variance among honest workers. We show that our aggregator achieves order-optimal Byzantine-resilient federated learning for heterogeneous data. Experimental results using MNIST, Fashion-MNIST and CIFAR-10 demonstrate that the developed approach yields higher classification accuracy than the competing aggregators under different attack types and heterogeneity levels.
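A rough sketch of the rank-one plus row-sparse idea is given below. It is not the paper's algorithm. It assumes the rank-one part takes the form of one shared gradient `g` repeated for every worker, replaces the penalty with a hard budget `num_outliers` on the number of nonzero rows of the sparse part, and uses a simple alternating scheme started from the coordinate-wise median. All function and parameter names are hypothetical.

```python
import math

def coordinate_median(rows):
    """Coordinate-wise median, used here only as a robust starting point (an assumption)."""
    med = []
    for j in range(len(rows[0])):
        col = sorted(r[j] for r in rows)
        m = len(col)
        med.append(col[m // 2] if m % 2 else 0.5 * (col[m // 2 - 1] + col[m // 2]))
    return med

def robust_aggregate(grads, num_outliers, iters=20):
    """Fit grads ~ ones * g^T + S, where S has at most num_outliers nonzero rows."""
    n, dim = len(grads), len(grads[0])
    g = coordinate_median(grads)
    outliers = set()
    for _ in range(iters):
        # S-step: the rows with the largest residual norm are absorbed by the sparse part.
        dist = [math.dist(row, g) for row in grads]
        ranked = sorted(range(n), key=lambda i: dist[i], reverse=True)
        new_outliers = set(ranked[:num_outliers])
        # g-step: with those rows explained by S, the best shared gradient is the inlier mean.
        inliers = [grads[i] for i in range(n) if i not in new_outliers]
        g = [sum(r[j] for r in inliers) / len(inliers) for j in range(dim)]
        if new_outliers == outliers:
            break  # the support of S stopped changing
        outliers = new_outliers
    return g, sorted(outliers)

# Four honest workers near (1, 2) and two Byzantine workers sending arbitrary vectors.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
byzantine = [[50.0, -40.0], [-30.0, 60.0]]
aggregate, flagged = robust_aggregate(honest + byzantine, num_outliers=2)
```

In this toy run the two Byzantine rows (indices 4 and 5) are absorbed by the sparse part, and `aggregate` is the mean of the honest gradients. The budget `num_outliers` must stay below the number of workers.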

Abstract:
Since the general representations produced by pretrained feature extractors are often insensitive to intra-class variations, existing anomaly detection methods that store them directly in memory banks are constrained in performance. In addition, the distribution discrepancy among different modalities may cause cross-modal interference, further weakening the ability to discriminate anomalies. To address these issues, we propose an Adaptive Prototype Guidance Network (APG-Net) for multi-sensor anomaly detection. First, to avoid cross-modal feature interference, we construct independent anomaly detection branches for multi-sensor data including RGB images, point clouds and infrared images. Then, we introduce a non-parametric feature space reshaping paradigm for each modality. This paradigm adds no additional trainable parameters, ensuring efficiency and ease of deployment. It first learns guiding prototypes from normal samples and matches them to the general representations produced by pretrained extractors. Subsequently, a two-stage adaptive prototype guidance strategy is applied to reshape the feature distributions. This strategy enlarges the separation between normal and anomalous features in the feature space. Finally, we perform decision-level fusion to integrate the anomaly detection strengths from all sensor data. Extensive experiments demonstrate that our method achieves an object-AUROC of 97.4% on the MulSen-AD multi-sensor anomaly detection benchmark, surpassing previous state-of-the-art approaches.

Abstract:
In Uncrewed Aerial Vehicle (UAV) tracking, discriminative correlation filters (DCF) are popular for their speed, making them ideal for real-time use with limited resources. Recently, lightweight convolutional neural networks (CNNs) have offered a new approach. Through filter pruning, these CNNs maintain high accuracy and efficiency, making them a strong alternative, especially for greater precision. Despite these advancements, the potential of pure vision transformers (ViTs) in UAV tracking remains largely untapped, especially based on the paradigm of conditional computation. In this work, we introduce an adaptive and background-aware Vision Transformer (Aba-ViT) and leverage it to develop a real-time UAV tracking framework called Aba-ViTrack. The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time. This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones. To further improve efficiency, this paper proposes a novel classroom-style learning (CSL) approach, where robustness knowledge is transmitted vertically from teacher to students, and generalization capability is enhanced through horizontal mutual learning among students. This method is used to compress Aba-ViTrack, resulting in Aba-ViTrack++. The upgraded version achieves a better balance between accuracy and efficiency in real-time UAV tracking. Extensive experiments on six UAV tracking benchmarks demonstrate that the proposed method achieves state-of-the-art performance in UAV tracking. The code is available at https://github.com/xyyang317/Aba-ViTrack

Abstract:
Hyperspectral image (HSI) reconstruction algorithms are fundamental to coded aperture snapshot spectral imaging (CASSI) systems. Recently, deep unfolding networks (DUNs) have emerged as a dominant solution, seamlessly combining traditional optimization frameworks with the strengths of deep learning. Among these, Mamba stands out as a prominent method for modeling long-range dependencies. However, its reliance on one-dimensional (1D) spatial scanning often compromises spectral consistency and spatial coherence, leading to misalignment of neighboring pixels within sequences. To address these limitations, we propose a novel multi-view framework based on 2D-slice modeling, which ensures spatial-spectral continuity in 1D sequences while maintaining computational efficiency. Furthermore, motivated by the need for precise local patch modeling in 2D images, we develop a 3D-cube Mamba model for HSI reconstruction. By integrating the UNet architecture, this model enhances spatial and spectral detail representation through multi-scale receptive field modeling, using fixed cube sizes to dynamically adjust pixel distances. These advancements are incorporated into the A-HQS-accelerated deep unfolding framework, synergistically combining the strengths of 2D-slice and 3D-cube MambaNet to achieve state-of-the-art HSI reconstruction performance. Experimental evaluations on simulated and real-world CASSI datasets demonstrate the efficacy of the proposed approach, achieving superior spectral fidelity and detailed feature representation. The source code is available at: https://github.com/fengyuchao97/SCM-DUN

Abstract:
Recent research on Self-Supervised Learning (SSL) has demonstrated its ability to extract high-quality representations from unlabeled samples. However, in continual learning scenarios where training data arrives sequentially, SSL’s performance tends to deteriorate. This study focuses on Continual Contrastive Self-Supervised Learning (CCSSL) and highlights that the absence of inter-task contrastive learning, due to the unavailability of historical samples, leads to a significant drop in performance. To tackle this issue, we introduce a simple and effective method called BGE, which Bridges the inter-task Gap of CCSSL using External data from publicly available datasets. BGE enables the contrastive learning of each task data with external data, allowing relationships between them to be passed along the tasks, thereby facilitating implicit inter-task data comparisons. To overcome the limitation of the external data selection and maintain its effectiveness, we further propose the One-Propose-One algorithm to collect more relevant and diverse high-quality samples from external sources while filtering out distractions from the out-of-distribution data. Experiments show that BGE can generate better discriminative representation in CCSSL, especially for inter-task data, and improve classification results with various external data compositions. Additionally, BGE can be seamlessly integrated into existing continual learning methods, yielding significant performance improvement.

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; DSLAB, School of Information Science and Engineering, Lanzhou University, Lanzhou, China; School of Mathematical Science, Jiangsu University, Zhenjiang, China; MoE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, China; Chinese Academy of Agricultural Sciences, Agricultural Information Institute, Beijing, China

Abstract:
Recently, patch deformation-based methods have demonstrated significant effectiveness in multi-view stereo due to their incorporation of deformable and expandable perception for reconstructing textureless areas. However, these methods generally focus on identifying reliable pixel correlations to mitigate matching ambiguity of patch deformation, while neglecting the deformation instability caused by edge-skipping and visibility occlusions, which may cause estimation deviations. To address these issues, we propose DVP-MVS++, an innovative approach that synergizes both depth-normal-edge aligned and harmonized cross-view priors for robust and visibility-aware patch deformation. Specifically, to avoid edge-skipping, we first apply DepthPro, Metric3Dv2 and the Roberts operator to generate coarse depth maps, normal maps and edge maps, respectively. These maps are then aligned via an erosion-dilation strategy to produce fine-grained homogeneous boundaries for facilitating robust patch deformation. Moreover, we reformulate view selection weights as visibility maps, and then implement both an enhanced cross-view depth reprojection and an area-maximization strategy to help reliably restore visible areas and effectively balance deformed patches. Additionally, we obtain geometry consistency by adopting both aggregated normals via view selection and projection depth differences via epipolar lines, and then employ SHIQ for highlight correction to improve highlight perception, thus improving reconstruction quality during the propagation and refinement stages. Evaluations on the ETH3D, Tanks & Temples and Strecha datasets demonstrate the state-of-the-art performance and robust generalization capability of our proposed method.
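The Roberts operator named in this abstract is a standard 2x2 diagonal-difference edge detector, and a minimal sketch of it follows. It covers only the raw gradient magnitude. Any thresholding, and the erosion-dilation alignment with depth and normal maps, are specific to the paper and not reproduced here. The function name and the list-of-rows image layout are assumptions.

```python
import math

def roberts_edges(image):
    """Gradient magnitude from the 2x2 Roberts cross kernels.

    image is a list of equal-length rows of floats. The output has one row
    and one column fewer than the input, since each value uses a 2x2 window.
    """
    height, width = len(image), len(image[0])
    edges = []
    for r in range(height - 1):
        row = []
        for c in range(width - 1):
            gx = image[r][c] - image[r + 1][c + 1]  # kernel [[1, 0], [0, -1]]
            gy = image[r][c + 1] - image[r + 1][c]  # kernel [[0, 1], [-1, 0]]
            row.append(math.hypot(gx, gy))
        edges.append(row)
    return edges

# A 3x4 image with a vertical step edge between columns 1 and 2.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
edge_map = roberts_edges(step)
```

For this step image the response is nonzero only in the middle column of `edge_map`, where the 2x2 window straddles the intensity jump.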

Abstract:
The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-Embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: 1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; 2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; 3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models’ predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal

Abstract:
Partial multi-label image classification (PMLIC) learns from typical weak supervision, where each image is labeled with a set of candidate labels, only some of which are correct. We find that noisy labels generate conflicting gradient signals that disrupt the learning of latent true labels, causing the model to prefer learning clean negative labels that provide consistent supervisory signals, thereby hindering disambiguation. Meanwhile, noisy labels cause the model to activate misattributed pixel regions, which interfere with feature pattern extraction, leading to inaccurate label correlation. In this paper, we propose a PMLIC framework that constructs a correlation-induced negative suppression disambiguation loss (CoNeS). First, we exploit the property that networks tend to learn clean labels first by extracting class activation maps to identify and screen misattributed pixel regions. Meanwhile, we aggregate noise-disturbed feature patterns into more expressive representations via k-means clustering and construct accurate label correlations to aid disambiguation. In addition, we design the negative suppression disambiguation loss to focus the model on disambiguation by introducing a weight distribution to suppress the contribution of negative labels. This weighting distribution can be adaptively inferred by a closed-form solution. Extensive experiments demonstrate that the CoNeS framework achieves significant advantages over current state-of-the-art methods. Specifically, it achieves average mAP improvements of 1.26%, 2.74%, 0.85%, and 0.33% on the VOC 2007, MS-COCO, VG-256, and CUB-200 datasets at different resolutions and noise rates. Code has been made available at https://github.com/zhongjingyu1/CoNeS

Abstract:
Exemplar-free Class-Incremental Learning (EFCIL) poses a significant challenge in mitigating catastrophic forgetting, due to the absence of exemplars. Recently, analytic learning-based methods propose a recursive alignment procedure to execute EFCIL in a phase-invariant manner and show state-of-the-art performance. However, they heavily rely on a frozen feature extractor trained with the initial dataset to avoid the misalignment between feature and label spaces, ignoring the importance of acquiring generalizable features across incremental tasks for performance improvement. To tackle this, we rethink the obscured sub-optimality of analytic learning-based methods, particularly through empirical reevaluation, and then introduce the Multi-head analytic learning (Muheal) approach. Muheal forms the multi-head model with a delicate feature extractor, thereby introducing a feature optimization procedure and a forgetting compensation module to balance the learning and forgetting. Specifically, within the feature optimization procedure, the feature extractor seeks to learn more generalizable features in a self-supervised manner using the fully-connected classification head. An analytic learning-based classification head follows to align the feature-label space. Additionally, we employ the compensation module to generate and align pseudo-features with a replicated analytic head, thus preventing overfitting and testing. Comprehensive experiments on several benchmark datasets have demonstrated that Muheal significantly outperforms existing state-of-the-art EFCIL methods and is comparable, if not superior, to methods that use replay techniques.

Abstract:
Underwater image quality assessment (UIQA) is a critical research area, challenged by underwater conditions such as wavelength-dependent light attenuation, scattering, and non-uniform illumination. Existing deep learning-based UIQA methods often address these degradations in isolation, neglecting their complex interplay with human perception and lacking explicit modeling of underwater optical phenomena. To address this, we propose PhysIQ-Net, a novel framework that integrates physics-driven principles with progressive multi-prior interaction modeling through three key innovations. First, we introduce a dual physics-based decomposition that separates images into Backscatter, Transmission, Reflectance, and Illuminance components to capture distinct degradation mechanisms. Second, we propose prior-guided dynamic filtering that adapts convolutional kernels to image-specific content using physical priors. Third, we propose physics-informed Cross-Domain Feature Interaction that enables bidirectional collaboration between color-aware and structure-aware representations to model their perceptual inter-dependencies. Extensive experiments across multiple benchmark datasets demonstrate that PhysIQ-Net significantly outperforms existing methods, with ablation studies validating each component’s contribution, providing a robust solution for UIQA.

Abstract:
The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, a large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor showcases state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.

Abstract:
Source-free domain adaptation (SFDA) involves training a model on a source domain and then applying it to a related target domain without access to the source data and labels during adaptation. The complexity of scene information and the lack of source-domain data make SFDA a difficult task. Recent studies have shown promising results, but many approaches to domain adaptation concentrate on domain shift and neglect the effects of negative transfer, which may impede improvements in model performance during adaptation. To address this issue, we propose a novel framework, the Attention Residual Fusion Network (ARFNet), based on contrastive learning for SFDA, to alleviate negative transfer and domain shift during the adaptation process, in which attention residual fusion, global-local attention contrast, and dynamic centroid evaluation are exploited. Concretely, the attention mechanism is first exploited to capture the discriminative region of the target object. Then, in each block, attention features are decomposed into spatial-wise and channel-wise attentions. The spatial-wise attentions are aggregated with the original semantic features to progressively achieve cross-layer attention residual fusion, while the channel-wise attentions are exploited for self-distillation. During adaptation, we contrast global and local representations to improve the perceptual capabilities for different categories, which enables the model to discriminate between inter-class and intra-class variations. Finally, a dynamic centroid evaluation strategy is exploited to evaluate the trustworthy centroids and labels for self-supervised self-distillation, which aims to accurately approximate the center of the source domain and pseudo-labels to mitigate domain shift. To validate the efficacy of our method, we conduct comprehensive experiments on five benchmarks of varying scales, i.e., Office-31, Office-Home, VisDA-C, DomainNet-126, and Cub-Paintings. Experimental outcomes indicate that our method surpasses other techniques, attaining superior performance across SFDA benchmarks. Code is available at https://github.com/RoryShao/ARFNet.git

Abstract:
Self-supervised pre-training has been shown to effectively learn transferable representations from unlabeled images in many visual tasks. However, existing self-supervised pre-training methods lack sufficient context-awareness and struggle to obtain fine-grained facial representations, resulting in weak generalization across various facial analysis tasks. To address this issue, we propose a Context-Aware Masked Distillation method, termed CAMD, to effectively learn general facial representations for fine-grained facial analysis tasks. The CAMD method first designs an innovative local-to-global masked image modeling framework to learn the contextual spatial structures and semantic relationships between local and global features, enabling effective self-supervised pre-training. In this framework, our pre-training task predicts the dense global feature representations based on the visible local feature representations after masking, so as to achieve semantic alignment across local and global views and significantly enhance spatial sensitivity. Moreover, the CAMD method leverages an attention-driven cross-view hierarchical distillation module to fully distill the features of related regions between different encoder layers of the online and target encoders. This module can learn contextual dependencies and capture discriminative fine-grained facial feature representations. Our method is evaluated on multiple downstream facial analysis tasks, including face alignment, face parsing, facial attribute recognition, facial expression recognition, and head pose estimation, achieving state-of-the-art results on all of them and exhibiting strong generality and effectiveness. The code is available at: https://github.com/mumumu-wss/CAMD

Abstract:
Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.

Abstract:
Leveraging video pre-trained models for video downstream tasks has recently emerged with promising performance. Apart from the full fine-tuning paradigm, parameter-efficient transfer learning (PETL) is a promising alternative that has not yet been fully explored in video-to-video transfer learning. While current PETL approaches succeed in reducing parameter count and computation cost, they overlook the critical spatiotemporal properties of the video modality. In this paper, we first propose a novel metric to quantify the spatiotemporal information bias across video datasets and uncover its impact on transfer preferences through systematic analysis. Based on the analysis, we introduce an innovative parameter-efficient transfer learning method, named Adaptive SpatioTemporal Adapter (AST-Adapter). Our approach automatically adjusts layer-wise architectures with different spatiotemporal adapter modules to exploit the intrinsic properties of downstream tasks and achieve adaptive spatiotemporal learning, thus delivering robustness and generalization. Extensive experiments on five datasets across action recognition and action detection tasks show that AST-Adapter surpasses both video-to-video and image-to-video approaches, whilst keeping the advantage of parameter efficiency. Notably, AST-Adapter achieves 89.9% on Kinetics-400, 80.3% on HMDB51 and 97.7% on UCF101 while introducing only 1% to 10% tunable parameters. Our code is available at https://github.com/hhhhhpy/AST-Adapter

Abstract:
Low-resolution visible-infrared image fusion and super-resolution (LRVIF) are critical for enhancing image quality in low-resolution scenarios, yet limited information in the input images often constrains performance. To address these challenges, we propose SaDiff, a spatially-aware adaptive diffusion model that introduces diffusion processes into LRVIF for the first time, representing a major breakthrough in the field. Leveraging the generative capabilities of diffusion models, our approach unifies and enhances image fusion and super-resolution within a cohesive framework. A key component of SaDiff is the Spatial Residual Adaptation Block, which extends the diffusion process by dynamically adapting feature representations to spatial variations in the local regions of the input images. This module maximally preserves crucial information from the input images, such as texture details and contrast, while effectively suppressing noise, ensuring robust and context-aware feature refinement. Then we further propose Direct Diffusion Synthesis, a novel mechanism that utilizes noise predictions during diffusion to generate fused images, enabling joint training of the fusion and super-resolution networks. Additionally, a Cross-Feature Fusion Module integrates texture and contrast details, producing super-resolution fused images with improved clarity and structural integrity. Extensive experiments show that SaDiff achieves state-of-the-art performance, offering a robust and unified solution to infrared-visible image fusion and super-resolution. The code for the proposed method will be made available at https://github.com/guobaoxiao/SaDiff.

Abstract:
Accurate 6D object pose estimation is essential for robotic manipulation and augmented reality applications. Existing methods typically require extensive training for new objects, limiting their effectiveness in dynamic environments where new objects are frequently introduced. In this paper, we propose FreePose, an efficient training-free zero-shot 6D pose estimation method leveraging pre-trained visual and geometric foundation models. Our approach includes an offline onboarding stage, in which multiple viewpoint templates of a reference object are rendered, and visual and geometric features are then extracted using visual and geometric pretrained models, respectively. These visual features are then back-projected onto corresponding 3D points, enabling a precise alignment between appearance and geometry, and subsequently fused with geometric features to form a robust unified representation. During the inference stage, target object instances are segmented from the RGB-D image using SAM2 coupled with an object-matching algorithm. Visual features of each target instance are similarly extracted, back-projected, and fused with geometric features. Robust 3D-3D correspondences are then established using nearest-neighbor search. Finally, the pose estimate is obtained using the TEASER registration algorithm. Extensive evaluations conducted on the BOP5 core datasets show that our approach achieves results comparable to state-of-the-art methods. To highlight the effectiveness and potential of FreePose in real-world scenarios, FreePose is deployed on a real UR3 robot to perform grasping experiments, reaching a grasp success rate of 65.0%.
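The nearest-neighbor correspondence step described above can be sketched as a brute-force search in descriptor space. This is an illustration under assumptions, not FreePose's code: the two-dimensional descriptors, the sample points, and the function name are all hypothetical, and the subsequent registration step (TEASER in the abstract) is not shown.

```python
import math

def nearest_neighbor_matches(target_feats, template_feats):
    """For each target descriptor, return the index of the closest template descriptor."""
    matches = []
    for feat in target_feats:
        best = min(
            range(len(template_feats)),
            key=lambda j: math.dist(feat, template_feats[j]),
        )
        matches.append(best)
    return matches

# Hypothetical fused descriptors, each attached to a 3D point.
template_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
template_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
target_feats = [[0.9, 0.1], [0.1, 0.8]]
target_points = [(0.5, 0.2, 1.0), (0.4, 0.3, 1.0)]

matches = nearest_neighbor_matches(target_feats, template_feats)

# Each pair links an observed 3D point to a model 3D point; a registration
# algorithm would estimate the rigid pose from pairs like these.
correspondences = [(target_points[i], template_points[j]) for i, j in enumerate(matches)]
```

A brute-force search is quadratic in the number of points. A practical system would likely use a spatial index or an approximate search, and would rely on the registration step to reject wrong matches.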

Abstract:
Incomplete multi-view clustering is a prominent research area in multimedia. Among various techniques, self-representation-based approaches have gained attention for effectively capturing global data structures. However, most methods assume different views share a common self-representation matrix, overlooking view-specific characteristics and cross-view complementarity. To address this limitation, we propose a novel incomplete multi-view clustering model, Joint Shared and Private Self-representation Learning (JSPSL), which decomposes the self-representation matrices into shared and private components with mutually exclusive constraints. JSPSL unifies missing view completion and self-representation learning within a single framework, enabling mutual reinforcement. We apply ADMM to efficiently solve our model. Extensive experiments demonstrate that JSPSL consistently outperforms state-of-the-art algorithms.

Abstract:
Weakly-supervised person search presents significant challenges when relying solely on bounding-box annotations, particularly due to inter-class confusion from clothing similarity and intra-class variations caused by illumination changes, which severely degrade cross-view matching accuracy. Existing clustering-based methods, constrained by their heavy dependence on color features, frequently produce unreliable pseudo-labels that ultimately limit model performance. To overcome these limitations, we present Segment Anything Model-based Semantic-Interactive Clustering Optimization (SAM-SICO), a novel framework that integrates the Segment Anything Model’s semantic segmentation capability with adaptive clustering optimization for weakly-supervised person search. Our framework harnesses the representational power of the Segment Anything Model (SAM) to enable detector-free semantic feature learning while significantly improving clustering precision. The proposed solution makes three key advances: the Semantic Contour Embedding (SCE) module leverages SAM’s zero-shot segmentation capability to produce highly accurate human body masks; the Relation-driven Semantic Feature Interaction (RSFI) mechanism effectively mitigates clothing-color bias through innovative dynamic affinity matrix construction across multiscale semantic masks and visual features; and the Adaptive Clustering Optimization (ACO) algorithm introduces parameter adaptation to optimize intra-class compactness and inter-class separation metrics. Experimental results show that our method outperforms existing state-of-the-art approaches on the PRW and CUHK-SYSU datasets. The source code is available at https://github.com//HawlsonZ/SAM-SICO

Abstract:
With the rapid development of artificial intelligence, deep neural networks (DNNs) have become valuable digital assets, thereby highlighting the urgent need for copyright protection and secure transmission. Although traditional model watermarking and active defense techniques offer partial protection against unauthorized use, they often suffer from limited imperceptibility and may degrade model performance. To overcome these challenges, this paper proposes ReFHD-Net, a reversible functionality hiding framework for DNNs based on a structured mask matrix. Here, reversible functionality hiding refers to the ability to hide the functionality of a secret task within the stego model during transmission and enable its lossless recovery by authorized users at the receiver side. Specifically, ReFHD-Net employs a two-stage strategy to hide the secret functionality within a carrier model. In the first stage, a multi-task learning framework enhanced with homoscedastic uncertainty is employed to jointly train the model on both public and secret tasks. In the second stage, the model parameters are further optimized using a combination of task-driven loss and parameter distribution regularization, which limits parameter deviations caused by the hiding process and enhances the imperceptibility of the secret task. Experimental results on image classification and denoising benchmarks validate the superiority of ReFHD-Net. It achieves an average degradation of only 0.27% on the public task and enables lossless recovery of the secret task with no performance drop. Moreover, our framework exhibits strong robustness and security against various unauthorized recovery attempts, including random guessing, fine-tuning, and model pruning.
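
As an illustration of the first-stage idea, the following is a minimal sketch of homoscedastic-uncertainty weighting for multi-task losses. It uses a commonly seen simplified form, total = sum_i (exp(-s_i) * L_i + s_i), where s_i is a learnable log-variance per task. The abstract does not give the exact formulation used by ReFHD-Net, so the function name, the parameterization, and the example values are assumptions.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with learnable log-variances s_i = log(sigma_i^2).

    Assumed simplified form: total = sum_i (exp(-s_i) * L_i + s_i).
    A larger s_i down-weights task i, while the additive s_i term
    penalizes inflating the variance without bound.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(
        math.exp(-s) * loss + s
        for loss, s in zip(task_losses, log_variances)
    )


# Hypothetical example: a public-task loss and a secret-task loss.
public_loss, secret_loss = 2.0, 4.0
total = uncertainty_weighted_loss([public_loss, secret_loss], [0.0, math.log(2.0)])
print(total)
```

In a real training loop the log-variances would be trainable parameters updated together with the network weights; here they are plain floats to keep the sketch self-contained.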

Abstract:
Pathway detection is highly significant for fields such as aircraft landing, autonomous driving, and terrain mapping. Passive millimeter-wave (PMMW) imaging technology, with its all-weather operational capability and excellent penetration through fog and clouds, has shown great potential in pathway detection. However, most existing studies rely on brightness temperature (TB) differences for detection, which are highly susceptible to interference from environmental radiation. Polarization features can effectively characterize the object and its environmental properties. This paper proposes a physics-based pathway detection method that utilizes feature-level fusion technology, fusing the polarization features of the first two Stokes parameters T_I and T_Q, as well as the degree of linear polarization (DoLP) and the angle of polarization (AoP). By introducing spatial similarity (SS), the method effectively excludes interference from non-pathway areas and improves the detection accuracy. After confirming the horizon position as the vanishing point of the pathway, the detection results are fused with a region growing algorithm to generate the final pathway extraction output. Experimental results demonstrate that this method can distinguish between pathway and non-pathway regions and accurately detect various types of pathways, such as rivers, roads, and seas. Quantitative analysis shows that, compared to existing methods, the proposed method has significant advantages in terms of detection accuracy and robustness.
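
For readers unfamiliar with the polarization features named above, the sketch below computes DoLP and AoP for a single pixel from the standard Stokes definitions, DoLP = sqrt(Q^2 + U^2) / I and AoP = 0.5 * atan2(U, Q). The abstract names only T_I and T_Q, so the third component `t_u` and the function name are assumptions made for illustration, and the paper's own feature computation may differ.

```python
import math


def linear_polarization_features(t_i, t_q, t_u):
    """Return (DoLP, AoP in radians) for one pixel.

    t_i, t_q, t_u are the first three Stokes parameters expressed as
    brightness temperatures; t_u is an assumed input not named in the
    abstract.
    """
    if t_i <= 0:
        raise ValueError("total intensity t_i must be positive")
    dolp = math.hypot(t_q, t_u) / t_i   # fraction of linearly polarized power
    aop = 0.5 * math.atan2(t_u, t_q)    # orientation of the polarization ellipse
    return dolp, aop


# Hypothetical pixel values, chosen only to exercise the function.
dolp, aop = linear_polarization_features(10.0, 3.0, 4.0)
print(dolp, aop)
```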

Abstract:
Recent neural models for video captioning are typically built using a framework that combines a pre-trained visual encoder with a large language model (LLM) decoder. However, LLMs in video captioning often generate non-existent entities, known as object hallucinations, which severely limit performance. To mitigate object hallucinations, two key issues remain: 1) biased training data and knowledge bias in LLMs lead models to generate hallucinations; 2) current methods focus on removing hallucinated content rather than restoring the correct visual content, reducing caption completeness. To address these issues, we propose a visual evidence-aware approach for object hallucination rectification in LLM-based video captioning. In general, our model aims to diagnose and correct generated object hallucinations, and then supplement missing visual content by constraining the text generation process. Specifically, we first generate captions word by word based on the input video. When decoding each object description, the decoder utilizes visual features for hallucination diagnosis and correction, proposing visual evidence to modify hallucinatory descriptions. This process ensures that the generated captions align with the visual content, alleviating the generation of object hallucinations. Compared with the baseline models, our method achieves state-of-the-art performance in video captioning, in particular avoiding the neglect of objects in the visual content caused by generated hallucinatory descriptions.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computational complexity increases significantly with video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show that our method achieves new state-of-the-art results on four referring video segmentation benchmarks. Notably, our method shows improvements of 11.3% and 8.3% on the MeViS val^u and val datasets, respectively. The code is available at https://github.com/cilinyan/LTCA
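
The sparse attention pattern described above (a dilated local window plus a few randomly selected global keys) can be sketched as a boolean frame-to-frame mask. This is only an illustration under stated assumptions: the function name, the parameters `half_window`, `dilation`, and `num_random`, and the frame-level granularity are all hypothetical, and the actual LTCA implementation (including its global query) is not specified in the abstract.

```python
import random


def sparse_temporal_mask(num_frames, half_window=1, dilation=2, num_random=1, seed=0):
    """Build mask[q][k] = True where query frame q may attend to key frame k.

    Each query sees a dilated local window (q + i * dilation for
    i in [-half_window, half_window]) plus `num_random` extra keys drawn
    from the remaining frames, which stand in for the global key pool.
    """
    rng = random.Random(seed)
    mask = [[False] * num_frames for _ in range(num_frames)]
    for q in range(num_frames):
        # Dilated local window, clipped to the valid frame range.
        for i in range(-half_window, half_window + 1):
            k = q + i * dilation
            if 0 <= k < num_frames:
                mask[q][k] = True
        # Randomly selected keys from frames outside the local window.
        candidates = [k for k in range(num_frames) if not mask[q][k]]
        for k in rng.sample(candidates, min(num_random, len(candidates))):
            mask[q][k] = True
    return mask


mask = sparse_temporal_mask(num_frames=8)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row keeps a constant number of allowed keys regardless of video length, which is the property that keeps attention cost from growing quadratically with the number of frames.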

Abstract:
This paper presents a novel approach for unsupervised video summarization using reinforcement learning (RL), addressing limitations like unstable adversarial training and reliance on heuristic-based reward functions. The method operates on the principle that reconstruction fidelity serves as a proxy for informativeness, correlating summary quality with reconstruction ability. The summarizer model assigns importance scores to frames to generate the final summary. For training, RL is coupled with a unique reward generation pipeline that incentivizes improved reconstructions. This pipeline uses a generator model to reconstruct the full video from the selected summary frames; the similarity between the original and reconstructed video provides the reward signal. The generator itself is pre-trained self-supervisedly to reconstruct randomly masked frames. This two-stage training process enhances stability compared to adversarial architectures. Experimental results show strong alignment with human judgments and promising F-scores, validating the reconstruction objective. The code for this project will be available online at https://github.com/mehryar72/TR-SUM
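
The reward signal described above can be illustrated with a small sketch that scores a reconstruction by its average per-frame similarity to the original video. Cosine similarity over frame feature vectors is an assumption chosen for illustration, as are the function names; the abstract does not state which similarity measure or feature space the method uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_reward(original_frames, reconstructed_frames):
    """Mean per-frame similarity between original and reconstructed features.

    A higher value means the selected summary let the generator rebuild
    the video more faithfully, so the summarizer receives a larger reward.
    """
    if not original_frames or len(original_frames) != len(reconstructed_frames):
        raise ValueError("expected two non-empty sequences of equal length")
    sims = [
        cosine_similarity(o, r)
        for o, r in zip(original_frames, reconstructed_frames)
    ]
    return sum(sims) / len(sims)


# Hypothetical 2-D frame features for a two-frame video.
original = [[1.0, 0.0], [0.0, 1.0]]
reconstructed = [[1.0, 0.0], [1.0, 0.0]]
print(reconstruction_reward(original, reconstructed))
```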

Abstract:
Effective segmentation of unseen categories in zero-shot semantic segmentation is hindered by models’ limited ability to interpret edges in unfamiliar contexts. In this paper, we propose EdgeCLIP, which addresses this by integrating CLIP with explicit edge-awareness. Based on the premise that edge variation patterns are similar across both seen and unseen class objects, EdgeCLIP introduces the Contextual Edge Sensing module. This module accurately discerns and utilizes edge information, which is crucial in complex border areas where conventional models struggle. Further, our Text-Guided Dense Feature Matching strategy precisely aligns text encodings with corresponding visual edge features, effectively distinguishing them from background edges. This strategy not only optimizes the training of CLIP’s image and text encoders but also leverages the intrinsic completeness of objects, enhancing the model’s ability to generalize and accurately segment objects in unseen classes. EdgeCLIP significantly outperforms the current state-of-the-art method, achieving an impressive margin of 17.5% on COCO-20^i. Our code is available at github.com/aqingaqinghh/EdgeCLIP

Abstract:
Adverse natural weather conditions frequently cause substantial performance degradation in outdoor vision systems, underscoring the critical importance of research on image restoration techniques. Employing a unified set of network parameters to restore degraded images across diverse weather conditions has emerged as a key research direction in the field of image restoration. In this work, we propose MUIRF, a Mixture-of-Experts (MoE)-driven unified image restoration framework for multiple adverse weather conditions. Specifically, our technical contribution includes a novel channel-level parameter sharing strategy guided by a shallow-feature-based MoE (CPSM). This fine-grained parameter sharing strategy adaptively selects convolution weight channels for cross-task sharing based on the input image, enabling the network to accurately capture weather-general features, while the remaining channels encode weather-specific features corresponding to each weather condition. CPSM facilitates precise channel selection, thereby enhancing the robustness and accuracy of MUIRF during joint training across diverse image restoration tasks under varying weather conditions. Additionally, gradient conflicts inevitably arise in shared parameters due to the divergent optimization objectives across tasks. To address this challenge, we propose a meta-vector-guided gradient homogenization (MVGH) algorithm that mitigates inter-task gradient conflicts and improves image restoration quality. Comprehensive experimental evaluations demonstrate that our proposed network outperforms most state-of-the-art approaches, validating its superior performance and effectiveness.

Affiliations: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, China; School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China; College of Computer and Information Science, Southwest University, Chongqing, China; Department of Electronic and Electrical Engineering, Brunel University London, Uxbridge, London, U.K.; Institute of Artificial Intelligence and Robotic, Xi’an Jiaotong University, Xi’an, China

Abstract:
Deep unrolling networks (DUNs) have attracted substantial attention in the field of image compressed sensing (CS) due to their superior performance and good interpretability, which stem from recasting optimization algorithms as deep networks. However, existing DUNs suffer from low sampling efficiency, and their improvement in reconstruction quality relies heavily on large model complexity. To address these issues, we propose a lightweight Representation Sampling and Hybrid Transformer Network (RHT-Net). Firstly, we propose a Representation-CS (RCS) model to extract high-level features to achieve efficient sampling. This sampling strategy leads to highly dense, semantically rich, and extremely compact features without observing the original pixels, which also reduces the cross-domain loss during iteration. Secondly, we design a Tri-Scale Sparse Denoising (TSSD) module in the deep unrolling stages to extend sparse proximal projections, leveraging multi-scale auxiliary variables to enhance multi-feature flow and memory effects. Thirdly, we develop a hybrid Transformer module that includes a Global Cross Attention (GCA) block and a Window Local Attention (WLA) block, using the measurements to cross-estimate the reconstruction error, thereby generating finer spatial details and improving local recovery. Experiments demonstrate that the enhanced version of RHT-Net outperforms the current state-of-the-art methods by up to 1.17 dB in PSNR. The lightweight RHT-Net achieves a 0.43 dB gain while reducing model parameters by up to 22 times. The code will be released publicly at https://github.com/songhp/RHTNet

Abstract:
Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC

Abstract:
Accelerated MRI reconstruction poses a challenging ill-posed inverse problem due to the significant undersampling in k-space. Deep neural networks, such as CNNs and ViTs, have shown substantial performance improvements for this task while encountering the dilemma between global receptive fields and efficient computation. To this end, this paper explores selective state space models (Mamba), a new paradigm for long-range dependency modeling with linear complexity, for efficient and effective MRI reconstruction. However, directly applying Mamba to MRI reconstruction faces three significant issues: 1) Mamba typically flattens 2D images into distinct 1D sequences along rows and columns, disrupting k-space’s unique spectrum and leaving its potential in k-space learning unexplored. 2) Existing approaches adopt multi-directional lengthy scanning to unfold images at the pixel level, leading to long-range forgetting and high computational burden. 3) Mamba struggles with spatially-varying contents, resulting in limited diversity of local representations. To address these, we propose a dual-domain hierarchical Mamba for MRI reconstruction from the following perspectives: 1) We pioneer vision Mamba in k-space learning. A circular scanning scheme is customized for spectrum unfolding, benefiting the global modeling of k-space. 2) We propose a hierarchical Mamba with an efficient scanning strategy in both image and k-space domains. It mitigates long-range forgetting and achieves a better trade-off between efficiency and performance. 3) We develop a local diversity enhancement module to improve the spatially-varying representation of Mamba. Extensive experiments are conducted on three public datasets for MRI reconstruction under various undersampling patterns. Comprehensive results demonstrate that our method significantly outperforms state-of-the-art methods with lower computational cost. Code will be available at https://github.com/XiaoMengLiLiLi/DH-Mamba

Abstract:
Nighttime dehazing is a challenging image restoration task owing to the presence of non-uniform illumination, artificial light sources, and haze. Existing physics-based methods struggle to effectively adapt to complex real-world nighttime haze scenarios, and learning-based approaches also fail to generalize well in such environments. In this paper, we propose a novel two-stage real-world nighttime image dehazing framework using Score-guided Multi-scale Fusion and Dual-channel Enhancement, called SMFDE. In the first stage, we first apply gamma correction and dark channel prior-based operations to generate a series of intermediate improved images. Four haze-related features are then utilized to construct a score mechanism, and binary weight maps are derived by selecting the optimal score for each pixel. Subsequently, a multi-scale fusion strategy is employed to integrate all the intermediate images based on the binary weights to yield an initial result that effectively captures rich and informative features across multiple scales. In the dual-channel enhancement stage, two primary channels of the initial result, namely brightness and saturation, are further refined to enhance details and correct colors, respectively. For brightness enhancement, we first fuse the high-frequency texture component of the input image and the initial result to provide texture-enhanced brightness. Next, the edge component is extracted by calculating the difference between the sharpened and smoothed versions of the input image. The texture-enhanced brightness and the noise-free edge component are then fused to yield the enhanced brightness. For saturation adjustment, a non-local adaptive saturation adjustment algorithm, considering color similarity, is developed to enhance color vibrancy. Experiments on real-world datasets prove that SMFDE achieves superior performance in both visual quality and objective assessments. Our code is available at https://github.com/TaoLi-TL/SMFDE
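
The first-stage operations mentioned above build on two well-known primitives, gamma correction and the dark channel of an image. The sketch below shows both on a tiny RGB image stored as nested lists with values in [0, 1]. It is a plain illustration of the primitives only: the function names, the patch `radius`, and the sample image are assumptions, and SMFDE's scoring, weight maps, and multi-scale fusion are not reproduced here.

```python
def gamma_correct(image, gamma):
    """Apply per-channel gamma correction; gamma < 1 brightens dark regions."""
    return [[tuple(c ** gamma for c in px) for px in row] for row in image]


def dark_channel(image, radius=1):
    """Dark channel: minimum over RGB, then minimum over a square patch.

    The patch spans (2 * radius + 1) pixels per side and is clipped at
    the image border.
    """
    h, w = len(image), len(image[0])
    min_rgb = [[min(px) for px in row] for row in image]
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            dark[y][x] = min(min_rgb[j][i] for j in ys for i in xs)
    return dark


# Hypothetical 2x2 image used only to exercise the functions.
image = [
    [(0.25, 0.5, 1.0), (1.0, 1.0, 1.0)],
    [(1.0, 1.0, 1.0), (0.81, 0.9, 1.0)],
]
brightened = gamma_correct(image, 0.5)
print(brightened[0][0])
print(dark_channel(image, radius=0))
print(dark_channel(image, radius=1))
```

Running the same operations with several gamma values would yield a set of intermediate images of the kind that a fusion stage could then combine.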

Abstract:
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques, AI generated images (AIGIs) have attracted widespread attention, among which AI generated omnidirectional images (AIGODIs) hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications. AI generated omnidirectional images exhibit unique quality issues; however, research on the quality assessment and optimization of AI-generated omnidirectional images is still lacking. To this end, this work first studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs, and further presents a corresponding optimization process. Specifically, we first establish a comprehensive database to reflect human feedback for AI-generated omnidirectionals, termed OHF2024, which includes both subjective quality ratings evaluated from three perspectives and distortion-aware salient regions. Based on the constructed OHF2024 database, we propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images, named BLIP2OIQA and BLIP2OISal, respectively. Finally, based on the proposed models, we present an automatic optimization process that utilizes the predicted visual experience scores and distortion regions to further enhance the visual quality of an AI-generated omnidirectional image. Extensive experiments show that our BLIP2OIQA and BLIP2OISal models achieve state-of-the-art (SOTA) results in the human visual experience evaluation task and the distortion-aware saliency prediction task for AI generated omnidirectional images, and can be effectively used in the optimization process. The database and codes will be released on https://github.com/IntMeGroup/AIGCOIQA to facilitate future research.

Abstract:
Point cloud quality assessment (PCQA) is a challenging task due to the inherently disordered nature of points. Existing point-based methods, such as sparse convolution and PointNet, are limited by local spatial modeling and structural feature extraction. Although 3D graph convolutional networks (GCNs) offer advantages in capturing local structural features through explicit geometric modeling and deformable kernels, their scalability is hindered by the high memory consumption associated with storing neighborhood matrices, particularly for large-scale point clouds. In this paper, to better extract hierarchical structural information and maintain efficiency in computational memory, we propose a novel point-based no-reference PCQA method, namely cellular aggregation network (CANet). The method effectively and efficiently extracts the quality-aware features of large patches in a divide-and-conquer manner. Specifically, a cellular sampling (CS) module is introduced to divide large patches into smaller cells, effectively avoiding the problem of memory explosion. A cellular aggregation (CA) module is proposed to extract intra-cell features and fuse inter-cell features. Moreover, a global aggregation (GA) module is presented to extract global sketch information. Finally, a long-term fusion (LTF) module is introduced to capture long-term dependencies between the features of the CA and GA modules. Experimental results on benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance.

Abstract:
Due to the existence of domain shift, the detection accuracy of road damage detection will drop significantly when the training images and test images are from different scenes. To address this problem, we propose a domain-invariant feature enhancement domain adaptation method (DIFEDA) based on the You Only Look Once (YOLO) series object detectors. We integrate three domain-invariant feature decoupling (DIFD) modules on the backbone to decouple multi-scale domain-invariant features through two-stage adversarial learning. The decoupled features are fed back to the backbone to realize domain-invariant feature enhancement. We construct a spatial and frequency domain perception (SPFD) module in the DIFD to decouple local and global domain-invariant features from the spatial level and frequency level, respectively. We also design a region segmentation decoder (RSD) to make the DIFD pay more attention to the domain-invariant feature extraction in the damaged region, thereby suppressing the interference of background information. We apply DIFEDA to YOLOv8, YOLOv9, YOLO11, and YOLOv12, and conduct extensive experiments in three cross-scenes. The experimental results show that DIFEDA can significantly improve the performance of all baselines, with up to 8.5% and 9.8% improvement in mAP@0.5 and F1, respectively, proving the effectiveness and generalization of our method.

Abstract:
Large vision language models (LVLMs) have achieved rapid development. However, just like large language models (LLMs), LVLMs face the critical challenge of hallucination, which refers to the phenomenon that generated text containing references to or descriptions of the input image is incorrect or inconsistent. The causes of hallucinations are complex and therefore difficult to avoid directly during the generation process. To alleviate hallucinations, existing studies mainly employ an instruction-tuning approach that requires model retraining with specific data. Other methods use decoding constraints to penalize specific tokens during the decoding process. These approaches incur expensive annotation costs and computation burden. In this paper, we propose a framework named AQAH to alleviate hallucinations without relying on manual data and large-scale parameter tuning. AQAH compares multiple generated samples to locate the hallucination factors, and then asks questions about the uncertain information. Finally, the answers to the questions are used to add auxiliary information to the prompt to correct the hallucination of LVLMs during regeneration. To facilitate this process, we construct an automatic pipeline that involves training a small model for question generation, and an agent collaboration framework that includes the small question generation model and a large question answering foundation model. Since AQAH does not directly constrain the decoding, it does not cause a significant degradation in inference efficiency, nor does it force LVLMs to suffer the notorious problem of shortened text generation length. We experimentally demonstrate the effectiveness of AQAH in hallucination alleviation through the proposed “active questioning & answer verification” paradigm in various multimodal tasks such as captioning and visual question answering. Beyond its promising performance and lower training/inference time costs compared with other hallucination reduction methods, our method is highly interpretable and flexible, showing great potential in improving LVLMs by exploiting small-scale models. The code is available at https://github.com/bcxbg/AQAH

Abstract:
Few-shot Action Recognition (FSAR) aims to recognize novel actions from only a few labeled examples, posing challenges due to limited supervision and complex temporal dynamics. Existing methods often adopt a unified motion modeling strategy for both short- and long-term dynamics, overlooking the need to adapt motion pattern extraction to the specific temporal properties inherent to different timescales. This forces models to hedge against multi-scale relevance through exhaustive searches over temporal tuples, followed by heavy spatio-temporal fusion, which substantially increases parameters and computation and ultimately limits efficiency. To this end, we propose the efficient Temporal Consistency and Variation-Guided Spatio-Temporal Aggregation Network (TCV-STA), which comprises four key components: the Temporal Consistency Module (TCM), the Temporal Variation Module (TVM), the Spatio-Temporal Aggregation attention (STA), and the Shifted Window Temporal Attention (SWTA). The TCM captures stable motion patterns to suppress short-term perturbations and enhance temporal consistency for robust motion representation, while the TVM models dynamic motion patterns to highlight long-term variations that improve inter-class discriminability and facilitate intra-class alignment. Built upon these complementary motion cues, the STA selectively aggregates spatial and temporal representations under the guidance of the learned stable and dynamic motion patterns, avoiding global dense fusion. Finally, to address the limited receptive field and discontinuous modeling caused by frame grouping in TCM and TVM, we adapt a SWTA to capture longer-range temporal dependencies and ensure smooth transitions across subaction segments for few-shot action recognition. 
Experiments demonstrate that TCV-STA achieves competitive accuracy across four widely-used FSAR benchmarks while reducing parameters by up to 27.9% and computational cost by 21.3%, striking a favorable balance between accuracy and efficiency for deployment in resource-constrained scenarios.

Abstract:
Deep subspace clustering has demonstrated remarkable results by leveraging the nonlinear subspace assumption. However, it often encounters challenges in terms of computational cost and memory footprint in dealing with large-scale data due to its traditional single-batch training strategy. To address this issue, this paper proposes a deep subspace clustering framework that is regularized by nonlocal contrastive self-distillation, enabling a Deep Inductive and Scalable Subspace Clustering (DISSC) algorithm. In particular, our framework incorporates two subspace learning modules, namely subspace learning based on a self-expression model and inductive subspace clustering. These modules generate affinities from different perspectives by extracting intermediate features from two augmentations of the input data using a weight-sharing neural network. By integrating the concept of self-distillation, our framework effectively exploits the clustering-friendly knowledge contained in these two affinities through a novel nonlocal contrastive prediction task, employing an empirical yet effective threshold. This allows the framework to facilitate complementary knowledge mining and scalability without compromising clustering performance. With an alternate branch that bypasses the self-expression computation, our framework can infer subspace membership of the out-of-sample data through the predicted soft labels, eliminating the need for ad-hoc postprocessing. In addition, the self-expression matrix computed using mini-batch data benefits from the distilled knowledge obtained from the inductive subspace clustering module, enabling our framework to scale to data of arbitrary size. Experiments conducted on large-scale MNIST, Fashion-MNIST, STL-10, CIFAR-10 and Stanford Online Products datasets validate the superiority of the proposed DISSC algorithm over state-of-the-art subspace clustering methods.

Abstract:
The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model’s focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.

Abstract:
Pedestrian Trajectory Prediction (PTP) aims to predict the future trajectory of pedestrians based on a historical trajectory. Transformer-based approaches have demonstrated unparalleled performance for PTP tasks, encoding long-term temporal dependencies and heterogeneous spatial interactions of pedestrians. However, Transformers often involve redundant information and noisy interactions from irrelevant regions because they consider all available trajectory features. Recently, the structured state space model Mamba has been proposed, which captures long-range dependencies in sequences with a selective mechanism that filters out redundant information. To further tap into the potential of the novel Mamba architecture for the PTP task, in this paper, we present MambaPTP, which predicts future trajectories based purely on Mamba mechanisms, to mitigate the noisy interactions of irrelevant trajectory features and avoid repetitive trajectory modeling, while maintaining high-performance trajectory prediction. Specifically, we propose a new Bidirectional Gating Mamba (BGM) module with bidirectional state space models, which leverages a sparse gate mechanism to select informative temporal patterns and spatial interactions. Moreover, we design a Bidirectional Trajectory Alignment (BTA) module to align the predicted trajectory with the ground truth, ensuring that the model learns an effective sparse feature representation of trajectories. We conduct extensive experiments on several mainstream pedestrian trajectory prediction datasets. The results demonstrate that the proposed MambaPTP achieves competitive performance compared to advanced Transformer-based models. We hope this paper can further inspire research in Mamba for the PTP task, leading to a tighter integration of the Mamba and PTP communities.

Abstract:
Visual localization is essential in many vision-driven interaction domains, including automation, AR (augmented reality), and surgical navigation. This paper presents a visual marker specifically designed for deformable surface tracking. It has three advantages: 1) Detection of IFPs (inner false positives) is avoided through a grayscale integration design, leading to more robust results. 2) Because it does not rely on thresholds, our detection algorithm is more reliable on deformable surfaces with complex shading. 3) We use a position-sensing marker design with higher information density than the self-identifying marker to ensure the supply of features. In the experiments, our marker achieved superior localization accuracy and zero IFPs. Additionally, we present two compelling case studies showcasing the marker’s practical applications in augmented reality and surgical instrument tracking. Our work offers a significant advancement in visual localization, especially in challenging scenarios involving deformable surfaces, providing valuable solutions for researchers and developers across various application domains.

Abstract:
Hyperspectral image (HSI) super-resolution, which reconstructs a high-resolution HSI (HR-HSI) through hyperspectral and multispectral image fusion (HMIF) tasks that integrate a low-resolution HSI (LR-HSI) with a high-resolution multispectral image (HR-MSI), has emerged as a promising technique for enhancing spatial–spectral quality. Recently, low-rank representations have demonstrated significant advances in various hyperspectral-related applications, providing an effective solution to HMIF tasks. However, most existing methods rely on model priors to learn the low-rank representation of HSIs, which restricts their adaptability to low-rank variations across different datasets. To address this issue, this paper introduces a self-supervised low-rank decomposition network (SSLRDN) framework specifically designed for HMIF, inspired by the observation that the HR-MSI and HR-HSI of the same scene share highly similar spatial features, whereas different hyperspectral scenes exhibit variations in both spectral and spatial features. In SSLRDN, we develop a self-supervised network to adaptively learn the low-rank decomposition (spectral subspace and spatial coefficients) across different HR-HSIs, overcoming the inefficiency of conventional alternating optimization methods where factor updates fail to mutually promote each other. Given the spatial feature consistency between HR-MSI and HR-HSI, we leverage the rich spatial information from HR-MSI to guide the learning of spatial coefficients in HR-HSI. To enhance the self-supervised learning of spatial coefficient images, we further integrate an externally pre-trained denoiser to improve their estimation accuracy, effectively fusing and mutually promoting both self-supervised and pre-trained learning paradigms. Experimental results show that the proposed method achieves superior performance in both visual quality and quantitative metrics, without requiring pretraining on external datasets.

Abstract:
Large Vision-Language Models have drawn much attention and become increasingly applicable in complicated multimodal tasks such as visual question answering and video grounding. However, they still suffer from inefficiency during the inference stage due to the computational overhead brought by the large number of visual tokens. Existing works either utilize an attention score (or visual-text relevance) to filter out the less significant visual tokens, or insert learnable projection layers to directly compress the tokens, which neglects the informative details in visual signals and introduces information loss, resulting in poor generalizability to test data. To solve these problems, in this paper we propose a novel Disentangled Visual Token Compression module, i.e., DiViCo, that effectively compresses the visual tokens and maintains good performance simultaneously. Concretely, we first select the top τ% of visual tokens according to their average attention scores, then predict the gap between these selected tokens and the original information by employing the chosen tokens in a disentangled and variational manner. Specifically, we model the mean and variance, sampling the predicted gap from the Gaussian prior. We further keep the informativeness of the compressed visual tokens via KL divergence, which ensures the generalizability of the model. Extensive experiments demonstrate the advantage of our proposed DiViCo module against several state-of-the-art baselines over various real-world datasets. Most notably, LLaVA-v1.5-7b equipped with DiViCo reduces FLOPs by 67.7% and inference time by 51.7% while maintaining 95.6% of the accuracy of LLaVA-v1.5-7b without any compression.
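
The three steps named in this abstract (top-τ% selection by average attention, a Gaussian-sampled gap, and a KL term) can be illustrated with a small standard-library Python sketch. It is not the DiViCo implementation: the function names are hypothetical, the sampling uses the common reparameterization form, and the KL term assumes a standard normal prior, none of which the abstract specifies.

```python
import math
import random


def select_top_tokens(avg_attention, tau_percent):
    """Return indices of the top tau_percent% tokens by average attention score."""
    if not 0 < tau_percent <= 100:
        raise ValueError("tau_percent must be in (0, 100]")
    # Keep at least one token so the compressed sequence is never empty.
    k = max(1, math.ceil(len(avg_attention) * tau_percent / 100))
    ranked = sorted(range(len(avg_attention)),
                    key=lambda i: avg_attention[i], reverse=True)
    # Restore the original token order among the kept indices.
    return sorted(ranked[:k])


def sample_gap(mean, log_var, rng):
    """Reparameterized sample: gap = mean + std * eps, with eps ~ N(0, 1).

    Assumption: the gap is modeled per dimension as a diagonal Gaussian.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, 1)), summed over dimensions.

    Assumption: the prior is a standard normal.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


if __name__ == "__main__":
    scores = [0.10, 0.40, 0.05, 0.30, 0.15]  # hypothetical average attention
    kept = select_top_tokens(scores, tau_percent=40)
    gap = sample_gap([0.0, 0.0], [0.0, 0.0], random.Random(0))
    print(kept, gap, kl_to_standard_normal([1.0], [0.0]))
```

In this sketch the kept indices are returned in their original order so that positional information survives the pruning; whether DiViCo does the same is not stated in the abstract.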

Abstract:
In recent years, intelligent processing of satellite videos has emerged as a significant research focus within the field of remote sensing, driven by the growing demand for enhanced spatial resolution. This need has led to increased interest in satellite video super-resolution (SVSR) algorithms, which aim to improve the quality of satellite imagery. However, many existing SVSR methods tend to neglect the global dependencies among frames in satellite videos, resulting in an incomplete utilization of spatio-temporal feature information. To tackle this issue, we propose a novel non-local spatio-temporal bidirectional recurrent network specifically designed for SVSR applications. Our approach employs a gate-guided deformable alignment module that effectively enhances feature alignment and fusion using a dynamic gating mechanism. This allows the network to adaptively focus on relevant features during the reconstruction process. Furthermore, we introduce a non-local spatio-temporal fusion module that integrates both temporal and spatial relationships over long sequences of frames, ensuring a comprehensive extraction of feature information. Through extensive experiments, our proposed method demonstrates superior performance compared to state-of-the-art SVSR techniques in terms of reconstruction quality. Additionally, it demonstrates outstanding performance in downstream satellite video applications, showcasing its potential in satellite video processing tasks. The source code is publicly available at https://github.com/Yu-Wang-0801/NSBRNet

Abstract:
Multimodal remote sensing image classification has emerged as a key research area in remote sensing, with extensive applications in real-world scenarios. However, these images are collected by different sensors and contain multiple features such as spectrum, space, height and texture. Due to the differences in the characteristics of these data, existing methods have poor results in extracting and fusing heterogeneous features, which limits the improvement of classification performance. To address this problem, we propose a new heterogeneous feature extraction and fusion framework DTFNet, which utilizes the diffusion model and Transformer architecture. In the feature extraction stage, different networks are constructed to extract heterogeneous features while reducing redundancy. The dual-branch diffusion feature extraction (DBDFE) network based on the diffusion model is introduced to process data from different sensors, avoiding the limitation of extracting all features with a single network. In the feature fusion stage, the extracted diffusion features are fused with the original features to preserve the integrity of the original data. The cross-fusion transformer (CFT) module uses a convolutional neural network (CNN) to complete the local feature transformation and integration and models the long-range dependencies between heterogeneous features through cross-transformer encoders. Experimental results show that the classification accuracy of DTFNet on the three datasets reaches 92.38%, 80.08% and 95.02% respectively, which is significantly better than the existing state-of-the-art methods, demonstrating its effectiveness and superiority.

Abstract:
Current Convolutional Neural Networks (CNNs) for Weakly Supervised Semantic Segmentation (WSSS) often have difficulties in discovering distinctive feature locations for each category. Therefore, the pseudo-labels generated from the expanded seed regions are typically incomplete and contain a significant amount of noise. Without additional annotations, this erroneous information can propagate during the training of the segmentation network. In this work, we propose a Cross-Modal Dual Graph Reasoning (CDGR) framework to leverage both visual and language knowledge effectively. This framework can capture dependencies between the spatial and the semantic spaces, facilitating the discovery of discriminative feature locations. Specifically, we perform cross-modal graph reasoning between the visual and the language modal graphs to enhance global contextual relationships between pixels in the visual feature map. Additionally, we introduce a graph interaction attention network to thoroughly explore implicit relationships between visual and language graphs. We apply the CDGR network to generate more complete pseudo-labels for the classification network and utilize it in the segmentation network to unleash its self-correcting capabilities. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate the effectiveness of CDGR compared to other state-of-the-art peers. Our code is provided at https://github.com/JIA-ZHANG666/CDGR

Abstract:
Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, the memory consumption for training and inference increases substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to decrease memory consumption, while aiming to unify the frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions videos into several clips and processes each clip using a different INR model instance, leveraging the advantages of both compression frameworks while allowing seamless adaptation to either form. To further reduce temporal redundancy between clips, we treat the corresponding model parameters as proxies for these clips, and design two modules to optimize the initialization, training, and compression of these model parameters. In particular, the Residual Quantization and Entropy Constraint (RQEC) module dynamically balances the reconstruction quality of the current clip and the newly introduced bitrate cost using the previously optimized parameters as conditioning. In addition, the Interpolation-based Initialization (II) module flexibly adjusts the degree of reference used during the initialization of neighboring video clips, based on their correlation. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible video clip setting, can adapt to resource-constrained environments and significantly improve performance compared to different baseline models. The project page: https://wj-inf.github.io/UAR-NVC-page/

Abstract:
Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including 1) data unavailability, 2) lack of an effective trajectory-based framework, and 3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.

Abstract:
Text-Based Person Retrieval (TBPR) refers to identifying a specific target pedestrian image based on natural language descriptions. Most previous methods rely on one-to-one alignment between paired text-image data, ignoring the polymorphic nature of visual and linguistic information. Moreover, constrained by ID, earlier methods have shown limited exploration of intra-individual and inter-individual relations. This limitation confines them to exploring characteristics within individuals, making it challenging to uncover commonalities and invariants that extend across IDs (e.g., attributes). Recently, due to the lack of accurate annotations, exploring attribute-based cross-modal interactions and alignments has become a significant challenge in TBPR. To address these issues, we propose a Semantic Polymorphism and Commonality Learning (SPCL) framework. First, we present Relation-Sensitive Semantic Polymorphism Alignment (RSSPA) and ID-Based Semantic Polymorphism Alignment (IBSPA) to explore ID-limited Feature Redistribution. Second, we transcend the constraints of ID, leveraging ID-Free Attribute Alignment (IFAA) from a macro perspective to explore commonalities and invariants based on attribute features. Finally, from a micro perspective, we design Attribute Prior Fusion Reconstruction (APFR) to optimize the attention of our model, exploring the positive impact of attribute priors on cross-modal interaction. Experiments on CUHK-PEDES, ICFG-PEDES and RSTPReid show that our method achieves state-of-the-art performance on Rank-1, mAP and mINP.

Abstract:
Image steganography, a crucial technique for secure information transmission, faces the challenge of balancing embedding capacity with visual imperceptibility and security. Existing methods often struggle to maximize these metrics simultaneously, particularly when handling complex image details and achieving adaptive feature representation. To address this, we propose EctFormer, a novel deep steganography framework based on Image Hiding Empirical Mode Decomposition (IHEMD). EctFormer employs a compact autoencoder architecture with a key innovation: an integrated IHEMD module that adaptively decomposes images into physically meaningful intrinsic mode functions (IMFs) and residual components. This decomposition allows for superior feature representation and information embedding. Furthermore, we introduce an intrinsic mode loss function within a novel multi-image training strategy, achieving a remarkable embedding capacity of 96 bits per pixel. Experimental results on the DIV2K, COCO, and ImageNet datasets demonstrate EctFormer’s superior performance. Our method significantly improves PSNR (exceeding 17.00 dB for single-image tasks and 11.00 dB for multi-image tasks) while maintaining high SSIM values (above 0.99). These results surpass current state-of-the-art methods, validating the efficacy of our IHEMD-based approach and the proposed training strategy. EctFormer provides a new effective paradigm for image steganography and enables high-capacity, high-security covert communication. The code is available at https://github.com/lisen1129/EctFormer

Abstract:
Inverse problems in medical imaging, such as undersampled magnetic resonance imaging (MRI) and sparse-view computed tomography (CT) reconstruction, are essential yet challenging tasks for achieving accurate and reliable diagnostic images. Traditional reconstruction approaches, including iterative optimization algorithms and supervised deep learning methods, often struggle with limited adaptability across imaging protocols, substantial computational requirements, and poor generalization between different imaging modalities. Diffusion-based generative models have recently demonstrated promising results; however, these methods frequently suffer from cumulative estimation errors in their sampling processes, limiting their practical performance and robustness. In this paper, we propose a novel framework called Parallel Trajectory Constrained Sampling (PCS), which substantially enhances image reconstruction quality by explicitly enforcing consistency with the underlying physical measurement process. Specifically, PCS introduces a measurement-domain diffusion model whose reverse stochastic differential equation (SDE) trajectory is analytically determinable, thus obviating the need for a learned score estimator within the measurement domain. Furthermore, a parallel trajectory constraint is formulated to rigorously align the reverse sampling paths of the measurement and image diffusion processes, ensuring strict adherence to the known physical model at every sampling step. The proposed PCS method is flexible and can seamlessly integrate various SDE-based diffusion priors. Extensive experiments on representative inverse problems—including undersampled MRI reconstruction, sparse-view CT reconstruction, and image super-resolution—demonstrate that PCS consistently outperforms existing state-of-the-art diffusion-based reconstruction methods. 
Although current evaluations focus specifically on MRI and CT modalities, the PCS framework holds considerable promise for broader applicability to other imaging modalities and inverse problems, which we plan to investigate in future studies.

Abstract:
Learning-based Underwater Image Enhancement (UIE) methods have made significant progress. However, because labels are selected manually, UIE datasets are limited in quantity and their label quality is outdated, which has severely hindered the development of the UIE community. The urgent demand for more and better paired training samples motivates us to propose Ensemble-Select (E-Select), a strategy that can serve as an alternative to fully manual annotation and can enable continuous expansion of dataset size and optimization of label quality. However, expanding the dataset brings new images and optimizing the labels brings new algorithms, and the proposed strategy must generalize well in both scenarios, a requirement that overwhelms many IQA methods. This work improves generalization in two ways. First, three primary influencing factors and their interrelationships in quality assessment are systematically analyzed. Specifically, we first explore the interactive relationship between content and distortion perception, and further investigate the guiding value of aesthetic-aware features in image quality perception. Then, the proposed Distortion-Content Interaction Module (DCIM) enables the network to focus on perceptually important distortion features guided by content. Second, we investigate a multi-perspective quality evaluation framework based on the ensemble learning paradigm. Building upon the availability of numerous outstanding IQA works, we first demonstrate the distinct regions in which they excel and their evaluation biases. Subsequently, we explore the ensemble of their results through the proposed Aesthetic-Guided Quality Regression module (AGQR), which generates dynamic quality regression layers and derives image-specific quality perception rules based on aesthetic features. We then construct the first expandable, updatable UIE dataset with the help of E-Select. 
We collect over 50k real underwater image pairs with optimal labels, covering diverse scenes and varied degradation characteristics. Unlike other datasets, our dataset can consistently expand the number of paired samples and maintain optimal labeling without requiring extensive human labor. Experiments show the facilitating effect of the newly constructed dataset on UIE and the SOTA performance of E-Select. Codes and datasets are available at URL.

Abstract:
Recent advances have highlighted the potential of diffusion models in Video Anomaly Detection (VAD). Diffusion models are typically employed to generate negative instances to distinguish them from positive ones. However, the existing diffusion model architectures, generally based on the reconstruction of low-level noisy features, introduce spurious correlations due to shortcut learning, which undermines the robustness of anomaly detection. In this work, we propose a normal representation-guided conditional diffusion model for unsupervised VAD, which leverages normal-specific representations to guide behavior restoration by aligning disentangled task-relevant representations within the diffusion model. Inspired by prior knowledge of anomaly discrimination, we use contrastive learning to decompose normal behavior features into normal-specific and VAD-irrelevant representations held in independent channels. We introduce a group-supervised learning strategy for learning patch-wise generation guided by normal-specific representations. A gradient-based representation alignment loss enforces the alignment between normal semantics and the target patches. This process enables the diffusion model to understand normal patterns for anomaly detection. Extensive experimental results on VAD benchmarks demonstrate the effectiveness of our method.

Abstract:
Existing 3D human pose estimation methods often suffer in performance, when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.

Abstract:
Vision-Language Models (VLMs) like CLIP have advanced image representation through open-vocabulary semantic alignment. Yet, existing few-shot transfer learning methods largely overlook the intrinsic interdependencies between text and image embeddings, limiting their ability to fully transfer CLIP’s pretrained capabilities. To address this gap, we propose Hyperspherical Interpolation Variational Encoding (HIVE), a novel method for few-shot image classification. Our core idea is to shift away from directly training feature extraction capabilities for downstream tasks, and instead focus on exploring the semantic transformation relationships between upstream and downstream tasks. By modeling semantics from coarse to fine granularity, HIVE enables the transfer of original feature extraction and modality alignment capabilities to downstream tasks. Extensive experiments on eight established benchmarks, including CUB and EuroSAT, validate HIVE’s efficacy, achieving up to 46.2% and 80.0% improvements over the original CLIP in 1-shot and 16-shot classification tasks, respectively. Our work underscores the importance of preserving pretrained geometric constraints while exploiting semantic hierarchies for effective few-shot adaptation, providing a principled approach for vision-language model customization.

Abstract:
3D object detection has achieved significant progress in outdoor LiDAR point clouds; however, the inherent irregularity and varying sparsity distribution of point occupancy present a key challenge. Existing transformer-based 3D detectors often treat all tokens within the attention window as equally important, regardless of varying sparsity, which not only fails to address the disparities between the varying beam densities but also results in increased memory and computational costs. In this work, we propose an adaptive structure-aware cascaded transformer (ASCFormer) that dynamically captures density-insensitive multiscale structure features to model long-range dependencies via cascaded learning. Our ASCFormer detector includes an adaptive structure-aware token learning module that embeds voxel-level foreground probability and grid-level local density into the grid tokens to enhance structural perception capability. Moreover, we integrate these factors to compute significance scores, which are then utilized in inverse transform sampling to select a subset of multiscale tokens with varying receptive field sizes. To improve the training convergence of the window-based transformer in 3D voxel space, we employ cascaded learning via cross-stage attention to enhance the feature representation capability and refine the localization precision of 3D bounding boxes. This design of structure-aware reweighting effectively enhances the cascade paradigm, making it more adaptable to the varying sparsity distribution of point clouds. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that the proposed ASCFormer detector achieves exceptional performance compared with state-of-the-art 3D object detection methods. The source code is publicly available at https://github.com/Xinglong-Li1/ASCFormer
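
Inverse transform sampling, which this abstract uses to pick tokens from significance scores, can be sketched in standard-library Python as follows. This is a generic illustration and not ASCFormer code: the function name and the example scores are hypothetical, and the sketch draws with replacement, whereas the abstract does not say how duplicates are handled when forming the token subset.

```python
import bisect
import itertools
import random


def inverse_transform_sample(scores, num_samples, rng):
    """Draw indices with probability proportional to non-negative scores."""
    if not scores or any(s < 0 for s in scores):
        raise ValueError("scores must be non-empty and non-negative")
    # Unnormalized cumulative distribution over token indices.
    cdf = list(itertools.accumulate(scores))
    total = cdf[-1]
    if total <= 0:
        raise ValueError("at least one score must be positive")
    picks = []
    for _ in range(num_samples):
        u = rng.random() * total  # uniform draw in [0, total)
        # Invert the CDF: first index whose cumulative score exceeds u.
        # The upper bound keeps the result a valid index under rounding.
        picks.append(bisect.bisect_right(cdf, u, 0, len(cdf) - 1))
    return picks


if __name__ == "__main__":
    significance = [0.0, 1.0, 0.0, 3.0]  # hypothetical significance scores
    drawn = inverse_transform_sample(significance, 8, random.Random(0))
    print(drawn)
```

Tokens with larger scores occupy wider intervals of the cumulative distribution and are therefore drawn more often, while low-score tokens keep a small but nonzero chance of selection, unlike a hard top-k cut.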

Abstract:
Binary segmentation is used to distinguish objects of interest from background, and is an active area of convolutional encoder-decoder network research. Current decoders are designed for specific objects and use common backbones as encoders, but cannot deal with complex backgrounds. Inspired by the way human eyes detect objects, we propose a new unified dual-branch decoder paradigm, termed the difference-aware decoder (DAD), to better explore the differences between foreground and background and to separate objects of interest in optical images. This decoder operates in two stages, leveraging multi-level features from the encoder. In the first stage, coarse detection of foreground objects is achieved by directly utilizing high-level semantic features, mimicking the initial rough observation of human vision. In the second stage, the decoder refines segmentation by exploring differences in low-level features, guided by the coarse map from the first stage. To enhance this process, we introduce two key innovations. First, a difference-aware prototype generation strategy leverages the guide map to extract foreground and background prototypes from high-level features, and calculates the similarity between these prototypes and corresponding representations in low-level feature spaces. Second, an overlapped window cross-level semantic guidance mechanism integrates high-level semantic information into low-level features through channel grouping and multi-scale aligned window pairs, guided by the similarities computed in the first strategy. Together, these innovations significantly enhance the DAD’s ability to discern subtle differences, enabling precise foreground extraction and effectively addressing the challenges of complex and varied backgrounds. 
To verify the performance of the proposed difference-aware decoder, we choose three well known backbones including ResNet, Res2Net, PVT, and two binary segmentation tasks, i.e ., salient object detection, and camouflaged object detection, for comparative experiments. The results demonstrate that the difference-aware decoder can achieve higher accuracy than the other state-of-the-art binary segmentation methods for these tasks. The source code will be available on https://github.com/Henryjiepanli/DAD
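
The prototype generation idea in this abstract can be sketched roughly as follows. This is an illustrative simplification, not the released DAD code: it assumes the high-level and low-level features have already been brought to the same resolution and channel count, uses guide-weighted average pooling for the prototypes, and uses cosine similarity; all function names are hypothetical.

```python
import math


def masked_average(features, weights):
    """Weighted mean of per-pixel feature vectors (one vector per pixel)."""
    total = sum(weights)
    dim = len(features[0])
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * f[c] for f, w in zip(features, weights)) / total
        for c in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def difference_map(high_feats, low_feats, guide):
    """Per-pixel foreground-minus-background similarity.

    guide holds coarse foreground probabilities in [0, 1], one per pixel.
    """
    fg_proto = masked_average(high_feats, guide)
    bg_proto = masked_average(high_feats, [1.0 - g for g in guide])
    return [cosine(f, fg_proto) - cosine(f, bg_proto) for f in low_feats]


# Toy example: four pixels with two channels, first two pixels are foreground.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
guide = [1.0, 1.0, 0.0, 0.0]
print(difference_map(feats, feats, guide))  # [1.0, 1.0, -1.0, -1.0]
```

Positive values mark pixels that resemble the foreground prototype more than the background one, which is the kind of signal a second-stage refinement could use.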

Abstract:
JPEG AI is an emerging learning-based image coding standard developed by the Joint Photographic Experts Group (JPEG). The scope of JPEG AI is the creation of a practical learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization and machine consumption. Scheduled for completion in early 2025, the first version of JPEG AI focuses on human vision tasks, demonstrating significant BD-rate reductions compared to existing standards in terms of the MS-SSIM, FSIM, VIF, VMAF, PSNR-HVS, IW-SSIM, and NLPD quality metrics. Designed to ensure broad interoperability, JPEG AI incorporates various design features to support deployment across diverse devices and applications. This paper provides an overview of the technical features and characteristics of the JPEG AI standard.

Abstract:
Deep cross-modal hashing models generally inherit the vulnerabilities of deep neural networks, making them susceptible to adversarial attacks and thus posing a serious security risk during real-world deployment. Current adversarial attack or defense strategies often establish a weak correlation between the hashing codes and the targeted semantic representations, and there is still a lack of related works that simultaneously consider the attack and defense for deep cross-modal hashing. To alleviate these concerns, we propose a Fuzzy-Prototype-guided Adversarial Attack and Defense (FPAD) framework to enhance the adversarial robustness of deep cross-modal hashing models. First, an adaptive fuzzy-prototype learning network (FpNet) is efficiently presented to extract a set of fuzzy-prototypes, aiming to encode the underlying semantic structure of the heterogeneous modalities in both feature and Hamming spaces. Then, these derived prototypical hash codes are heuristically employed to supervise the generation of high-quality adversarial examples, while a fuzzy-prototype rectification scheme is simultaneously designed to preserve the latent semantic consistency between the adversarial and benign examples. By mixing the adversarial samples with the original training samples as the augmented inputs, an efficient fuzzy-prototype-guided adversarial learning framework is proposed to execute the collaborative adversarial training and generate robust cross-modal hash codes with high adversarial defense capabilities, therefore resisting various attacks and benefiting various challenging cross-modal hashing tasks. Extensive experiments evaluated on benchmark datasets show that the proposed FPAD framework not only produces high-quality adversarial samples to enhance the adversarial training process, but also shows its high adversarial defense capability to benefit various cross-modal hashing tasks. The code is available at: https://github.com/yzq131/FPAD

Abstract:
Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval task that aims to match images of the same person across visible (VIS) and infrared (IR) modalities. Existing VI-ReID methods ignore high-order structure information of features and struggle to learn a reliable common feature space due to the modality discrepancy between VIS and IR images. To alleviate the above issues, we propose a novel high-order hierarchical middle-feature learning network (HOH-Net) for VI-ReID. We introduce a high-order structure learning (HSL) module to explore the high-order relationships of short- and long-range feature nodes, for significantly mitigating model collapse and effectively obtaining discriminative features. We further develop a fine-coarse graph attention alignment (FCGA) module, which efficiently aligns multi-modality feature nodes from node-level and region-level perspectives, ensuring reliable middle-feature representations. Moreover, we exploit a hierarchical middle-feature agent learning (HMAL) loss to hierarchically reduce the modality discrepancy at each stage of the network by using the agents of middle features. The proposed HMAL loss also exchanges detailed and semantic information between low- and high-stage networks. Finally, we introduce a modality-range identity-center contrastive (MRIC) loss to minimize the distances between VIS, IR, and middle features. Extensive experiments demonstrate that the proposed HOH-Net yields state-of-the-art performance on the image-based and video-based VI-ReID datasets. The code is available at: https://github.com/Jaulaucoeng/HOS-Net

Abstract:
Physiological studies have shown that differences between depressed and healthy individuals are manifested in the audio and video modalities. Hence, some researchers have combined local and global information from the audio or video modality to obtain a unimodal representation. Attention mechanisms or Multi-Layer Perceptrons (MLPs) are then used to fuse the different representations. However, attention mechanisms and MLPs are essentially linear aggregation schemes and lack the ability to explore the element-wise interaction between local and global representations within and across modalities, which affects the accuracy of estimating depression severity. To this end, we propose a Representation Interaction (RI) module, which uses mutual linear adjustment to achieve element-wise interaction between representations. Thus, the RI module can be seen as a mutual observation of two representations, which helps to achieve complementary advantages and improve the model’s ability to characterize depression cues. Furthermore, since the interaction process generates multiple representations, we propose a Multi-representation Prediction (MP) module. This module implements multi-representation vectorization in a hierarchical manner, from summarizing a single representation to aggregating multiple representations, and adopts the attention mechanism to estimate an individual’s depression severity. In this way, we use the RI and MP modules to construct the Multimodal Local Global Interaction (MLGI) network. The experimental performance on the AVEC 2013 and AVEC 2014 depression datasets demonstrates the effectiveness of our method.

Abstract:
Aerial object detection plays a vital role in applications such as natural disaster prevention and urban traffic management, thanks to its ability to handle wide coverage areas and diverse objects. As a leading method for this task, You Only Look Once (YOLO) leverages multi-scale feature extraction to detect objects of various sizes. However, most YOLO-based methods focus on feature extraction and fusion from adjacent scales, neglecting the potential collaboration between non-adjacent scales. This limitation leads to redundant parameters and suboptimal detection performance. To address these issues, this paper proposes AF-YOLO (Asymptotic Feature Extraction and Fusion YOLO), a novel approach tailored for aerial object detection. AF-YOLO introduces two lightweight modules: SCC2f and PAFFN. SCC2f, an optimized version of cross-stage partial bottleneck with spatial and channel reconstruction convolution layers, reduces redundancy and enables efficient multi-scale feature extraction. PAFFN, a parallel asymptotic feature fusion network, facilitates enhanced interaction and fusion of non-adjacent scale features. Additionally, AF-YOLO incorporates a P2 layer to improve small object detection and removes YOLO’s P5 layer for a more lightweight design, specifically optimized for aerial detection tasks. Experimental results demonstrate AF-YOLO’s significant improvements across multiple benchmarks: on the VisDrone dataset, it achieves a 6.1% higher mAP0.5 compared to recent baselines while using only 41.8% of their parameters; on the DIOR dataset, it shows a 3.3% accuracy improvement over YOLOv8. These quantitative results are further supported by its superior performance on the DOTA and FAIR1M datasets, with additional validation on HazyDet confirming its robustness in adverse weather conditions. 
Collectively, these achievements highlight AF-YOLO’s exceptional generalization capability and efficient lightweight design, establishing a new state-of-the-art for aerial object detection systems.

Abstract:
Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code and checkpoint have been released on https://github.com/Event-AHU/MambaEVT

Affiliations: Institute of Advanced Technology, University of Science and Technology of China, Hefei, China; School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; School of Automation, Northwestern Polytechnical University, Xi’an, China; School of Artificial Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract:
Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, suffer from limited denoising performance. In contrast, existing two-stage approaches typically overlook the characteristics of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we propose a novel Mamba-based method customized for low-light RAW images, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we introduce a Retinex Decomposition Module (RDM) grounded in the Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction, reducing the effect of manual linear illumination enhancement. By bridging demosaicing and denoising, better enhancement of low-light RAW images is achieved. Experimental evaluations conducted on the public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping. The code is available at https://github.com/Cynicarlos/RetinexRawMamba.
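
For readers unfamiliar with the Retinex prior mentioned above, the following sketch shows the classical model in which an image is the element-wise product of reflectance and illumination. It is not the learned RDM from the paper: the max-channel illumination estimate and the gamma-based exposure correction are standard hand-crafted stand-ins chosen here as assumptions, and the function names are hypothetical.

```python
def retinex_decompose(pixels, eps=1e-6):
    """Split RGB pixels into reflectance and illumination (I = R * L).

    Illumination is estimated per pixel as the maximum channel value,
    a common hand-crafted initial estimate (assumption, not the RDM).
    """
    illum = [max(p) for p in pixels]
    refl = [[c / max(l, eps) for c in p] for p, l in zip(pixels, illum)]
    return refl, illum


def recompose(refl, illum, gamma=0.5):
    """Rebuild the image after a non-linear (gamma) illumination correction."""
    return [[c * (l ** gamma) for c in r] for r, l in zip(refl, illum)]


dark = [[0.04, 0.02, 0.01]]           # one very dark RGB pixel
refl, illum = retinex_decompose(dark)
bright = recompose(refl, illum)       # illumination lifted from 0.04 to about 0.2
print(refl, illum, bright)
```

Because only the illumination is adjusted, the channel ratios stored in the reflectance are unchanged, which is the intuition behind correcting exposure non-linearly instead of multiplying the whole image by a fixed gain.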

Abstract:
In 3D semantic occupancy prediction, both the task-specific characteristics and input data critically influence network perception performance. The task faces several challenges, such as data-label misalignment, spatial-significance variance, and a long-tailed distribution in semantic occupancy prediction labels. To deal with these challenges, we propose a generalized auxiliary enhancement module, termed OCC-Exoskeleton, for semantic occupancy prediction. The proposed module demonstrates remarkable adaptability, enabling seamless integration with diverse occupancy prediction models while maintaining architectural compatibility. Our module is made up of three parts, each of which is specifically designed to address one of the three mentioned challenges: 1) Virtual point cloud distillation. We generate the virtual point cloud, teaching realistic modalities to concentrate on the data-label misalignment positions. 2) Dual-expert occupancy head. We allocate the occupancy prediction task to two expert heads according to spatial significance to obtain more targeted outcomes. 3) Scene-level frames mixture augmentation. We propose a frames mixture augmentation method that introduces additional foreground objects to create more complex driving scenes, alleviating the long-tailed distribution and enhancing the model’s robustness. Furthermore, the proposed module functions as an efficient plug-and-play module, capable of enhancing the performance of existing network architectures while maintaining minimal computational overhead. Extensive experiments demonstrate that our module achieves significant performance improvement in a range of methods with different input modalities.

Abstract:
Language-guided navigation is a cornerstone of embodied AI, enabling agents to interpret language instructions and navigate complex environments. However, expert-provided instructions are limited in quantity, while synthesized annotations often lack quality, making them insufficient for large-scale research. To address this, we propose NavComposer, a novel framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities enhances both the richness and accuracy of instructions. Moreover, it operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. Complementing NavComposer, we introduce NavInstrCritic, a comprehensive annotation-free evaluation system that assesses navigation instructions on three dimensions: contrastive matching, semantic consistency, and linguistic diversity. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations. By decoupling instruction generation and evaluation from specific navigation agents, our method enables more scalable and generalizable research. Extensive experiments provide direct and practical evidence for the effectiveness of our method.

Abstract:
Transparent object manipulation has long posed a significant challenge in robotic grasping tasks. Existing methods for transparent object grasping rely heavily on visual sensors, aiming to extract relevant features from raw visual data to facilitate grasp execution. However, transparent objects often possess unreliable visual properties, while tactile contact reliably captures their physical properties. Moreover, these visual-based methods often overlook variations in object standing type (OST) and weight, limiting the precise grasping of transparent objects in different physical states. In contrast, humans naturally form memory associations between visual and tactile information and adjust grip force based on tactile feedback. Building on this foundation, we propose tactile-enhanced visual grasping (TEVG)—a novel method that augments robotic visual capabilities with tactile information to enable precise grasping of transparent objects with unknown OST and weight. The TEVG framework comprises two key components: pre-grasp enhancement (PE) and in-hand enhancement (IE). During the pre-grasp phase, PE embeds tactile features into the visual encoder to predict physical properties in advance, facilitating explicit identification of OST and accurate grasp pose prediction through the tactile-enhanced visual (TEV) encoder. IE enables real-time adaptive adjustment of grasp force during contact manipulation, allowing the system to handle objects with unknown weight effectively. Experimental results on two different robotic platforms demonstrate that TEVG significantly enhances the accuracy and stability of grasping transparent objects. The experiment video and project are publicly available at: https://sites.google.com/view/cvft1

Abstract:
The rich spectral information within hyperspectral images (HSIs) results in large data volumes. Thus, finding a compact representation for HSIs while maintaining reconstruction quality is a fundamental task for numerous applications. Though existing learning-based compression methods and context models have shown strong rate-distortion (RD) performance, these methods attend only to spatial redundancy without considering the spectral redundancy of HSIs, which impedes further improvement of their performance on HSIs. Moreover, the strictly sequential autoregressive nature of context models leads to inefficiency, further limiting their practical applications. In this paper, leveraging the spectral priors unique to HSIs, we propose a hybrid Transformer-CNN architecture to find compact latent representations of HSIs. Specifically, we construct a Spectral-Spatial Coupling Transformer Group (SSCTG) to cooperatively extract spatial and spectral features of HSIs. Additionally, we propose a Group-wise Context Model (GCM) to further enhance the parallel processing capability of autoregression within context models, significantly improving the coding efficiency. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior RD performance compared to state-of-the-art methods while maintaining high codec efficiency.

Abstract:
Sonar images are vital in ocean explorations but face transmission challenges due to limited bandwidth and unstable channels. The Just Noticeable Difference (JND) represents the minimum distortion detectable by human observers. By eliminating perceptual redundancy, JND offers a solution for efficient compression and accurate Image Quality Assessment (IQA) to enable reliable transmission. However, existing JND models prove inadequate for sonar images due to their unique redundancy distributions and the absence of pixel-level annotated data. To bridge these gaps, we propose the first sonar-specific, picture-level JND dataset and a weakly supervised JND model that infers pixel-level JND from picture-level annotations. Our approach starts with pretraining a perceptually lossy/lossless predictor, which collaborates with sonar image properties to drive an unsupervised generator producing Critically Distorted Images (CDIs). These CDIs maximize pixel differences while preserving perceptual fidelity, enabling precise JND map derivation. Furthermore, we systematically investigate JND-guided optimization for sonar image compression and IQA algorithms, demonstrating favorable performance enhancements.

Abstract:
Synthesizing novel views from unconstrained image collections is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively with sparse inputs, resulting in noticeable artifacts. In this work, we introduce SparseGS-W, a novel framework designed to boost the reconstruction of unconstrained scenes and novel view synthesis using as few as five training images. Motivated by the observation that diffusion prior constrained by limited sparse inputs can remove artifacts through fast and efficient fine-tuning, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively enhance the quality of rendered novel views. We further present an Occlusion Handling scheme, which flexibly removes occlusions utilizing the inherent inpainting capability of constrained diffusion priors. Both components are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments demonstrate that SparseGS-W achieves superior performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA and MUSIQ.

Abstract:
Spatial computing has become a cornerstone of consumer electronics in the metaverse era, powering augmented reality (AR), virtual reality (VR) head-mounted displays (HMDs), and smart glasses. A key enabling technology for these mobile platforms is visual odometry (VO), which supports accurate motion tracking for seamless navigation and interaction. However, deploying deep learning-based VO on resource-constrained edge devices remains challenging due to high computational complexity, memory usage, and power demands. Moreover, critical operations such as feature matching, triangulation, and nonlinear optimization are notoriously intensive for embedded processors, underscoring the need for application-specific acceleration. This work presents a hardware-algorithm co-designed VO acceleration system for edge deployment, implemented on a Xilinx UltraScale+ MPSoC ZCU104. The system integrates an ARM Cortex-A53 processor, a neural network accelerator, and custom modules for feature matching and pose refinement. With hardware-aware algorithmic optimizations, the proposed design achieves a 255.6× speedup in neural inference, 13.7× acceleration for geometric modules, and an additional 2.1× gain through task-level parallelism, sustaining 30.6 FPS in real time. Compared to existing FPGA-based VO designs, our system offers the highest localization accuracy while maintaining real-time performance, demonstrating its practical viability for spatial computing in real-world scenarios.

Affiliations: Ministry of Education Key Laboratory of Micro/Nano Systems for Aerospace, Key Laboratory of Micro- and Nano-Electro-Mechanical Systems of Shaanxi Province, School of Mechanical Engineering, Northwestern Polytechnical University, Xi’an, Shaanxi, China; Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi, China; Xi’an Modern Control Technology Research Institute, Xi’an, Shaanxi, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China

Abstract:
Long-wave infrared polarimetric imaging plays a crucial role in target detection, material classification, and scene understanding in complex environments. However, high-intensity background clutter significantly degrades the quality of images captured by division-of-focal-plane (DoFP) polarimeters, thereby limiting the performance of subsequent target detection algorithms. Meanwhile, conventional post-processing methods passively suppress background clutter through image analysis alone, lacking in-depth exploration of the underlying hardware framework. As a result, their clutter suppression capability remains limited in complex scenes. In this paper, we propose Active Defocus Blurring Enhancement (ADBE), a simple yet effective approach that integrates both hardware and algorithm strategies to flexibly manipulate the imaging process. Specifically, we reveal how lens defocus influences the measurement of polarization information and the suppression of background clutter. Building on this insight, an active defocus blurring strategy is developed, transforming a conventional imaging limitation into a powerful mechanism for optimizing imaging parameters based on scene conditions. The outdoor experimental results demonstrate that, compared with the conventional in-focus approach, the proposed method can significantly suppress the background clutter and enhance the contrast between the target and background, particularly in degree of linear polarization (DoLP) images. These findings highlight the potential of ADBE for next-generation intelligent imaging systems that autonomously adapt to diverse application scenarios.
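
As background for the DoLP images mentioned above, the sketch below computes the linear Stokes parameters and the degree of linear polarization from the four polarizer orientations (0°, 45°, 90°, 135°) that a DoFP super-pixel typically provides. It illustrates only the standard definition and says nothing about how ADBE itself processes the data; the function names are hypothetical.

```python
import math


def stokes_from_dofp(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                        # horizontal vs. vertical
    s2 = i45 - i135                      # +45 degrees vs. -45 degrees
    return s0, s1, s2


def dolp(i0, i45, i90, i135, eps=1e-12):
    """Degree of linear polarization, sqrt(S1^2 + S2^2) / S0."""
    s0, s1, s2 = stokes_from_dofp(i0, i45, i90, i135)
    if s0 <= eps:
        return 0.0  # avoid dividing by zero on dark pixels
    return math.hypot(s1, s2) / s0


print(dolp(1.0, 0.5, 0.0, 0.5))  # fully horizontally polarized light: 1.0
print(dolp(0.5, 0.5, 0.5, 0.5))  # unpolarized light: 0.0
```

Since DoLP is a ratio of differences to a sum, noise and clutter in the four raw channels propagate into it non-linearly, which is why the quality of the raw measurements matters so much for DoLP contrast.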

Abstract:
Video semantic segmentation aims to assign a semantic label to each pixel in a video by jointly exploiting spatial context and temporal coherence, and it is essential for applications such as autonomous perception and intelligent surveillance. However, conventional RGB videos lack plenoptic cues, making segmentation unreliable in visually complex conditions, including occlusion, low-light environments, and transparent surfaces. Light field imaging captures both spatial and angular information via a micro-lens array, providing multi-view geometric cues that can enhance scene representation and improve segmentation robustness. Motivated by these advantages, we investigate light field video semantic segmentation and propose the Light Field Spatial-Angular Complementary Network (LFCNet) for precise and efficient segmentation under challenging visual settings. LFCNet first employs an efficient pooling strategy to extract multi-scale macro-pixel spatial context features, and then introduces the Angular Modeling Module (AMM) and Context Change Module (CCM) to capture angular cues and handle view-dependent contextual variations. Furthermore, we design a Temporal-Channel Correlation Module (TCCM) to enhance temporal feature consistency by selectively refining channel-wise representations across frames. To support training and evaluation, we construct a light field video dataset based on macro-pixel representations as a benchmark for this task. Extensive experiments on four datasets demonstrate that LFCNet achieves superior segmentation accuracy and competitive efficiency.

Abstract:
Understanding the structure of characters is crucial for recovering clear and readable high-resolution scene text images in Scene Text Image Super-Resolution (STISR). Many recent STISR methods inject the character structure information implicit in recognition priors into the super-resolution network to guide the super-resolution process, thereby facilitating the generation of more legible text images. However, the recognition priors obtained from low-resolution images are inaccurate, so directly embedding these priors into the network easily misleads the super-resolution process. To address this problem, we draw inspiration from Masked Image Modeling (MIM) and propose the Mask Structure Inference Network (MSINet), which can generate scene text images with accurate character structures without directly embedding recognition priors. To make STISR compatible with MIM, we also propose a Mask-and-Inference Paradigm (MIP), which consists of a mask image pre-training stage for character structure learning and a fine-tuning stage for character structure inference. In addition, a novel mask strategy named Text Confidence Mask (TCM) is proposed to avoid recovery errors by masking legible character regions. With MIP and TCM, MSINet substantially improves the clarity and readability of degraded scene text images. Specifically, in recognition accuracy, MSINet-B outperforms recent state-of-the-art methods by about +3.7% on TextZoom and by an average of +3.6% on six manually degraded scene text recognition datasets. The code will be released at https://github.com/Yuanssr/MSINet

Abstract:
Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available in https://github.com/DongXu-Zhang/IGASA

Abstract:
Open-World Few-Shot Learning (OFSL) is a critical research domain focused on accurately identifying target samples under conditions where data is scarce and labels are unreliable. This field is highly relevant to real-world scenarios, holding significant practical implications. Currently, the field has only a few solutions, primarily relying on conventional methods such as metric learning and feature aggregation. However, these methods often struggle in more complex scenarios. Recent breakthroughs in foundation models such as CLIP and DINO have demonstrated their strong representational capabilities, even in resource-limited environments. These advancements have led to a shift from “training model from scratch” towards “exploiting the extensive capabilities and expertise of these pre-trained foundation models for OFSL”. Inspired by this shift, we introduce the Improved Collaborative Consortium of Foundation Models (CO3+), an extension of CO3, first presented in AAAI 2024. CO3+ significantly improves the accuracy of OFSL by integrating the strengths of four foundational models. It includes three decoupled blocks: 1) the Label Correction Block (LC-Block) rectifies unreliable labels, 2) the Data Augmentation Block (DA-Block) enriches the available data, and 3) the Text-guided Fusion Adapter (TeFu-Adapter) merges various features and reduces the impact of noisy labels through semantic constraints. We evaluate CO3+ across eleven benchmark datasets, comparing it against recent state-of-the-art methods. Our thorough evaluations demonstrate that the proposed CO3+ consistently surpasses existing methods by a substantial margin, particularly in high-noise scenarios.

Abstract:
RGB-T tracking benefits from the complementary nature of RGB and TIR modalities, yet their relative reliability for target localization often shifts over time. Most existing trackers fail to adapt to such modality and temporal dynamics in a unified and effective manner, resulting in target representations that are neither discriminative nor temporally consistent. In this paper, we propose ProMoT, a novel tracking framework that jointly integrates cross-modal and temporal cues into a progressive prompting process, enabling continuous retrieval of target-aware representations. Specifically, we design an adaptive target query generator (QueryGen), which selectively aggregates informative spatio-temporal cues from diverse ghost representations through the dynamic sparse ghost fusion mechanism, thereby enabling the generation of target-aware queries. To further preserve fine-grained, temporally consistent target cues, we introduce a high-order contextual prompt updater (PromptUpdater), which encodes high-order cross-modal representations from current and previous frames. These prompts establish the compact and discriminative inter-frame context to not only refine the current frame’s features but also guide target localization in future frames. All components are built upon a parameter-shared backbone for RGB and TIR inputs, forming our complete ProMoT framework. Extensive experiments on both complete and missing modality RGB-T tracking benchmarks show that ProMoT consistently achieves state-of-the-art performance while balancing efficiency.

Abstract:
Accurate depth maps are essential for indoor navigation and modeling by robots, but raw depth maps often have missing areas due to sensor limitations, environmental factors, and distance constraints. Existing methods that fuse RGB images with depth maps usually cannot utilize spatial structural information and exhibit poor accuracy at object edges. To bridge this gap, a dual-branch fusion network with a Mamba decoder, called DBFNM, is proposed for depth completion in this work. It consists of two complementary branches: one branch utilizes semantic and texture information from RGB images as visual guidance, while the other extracts spatial geometric structures from normal maps as structural guidance. In particular, a geometric gated encoder is utilized to fully leverage spatial information. In the dual-branch decoding stage, a dual-branch feature interaction alignment module is designed, which is composed of three components: dual-branch edge feature alignment, dual-branch interaction, and global alignment. Then, the decoded dual-branch features are processed by a dual-modal fusion network based on a spatial propagation network to obtain dense depth map predictions. Extensive experimental results on the NYU-Depth V2 and SUN RGB-D datasets demonstrate that DBFNM achieves superior depth completion performance compared to existing methods in indoor scenes, particularly in handling large-scale missing depth regions and preserving edge details.

Abstract:
Mirror detection (MD) aims to overcome interference caused by reflections and locate mirror regions. Existing methods focus on designing components to explicitly establish the associations between physical entities and corresponding imagings, or on utilizing rotation to construct symmetric consistency. We observe three issues: a) the correspondence between entities and imagings is often incomplete or incorrect; b) other physical materials (e.g., glass) exhibit characteristics partially similar to mirrors, causing confusion when they co-occur; c) complex interfering factors (e.g., occlusion) and reflection mechanisms may expand the vector space several times over. To address these issues in a unified manner, we formulate the scene-aware visual reasoning network (SVRNet) based on visual prompts. Specifically, we construct the prototype-guided prompt chain reasoning (PPCR) that generates a mixed chain of thought reasoning based on maximal difference heterogeneous prototypes to construct comprehensive spatial location and semantic perception. Noise may accumulate gradually through the chain, and crucial clues may also disappear. Therefore, we design the prompt evolution (PE) to filter out noise and enhance the coupling between prompts. We further develop the mixture of prompt injection expert (MPIE) to dynamically select the optimal injection strategy in the low-rank space based on the specific scene. Because reflection interference and the random parameter space introduce potential ambiguity, we formulate the three-way evidence-aware (TEA) loss to quantify the uncertainty, thereby providing reliable predictions. To leverage historical knowledge and further disentangle representations, we propose the frequency prototype contrastive (FPC) loss for learning more generalizable features across images. Finally, we relabel 25,828 images and formulate the first point-supervised MD framework. Extensive experiments conducted on four mirror benchmarks under three settings demonstrate that our method surpasses state-of-the-art approaches. Promising results are also achieved on six related benchmarks, showing its generality.

Abstract:
The objective of weakly supervised temporal action localization (WTAL) is to accurately identify the temporal intervals of actions using only video-level annotations for training. Existing cross-modal WTAL methods integrate vision-language models to provide rich semantic supervision, aiming to alleviate the inherent supervision limitations in weakly supervised scenarios. However, it is crucial to acknowledge that contemporary cross-modal methods incorporate textual information simultaneously, which inevitably introduces uncertainty in the alignment of cross-modal semantics. Moreover, previous approaches typically output deterministic temporal localization results, while neglecting to evaluate the predictive uncertainty and confidence of localization results. To address the above issues, we propose a novel Modeling Semantic and Localization Uncertainty (MSLU) framework for WTAL, which can simultaneously model the semantic uncertainty in cross-modal representations and the uncertainty of localization results to achieve more precise and robust temporal action localization. Specifically, we propose the Probabilistic Semantic Uncertainty Modeling (PSUM) module, which utilizes probabilistic encoding to capture diverse cross-modal feature representations, effectively mitigating semantic ambiguity in feature alignment. In addition, we propose the Uncertainty-guided Localization Estimation (ULE) module, which leverages evidential deep learning to estimate predictive uncertainty of localization results in weakly supervised scenarios. Through extensive experiments on benchmark datasets including THUMOS14, ActivityNet1.2, ActivityNet1.3, and FineAction, our framework demonstrates superior performance compared to existing state-of-the-art methods. The empirical results validate the effectiveness of simultaneously modeling both semantic and localization uncertainty.
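The abstract above does not spell out how the ULE module turns network outputs into an uncertainty score. As a rough illustration only, the sketch below assumes the common evidential deep learning formulation, in which non-negative evidence parameterizes a Dirichlet distribution and uncertainty is the number of classes divided by the total Dirichlet strength. The function name and the softplus evidence mapping are assumptions, not details taken from the paper.

```python
import math


def evidential_uncertainty(logits):
    """Return (expected class probabilities, uncertainty) for one prediction.

    Assumed formulation: evidence = softplus(logit), alpha = evidence + 1,
    uncertainty = K / sum(alpha), where K is the number of classes.
    """
    # Softplus keeps evidence non-negative (an assumed choice of mapping).
    evidence = [math.log1p(math.exp(z)) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)  # total Dirichlet strength
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength  # lies in (0, 1]
    return probs, uncertainty


# Strong evidence for one class gives low uncertainty;
# almost no evidence for any class pushes uncertainty towards 1.
confident_probs, confident_u = evidential_uncertainty([8.0, -4.0, -4.0])
ambiguous_probs, ambiguous_u = evidential_uncertainty([-4.0, -4.0, -4.0])
print(round(confident_u, 3), round(ambiguous_u, 3))
```

Under this assumed formulation, a snippet with little evidence for any action class receives an uncertainty close to 1, which is the kind of signal a localization head could use to down-weight unreliable predictions.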

Abstract:
Precipitation nowcasting based on radar echoes plays a crucial role in monitoring extreme weather and supporting disaster prevention. Although deep learning approaches have achieved significant progress, they still face notable limitations. For example, deterministic models tend to produce over-smoothed predictions, which struggle to capture extreme events and fine-scale precipitation patterns. Probabilistic generative models, due to their inherent randomness, often show fluctuating performance across different metrics and rarely achieve consistently optimal results. Furthermore, precipitation nowcasting is typically evaluated using multiple metrics, some of which are inherently conflicting. For instance, there is often a trade-off between the Critical Success Index (CSI) and the False Alarm Ratio (FAR), making it challenging for existing models to deliver forecasts that perform well on both metrics simultaneously. To address these challenges, we introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. Specifically, we propose SynCast, which leverages the two-stage post-training framework of Diffusion Sequential Preference Optimization (Diffusion-SPO) to progressively align conflicting metrics. In the first stage, the framework focuses on reducing FAR to deliver clean and high-precision predictions. Building on this foundation, the second stage further optimizes CSI under strict FAR constraints, thereby achieving synergistic improvements across these conflicting metrics. Experiments on three radar precipitation datasets demonstrate that SynCast reduces FAR while improving CSI, and achieves performance comparable to state-of-the-art methods. Furthermore, we verify that the post-training framework of Diffusion-SPO is compatible with multiple diffusion models for precipitation nowcasting, demonstrating its generalizability. 
The code for SynCast is available at https://github.com/Dtdtxuky/SynCast
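To make the CSI/FAR trade-off mentioned above concrete, here is a minimal sketch that computes both metrics from thresholded forecasts using their standard contingency-table definitions: CSI = hits / (hits + misses + false alarms) and FAR = false alarms / (hits + false alarms). The function name, the threshold value, and the toy data are illustrative assumptions and are not taken from the SynCast code.

```python
def csi_and_far(pred, obs, threshold):
    """Compute the Critical Success Index and False Alarm Ratio.

    pred, obs: equal-length sequences of intensities (e.g. radar echo values).
    A cell counts as an event when its value is >= threshold.
    """
    hits = misses = false_alarms = 0
    for p, o in zip(pred, obs):
        p_event, o_event = p >= threshold, o >= threshold
        if p_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif p_event:
            false_alarms += 1
    csi_den = hits + misses + false_alarms
    far_den = hits + false_alarms
    csi = hits / csi_den if csi_den else float("nan")
    far = false_alarms / far_den if far_den else float("nan")
    return csi, far


# Toy example: a cautious forecast has no false alarms but misses events,
# while an aggressive forecast catches every event at the cost of false alarms.
observed = [0, 0, 40, 40, 40, 0]
cautious = [0, 0, 40, 0, 0, 0]
aggressive = [0, 40, 40, 40, 40, 40]
print(csi_and_far(cautious, observed, threshold=30))
print(csi_and_far(aggressive, observed, threshold=30))
```

In this toy case the aggressive forecast has the higher CSI but also the higher FAR, which is the conflict that a two-stage scheme (first lower FAR, then raise CSI under a FAR constraint) is meant to address.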

Abstract:
Generalized Category Discovery (GCD) aims to classify both base and novel images using labeled base data. However, current approaches inadequately address the intrinsic optimization of the co-occurrence matrix \bar A based on cosine similarity, failing to achieve zero base-novel regions and adequate sparsity in base and novel domains. To address these deficiencies, we propose a Non-Negative Generalized Category Discovery (NN-GCD) framework. By establishing within the Symmetric Non-negative Matrix Factorization (SNMF) framework: (i) the equivalence between ideal k-means clustering and ideal SNMF, and (ii) the equivalence between SNMF solvers and Non-negative Contrastive Learning (NCL) optimization, we reformulate both the optimization of \bar A and k-means clustering as an NCL optimization problem. Moreover, to satisfy the non-negative constraints and make a GCD model converge to a near-ideal region, we propose a GELU activation function and an NMF NCE loss. To transition \bar A from a near-ideal state to the desired \bar A^ , we introduce a hybrid sparse regularization approach to impose sparsity constraints. Experimental results show NN-GCD outperforms state-of-the-art methods on GCD benchmarks, achieving an average accuracy of 66.9% on the Semantic Shift Benchmark, surpassing prior counterparts by 2.5%.

Abstract:
Semi-supervised multiview clustering (MVC) methods based on non-negative matrix factorization (NMF) have attracted considerable attention due to their ability to utilize partial supervision information and significantly enhance clustering performance. Still, they often face challenges such as high computational costs and difficulties in parameter tuning. To mitigate these issues, we propose a semi-supervised MVC method called Graph-Regularized and One-Hot-Constrained Symmetric NMF (MGOCS). The key advantages of MGOCS include a parameter-free semi-supervised strategy that encodes partial label information as one-hot vectors during decomposition, the use of a shared consistency indicator matrix directly as the clustering assignment matrix for all views, and the incorporation of graph regularization terms for each view. These features effectively balance model complexity with information utilization. An iterative optimization algorithm is developed to solve MGOCS, accompanied by complexity and convergence analysis. Clustering experiments conducted on six multi-view datasets demonstrate that MGOCS outperforms various state-of-the-art methods in terms of clustering performance. The code is available at https://github.com/ljisxz/MGOCS

Abstract:
Due to multidimensional heterogeneity, Multimodal Federated Learning (MMFL) confronts fundamental challenges including modality incongruence, modality agnosticism, and modality incompleteness. Existing methods face a trilemma: leveraging external data with privacy risks, isolating features to restrict cross-modal interaction, or incurring high overhead from complex graph-based coordination, all culminating in suboptimal performance. In this paper, we propose a Modality-Agnostic Hybrid Federated Learning (MA-HyFL) framework that synergistically integrates unimodal and multimodal federated processes in modality-agnostic scenarios. Specifically, a bidirectional cross-modal knowledge distillation is employed to promote comprehensive collaboration at inter-client and intra-client levels, enabling robust knowledge transfer among heterogeneous modalities. A reinforcement learning-based aggregation mechanism is further introduced to orchestrate federated workflows through reward-driven policy optimization, dynamically integrating contributive client selection and adaptive aggregation weighting for closed-loop decision-making. Extensive experiments show that MA-HyFL significantly outperforms baseline methods in four real-world applications, each exhibiting varying degrees of statistical heterogeneity and missing rates.

Abstract:
Visual question answering (VQA) models, which answer questions about images by combining both visual and textual information, have been proven susceptible to adversarial attacks. These attacks introduce subtle perturbations to the input data to manipulate the model’s predictions. This paper focuses on adversarial attacks targeting VQA models that follow the “pre-training & fine-tuning” paradigm, an area that remains under-explored. We have identified two key issues in the current field. On one hand, existing multi-modal attacks have a low attack success rate (ASR) due to inter-modal semantic inconsistency caused by insufficient cross-modal interaction. On the other hand, the dilemma between attack effectiveness and stealthiness limits the practical applicability of adversarial texts. To address these issues, we propose GIGAS, an innovative attack that uses multi-modal generative models to explore multi-modal interaction through three key modules tailored to solve the above-mentioned problems. MIGA aligns adversarial visual features with the semantics of misleading images generated by multi-modal generative models to mitigate cross-modal inconsistencies. GSA employs MLLMs to generate natural adversarial texts with greater variation and filters them by evaluating their similarity with respect to the clean images, balancing effectiveness and stealthiness. Iteration Allocation dynamically adjusts attack iterations based on image-text similarity, maximizing the utility of the limited iterations. Experiments conducted on various VL models and VQA datasets demonstrate superior attack performance, with an average ASR of 89.09% on VQAv2.0. Furthermore, GIGAS exhibits outstanding transferability, around 60% ASR, across diverse models and specific domains. Our code will be available at: https://github.com/Yvonna-cloud/GIGAS.

Abstract:
Human cognitive systems excel at making approximate decisions in complex and uncertain environments, a capability particularly evident in visual perception. This inherent fuzzy decision-making ability has profound implications for multimodal image fusion, where the scarcity of ground-truth data for infrared and visible light integration presents a fundamental challenge for traditional deep learning approaches. Here we introduce a zero-shot transformer architecture that mirrors human cognitive flexibility by implementing fuzzy decision-making mechanisms for pixel-level fusion weight determination. Our approach circumvents the limitations of conventional pre-training requirements through a sparse attention mechanism that selectively preserves only 15-20% of the most salient cross-modal interactions, effectively filtering out redundancy and noise. To address the computational challenges of high-resolution data, we incorporate low-rank approximation techniques that reduce the complexity from quadratic to linear, capturing over 95% of cross-modal information using a projection dimension of merely 128. The method demonstrates remarkable stability across multiple benchmark datasets, achieving a peak signal-to-noise ratio of 65.904 dB under diverse environmental conditions. Our findings challenge the prevailing assumption that dense attention patterns are essential for effective feature integration, revealing instead that multimodal fusion inherently operates in a low-dimensional space. This biomimetic approach to zero-shot learning not only advances our understanding of cross-modal feature interactions but also provides a more generalizable framework for real-world applications where ground-truth data is scarce.
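The abstract above states that only 15-20% of cross-modal interactions are kept, but it does not say how they are selected. The sketch below illustrates one simple possibility, offered purely as an assumption: for a single query, keep the top-scoring fraction of attention scores and renormalize with a softmax over the survivors. The function name and the `keep_ratio` parameter are hypothetical.

```python
import math


def topk_sparse_attention_weights(scores, keep_ratio=0.2):
    """Keep only the highest-scoring fraction of attention scores for one query.

    scores: raw similarity scores between one query and all keys.
    Returns weights that are zero outside the kept set and sum to 1 inside it.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in kept)  # subtract the max for numerical stability
    exp_scores = {i: math.exp(scores[i] - peak) for i in kept}
    total = sum(exp_scores.values())
    return [exp_scores[i] / total if i in exp_scores else 0.0
            for i in range(len(scores))]


# Ten candidate interactions; with keep_ratio=0.2 only the two strongest survive.
raw = [0.1, 2.5, -0.3, 0.0, 1.9, 0.2, -1.0, 0.4, 0.05, 0.3]
weights = topk_sparse_attention_weights(raw, keep_ratio=0.2)
print([round(w, 3) for w in weights])
```

Masking before the softmax means discarded interactions contribute exactly zero weight, rather than a small residual that would still pass noise into the fused output.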

Abstract:
Surveillance facial images are often captured under unconstrained conditions, resulting in severe quality degradation due to factors such as low resolution, motion blur, occlusion, and poor lighting. Although recent face restoration techniques applied to surveillance cameras can significantly enhance visual quality, they often compromise fidelity (i.e., identity-preserving features), which directly conflicts with the primary objective of surveillance images: reliable identity verification. Existing facial image quality assessment (FIQA) methods predominantly focus on either visual quality or recognition-oriented evaluation, thereby failing to jointly address visual quality and fidelity, which are critical for surveillance applications. To bridge this gap, we present the first comprehensive study on surveillance facial image quality assessment (SFIQA), targeting the unique challenges inherent to surveillance scenarios. Specifically, we first construct SFIQA-Bench, a multi-dimensional quality assessment benchmark for surveillance facial images, which consists of 5,004 surveillance facial images captured by three widely deployed surveillance cameras in real-world scenarios. A subjective experiment is conducted to collect quality ratings along six dimensions (noise, sharpness, colorfulness, contrast, fidelity, and overall quality), covering the key aspects of SFIQA. Furthermore, we propose SFIQA-Assessor, a lightweight multi-task FIQA model that jointly exploits complementary facial views through cross-view feature interaction, and employs learnable task tokens to guide the unified regression of multiple quality dimensions. Experimental results on the proposed dataset show that our method achieves the best performance compared with state-of-the-art general image quality assessment (IQA) and FIQA methods, validating its effectiveness for real-world surveillance applications. The code and dataset are publicly available at: https://github.com/Jiang-yan-wei/SFIQA.

Abstract:
Recently, lightweight networks for single image super-resolution (SISR) have surged due to the needs of resource-constrained devices, where divide-and-conquer multi-route models exhibit an impressive trade-off between performance and computational cost. However, most existing divide-and-conquer multi-route models face two key limitations: 1) possibly suboptimal decoupling of image components (e.g., smooth regions, edges, and texture details) due to spatial-domain-only processing, and 2) inability to model global dependencies explicitly and capture structural information, hindering further performance gains. To address these drawbacks, we propose a lightweight frequency-selection-based progressive patch Transformer network (FSPPTN) for higher-quality SISR reconstruction. Specifically, we first propose a frequency selection module, in which we develop a frequency enhancement branch (FEB) to dynamically decouple different image components by introducing the window-based Fast Fourier transform (WFFT) and a learnable weight matrix, and a spatial restoration branch (SRB) to recalibrate and fuse cross-granularity features by designing a multi-gate mechanism for reconstructing the component information screened out by the FEB at the current level. Secondly, we propose a lightweight multi-branch gradient-guided inter-patch self-attention to explicitly capture global structural similarities by summarizing the structural information of each patch into a lower-dimensional space using the statistical properties of first-order gradients, thereby modeling global dependencies explicitly while remaining lightweight. Extensive experimental results demonstrate that, in the vast majority of cases, FSPPTN outperforms state-of-the-art lightweight SISR methods in terms of both performance and computational overhead, especially for ×3 and ×4 SR; e.g., FSPPTN outperforms MaIR-Small by 0.14 dB PSNR on the Manga109 dataset for ×4 SR with 48.3% fewer parameters and 63.5% lower FLOPs.
The code is available at: https://github.com/yslyangshuli/FSPPTN-main

Abstract:
Reversible data hiding in JPEG images is critical for secure multimedia applications. However, existing histogram modification schemes are confined to fixed 1D or 2D histograms. These methods fail to adapt to the characteristics of cover images and the demands of embedded data, resulting in suboptimal trade-offs among embedding capacity, image quality, and file size. To address these issues, this study proposes a Multi-Dimensional Histogram Modification Framework that advances histograms from “fixed dimension” to “variable dimension” and extends them to high-dimensional structures. Our key contributions are as follows: first, we propose an adaptive dimension selection strategy that dynamically determines the optimal histogram dimension by evaluating the capacity and distortion for each cover image and embedding task; second, we develop a high-dimensional histogram construction method that combines AC coefficients with absolute values of 1 and 2 at the same frequency to enhance the utilization of coefficient correlation; third, we design an optimized mapping algorithm that searches for the optimal mapping matrix to minimize both distortion and file expansion. Experimental results show that the proposed method outperforms existing 1D and 2D histogram modification schemes. On average, it increases the embedding capacity by 5000 bits, improves the image quality by 0.2 dB, and reduces the file size by 4000 bits.
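For readers unfamiliar with the fixed 1D baseline that the abstract above contrasts against, the following sketch shows classical 1D histogram shifting on quantized AC coefficients: coefficients equal to ±1 each carry one bit, larger magnitudes are shifted outward by one to make room, and zeros are left untouched. This illustrates the baseline idea only, not the proposed multi-dimensional method. The function names and the zero-padding of unused capacity are assumptions made for the example.

```python
def sign(c):
    """Return -1, 0, or 1 according to the sign of c."""
    return (c > 0) - (c < 0)


def embed(coeffs, bits):
    """Embed bits into AC coefficients with 1D histogram shifting."""
    capacity = sum(1 for c in coeffs if abs(c) == 1)
    if len(bits) > capacity:
        raise ValueError("payload exceeds embedding capacity")
    # Pad unused capacity with zeros so every +/-1 coefficient is handled uniformly.
    payload = list(bits) + [0] * (capacity - len(bits))
    marked, k = [], 0
    for c in coeffs:
        if c == 0:
            marked.append(0)                         # zeros are never modified
        elif abs(c) == 1:
            marked.append(c + sign(c) * payload[k])  # +/-1 stays or becomes +/-2
            k += 1
        else:
            marked.append(c + sign(c))               # shift outward to free the +/-2 bin
    return marked


def extract(marked, n_bits):
    """Recover the original coefficients and the first n_bits payload bits."""
    coeffs, bits = [], []
    for m in marked:
        if m == 0:
            coeffs.append(0)
        elif abs(m) <= 2:
            bits.append(abs(m) - 1)      # magnitude 1 -> bit 0, magnitude 2 -> bit 1
            coeffs.append(sign(m))
        else:
            coeffs.append(m - sign(m))   # undo the outward shift
    return coeffs, bits[:n_bits]


cover = [0, 1, -1, 3, 0, -2, 1, 5]
message = [1, 0, 1]
stego = embed(cover, message)
restored, recovered = extract(stego, len(message))
print(stego, restored == cover, recovered == message)
```

Every shifted coefficient adds distortion without carrying any payload, which is one reason a fixed 1D mapping can be a poor fit for some cover images and payload sizes.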

Abstract:
Temporal context modeling constitutes a fundamental issue for robust visual tracking. However, existing approaches are plagued by a critical granularity trade-off: frame-level template update mechanisms inevitably introduce background redundancy due to global frame information aggregation, while token-level propagation mechanisms undermine inherent local spatial correlations via independent feature transmission. To resolve this challenge, we propose WmLSTM, a plug-and-play Window-level mLSTM-based temporal encoder that reconfigures the temporal modeling paradigm for visual tracking. First, window-centric modeling retains intra-window spatial correlations while adaptively suppressing background clutter. Second, we pioneer the application of mLSTM in visual tracking, exploiting its explicit memory architecture that outperforms implicit sequence modeling alternatives (e.g., Mamba). Third, our plug-and-play design enables seamless integration with state-of-the-art trackers with minor computational overhead. Extensive experiments on seven benchmark datasets validate that WmLSTMTrack achieves an excellent balance among accuracy, speed, and parameter efficiency, attaining state-of-the-art accuracy on five benchmarks, superior real-time speed (GPU: 201 fps, CPU: 47 fps), and compact model size (8.22 M parameters). Moreover, the WmLSTM module consistently enhances the performance of diverse trackers, e.g., real-time FERMT-256: +2.3 points SR75 on GOT-10k, non-real-time EVPTrack-224: +1.8 points P on LaSOT_ext, with merely 30 training epochs. The source code is available at https://github.com/Xiaochen918/WmLSTM.

Abstract:
Visual tracking is essential across numerous applications, including video analysis, surveillance systems, entertainment, and autonomous systems. However, most conventional state-of-the-art visual trackers are designed for constant-view scenarios with fixed camera viewpoints, and they only achieve satisfactory performance when visual features are stable. In reality, visual tracking often encounters shift-view scenarios (e.g., sports broadcasting, ground-aerial surveillance), where the camera view transitions dynamically between ground and aerial views. These shifts lead to large variations in target scale and environmental complexity, resulting in inconsistent visual features that ultimately degrade the robustness of conventional visual trackers. Although developing a dedicated tracker for such shift-view scenarios is possible, it incurs substantial time and computational costs. To address this challenge, we propose Shift-view Prompt Tuning, a cost-efficient method that enables conventional trackers to handle dynamic view transitions. We use sample pairs from different view datasets as prompts to guide the tracker’s adaptation. By embedding distinctive visual information from these prompts into training samples, we help the tracker learn about dynamic view transitions without requiring it to be retrained from scratch. This approach seamlessly transforms any constant-view tracker into a shift-view tracker. Our extensive experiments on 14 datasets with 3 different view types show that our approach significantly enhances tracking performance. This advancement extends the application scope of current trackers and offers a robust solution for multimedia content production, sports analytics, and security monitoring in video analysis systems.

Abstract:
Fully sparse fusion strikes an excellent balance between efficiency and accuracy in multi-modal 3D object detection. However, most existing methods focus on foreground objects while overlooking background context. This oversight compromises detection robustness, especially for occluded or small-sized objects, leading to suboptimal detection performance. To address this limitation, we propose a novel fully sparse fusion framework (PLPFusion), which introduces a hierarchical Plane-Line-Pixel representation to progressively model the object-context relationships. PLPFusion comprises three key modules: the Plane Enhancement Module (PEM), the Line Alignment Module (LAM), and the Pixel-Level Aggregation Module (PLAM). Firstly, PEM utilizes geometric cues from LiDAR feature planes to generate spatially-aware object queries. Secondly, LAM further refines these queries with geometric priors for semantic awareness. Lastly, PLAM aggregates pixel-level context to enhance discriminative completeness by leveraging the semantically-aware object queries. On the nuScenes benchmark, PLPFusion achieves 71.9% mAP and 74.0% NDS, outperforming the baseline method FUTR3D by +2.5% mAP and +1.9% NDS, respectively. On the KITTI benchmark, it achieves 72.68% BEV mAP and 67.39% 3D mAP. These results confirm its robustness and effectiveness in diverse multi-modal 3D scenarios. The code of PLPFusion is available at https://github.com/Text357/PLPFusion

Abstract:
To enhance 3D object detection in autonomous driving, recent work combines LiDAR and camera data. However, prior methods often suffer from inadequate image depth information and fixed-weight fusion strategies, limiting semantic extraction and adaptability. PVF-DectNet++ builds on our prior work by employing a perspective voxel projection technique to align both feature types. It introduces an adaptive image semantic feature extraction approach that interpolates image and point cloud intensity into a dense RGB-I multi-channel representation, facilitating the extraction of global, multi-level image features. Furthermore, during the fusion process, a learnable fusion module is designed to address the challenge of individual channels being unable to adapt to varying appearances, colors, and environmental conditions. Experiments on KITTI, nuScenes, and Waymo comprehensively validate PVF-DectNet++. On KITTI, it achieves detection accuracies of 66.3% for pedestrians, 78.8% for cyclists, and 86.8% for vehicles, yielding a 3.56% mAP improvement over PVF-DectNet. Additional tests show further gains, with mAP and NDS increases of 3.8% and 2.6% on nuScenes, and notable boosts in pedestrian and cyclist AP on Waymo. Compared with existing networks, PVF-DectNet++ consistently delivers superior performance, particularly for pedestrian and cyclist detection across diverse benchmarks. The code and model will be released at https://github.com/CQU-AVL/PVF-DectNet-

Abstract:
Weakly supervised Referring Expression Grounding (WREG) aims at grounding the target region based on a given expression, where the mapping between regions and expressions is unknown during training. Recent WREG methods generate pseudo-labels with Vision-Language Pre-training (VLP) models to avoid the cross-modal heterogeneous gaps arising from the two-stage reconstruction strategy. However, mainstream VLP models are trained with image-text alignment data, which makes the generated labels ill-suited to the REG task. Furthermore, due to the constraints of WREG data, it is challenging to ensure the quality of the pseudo-labels. To this end, we propose a Cyclic Pseudo-label Generation and Refinement (CPGR) method to alleviate the above limitations. Specifically, we cycle through the process of Generation-Refinement-Grounding to alleviate the impact of missing region annotations. We perform REG task-adaptive fine-tuning on BLIP-2 to generate REG-style descriptions with Region-Centrality. Then, we design a Pseudo-label Refinement module that utilizes cross-modal token attention to enhance the reliability of pseudo-labels and ensure their Reference-Discrimination. Experiments on five benchmark datasets demonstrate that our proposed method outperforms the current state-of-the-art weakly supervised methods. Our code and models will be released at https://github.com/5jiahe/CPGR

Abstract:
The lack of flexibility and effectiveness in skeleton models, combined with the limited information in skeleton data, has left fine-grained modeling insufficiently explored in recent skeleton-based action recognition. Confronting these challenges, we propose the Dynamic prompting Spatial Temporal Actor transFormer (DSTAFormer), a powerful framework which neatly unifies vision and language. Specifically, we introduce a decoupled vision transformer, which consists of three components: Spatial transFormer (SF), Temporal transFormer (TF), and Actor transFormer (AF), to account for numerous visual aspects of the human body, namely spatial, temporal, and interactive relations. Compared to the vanilla transformer, we reformulate self-attention using a Statistically-inspired Attention Reconstruction (SAR) module and Local-specific constraints, thereby enabling a more explicit and interpretable exploration of the action’s fine-grained compositions. The skeleton sequences are processed by this decoupled structure to generate the visual embeddings. To encode environmental interactions that skeletal coordinates inherently lack, we utilize a Dynamic Prompting (Dp) strategy to generate visual-based textual prompts. These prompts are transformed into discriminative textual embeddings via a pre-trained large language model (LLM). We also design a Semantic Adapter (SA) to bridge the modality gap. The cross-modality embeddings are projected into a unified feature space for contrastive co-training. This infusion of knowledge into the skeleton data enhances its semantic richness, pushing the boundaries of fine-grained understanding. We evaluate our framework on the NTU RGB+D, NTU RGB+D 120, and Toyota Smarthome datasets. DSTAFormer achieves performance comparable to state-of-the-art methods.

Abstract:
Camera calibration enables the automatic estimation of intrinsic and extrinsic camera parameters, uncovering correspondences between 2D images and 3D real-world coordinates. For highway surveillance cameras, existing methods often rely on cumbersome procedures to extract limited priors (e.g., vanishing points or reference points) and provide incomplete estimations (e.g., roll angle). Therefore, we leverage the multilayered lane lines on highways, which offer rich priors such as segment lengths, intervals, and lane widths, to develop a novel camera calibration and vehicle speed estimation method. For camera calibration, our approach performs road instance segmentation and extracts multilayered lane-line keypoints (MLK) while mitigating environmental interference and dynamic vehicle occlusions. An MLK-based calibration model is constructed and an angle-polling Levenberg-Marquardt algorithm is designed to estimate key parameters, including focal length, three rotation angles, and lane-line distance. For vehicle speed estimation, multi-object tracking (MOT) algorithms are integrated with the calibration model to infer the average speeds of all identified vehicles. We collected real highway video footage from four different camera setups on Chinese highways. Experimental results demonstrate that our method outperforms existing methods across all setups. The impact of key parameters is evaluated to determine the optimal configuration. Lastly, the method’s effectiveness in vehicle speed estimation is assessed using advanced MOT algorithms.

Abstract:
Advances in image editing models have enabled intelligent, rapid fashion customization. Vision-guided editing models, in particular, offer more precise and flexible control over fine-grained garment attributes. However, existing methods are limited to coarse-grained edits and fail to achieve attribute-level manipulation, thereby restricting the flexibility and composability required in fashion customization. To address these issues, this paper proposes a Vision-Guided Fashion Fine-Grained Attribute Editing (VFFAE) framework, which leverages visual references to achieve customized editing of both style and structure in fine-grained garment regions. The VFFAE framework involves three key components: 1) a text-driven fashion fine-grained attribute segmenter that incorporates garment keypoints as spatial priors and applies deformable attention to enhance spatial perception, with CLIP-based multimodal alignment for accurate segmentation; 2) a clothing attribute disentanglement module based on orthogonal subspace projection of CLIP embeddings, enabling zero-shot explicit separation of style and structure attributes; and 3) a conditional diffusion pipeline that leverages disentangled representations of segmented regions to fine-tune a pretrained Stable Diffusion model under classifier-free guidance, enabling controllable attribute editing. Experiments on multiple public datasets show that VFFAE surpasses state-of-the-art methods, and ablation analyses confirm the effectiveness of its segmentation and disentanglement modules, establishing it as a practical solution for high-fidelity attribute-level fashion customization.

Abstract:
Image desnowing is an important task in image enhancement and restoration. It aims to reduce the impact of snowfall on image quality and downstream vision tasks. Although recent methods perform well on synthetic datasets, their robustness in real-world scenarios is limited due to the complex and diverse appearance of snow particles. To address this issue, we propose DBRNet, a semi-supervised image desnowing network that improves generalization in real conditions. DBRNet adopts a cascaded recursive structure, using multiple recursive modules to progressively refine features. During training, a dual-branch strategy combining supervised and unsupervised learning is designed, utilizing synthetic paired data for labeled supervision while introducing regularization constraints using real unlabeled images. A dual-branch design is also embedded in each recursive module, enabling explicit separation and joint learning of snow removal and background recovery. Extensive experimental validation demonstrates that this method not only outperforms existing mainstream snow removal algorithms across multiple public snow removal datasets but also exhibits exceptional snow removal performance and robust generalization capabilities on real-world snowy images.

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; College of Information Science and Technology, Jinan University, Guangzhou, China; Theory Laboratory, Labs, Shenzhen, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey, Guildford, U.K.

Abstract:
Unsupervised domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Existing methods, whether based on distribution matching or self-supervised learning, often focus solely on classifying individual source samples, potentially overlooking discriminative information. To address this limitation, we propose FlowUDA, a novel plugin method that enhances existing UDA frameworks by constructing semantically invariant flows from individual source samples to corresponding target samples, forming cross-domain trajectories. By leveraging a diffusion network guided by ordinary differential equations, FlowUDA ensures these flows preserve the topological structure of the source domain, maintaining their distinguishability. Our method then classifies these flows by sampling points along them and transferring labels from source samples, effectively capturing spatial relationships between domains. In essence, FlowUDA transforms the traditional point-based classification on individual source samples into flow-based classification on flows, allowing the model to learn richer, more discriminative features that bridge the gap between source and target domains. Extensive experiments on standard benchmarks demonstrate that integrating FlowUDA into existing UDA methods leads to notable performance gains, highlighting its effectiveness in addressing domain shift challenges.

Abstract:
Inertial Measurement Units (IMUs) can capture intricate kinematic behaviors, thereby enhancing the performance of human action recognition methods. Consequently, this technology has recently garnered considerable attention within this domain. However, existing methods either encounter limitations in instance-level visual representation due to self-occlusion or fail to fully utilize the potential of complex kinematic information, making it challenging to adequately capture the intricate relationships between the two data sources. In this paper, we tackle this issue through causal learning and Fourier learning. Specifically, we introduce a novel framework called Causal-Inspired Fourier Representation Learning (CIFRL) for Wearable IMUs and Egocentric Action Recognition, which aims to enhance cross-modal feature alignment. The framework consists of two key components: 1) Temporal Causal Modeling (TCM), designed for video interpretation from a causal perspective; 2) Spectral-Temporal Learning (STL), which aims to decompose the inertial data using Fourier representation and align cross-modal features. We evaluate our proposed framework on the WEAR and CMU-MMAC benchmarks. Empirical results demonstrate the superior performance of our CIFRL approach compared to state-of-the-art methods. Our code is available at https://github.com/Adrianos1219/CIFRL

Abstract:
Recently, two-stage 3D human pose estimation using monocular cameras has gained significant attention. However, the inherent uncertainty in the upscaling process from 2D to 3D often compromises the accuracy of deterministic methods. To address this, we propose a novel diffusion-based refinement framework (DRPose) which models the uncertainty during the upscaling process by introducing stochastic noise to the initially predicted 3D poses. This approach facilitates the generation of more realistic predictions through iterative refinement with multiple noise samples, ultimately producing multi-hypothesis predictions that better align with ground truth. Our framework incorporates two key components: a Graph Convolution Transformer module (SGCT), which integrates scaling and displacement adjustments based on conditional information with a joint temporal-spatial feature separation mechanism, and a Pose Refinement Module (PRM), which balances the initial and refined poses. This design allows DRPose to effectively refine pose estimation for both individual frames and sequential data. Furthermore, our framework establishes new benchmarks for performance in both frame2frame and seq2frame scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets. Notably, when applied to the current state-of-the-art single-frame 3D pose extractor, our multi-hypothesis optimization achieves an 18.8% reduction in Mean Per Joint Position Error (MPJPE) and a 16.9% reduction in Procrustes MPJPE (P-MPJPE). Code is available at https://github.com/KHB1698/DRPose
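
The MPJPE figure quoted above can be made concrete with a short sketch. This is only an illustration of the standard metric definition (mean Euclidean distance over joints), not the DRPose evaluation code; the function names are hypothetical, and the best-of-N selection is an assumed, commonly used multi-hypothesis protocol that the abstract does not confirm. P-MPJPE, which first rigidly aligns the prediction to the ground truth, is omitted.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def best_hypothesis_error(hypotheses, gt):
    """Lowest MPJPE among candidate poses (assumed best-of-N protocol)."""
    return min(mpjpe(h, gt) for h in hypotheses)


# Two-joint toy skeleton; coordinates are in arbitrary units.
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # per-joint errors: 3 and 4
closer = [(0.0, 0.0, 2.0), (1.0, 4.0, 0.0)]  # per-joint errors: 1 and 0

print(mpjpe(pred, gt))
print(best_hypothesis_error([pred, closer], gt))
```

A relative reduction such as the one reported is then simply `(baseline - refined) / baseline` on this quantity.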

Abstract:
Full-reference image quality assessment (FR-IQA) has been extensively explored over the past two decades and is one of the most fundamental and active topics in the image processing community, owing to its indispensable role in quantitatively describing image quality degradation and guiding algorithm and system optimization. However, full-reference omnidirectional image quality assessment (FR-OIQA) has achieved less success because of the natural gap between 2D images and omnidirectional images (OIs). To this end, we present a novel FR-OIQA model with Inter-Patch and Sequence Similarity (IPSS). Specifically, to avoid the extra computational load of viewport generation/prediction methods, IPSS processes OIs in a viewport-unaware manner, i.e., it directly extracts a patch sequence from an OI in Equirectangular Projection (ERP) format while retaining regions of interest. Furthermore, since patches from an ERP image contain inherent geometric deformation, a deformation-aware convolution is plugged into feature extraction and used to distill quality-aware features from the intrinsic pseudo-degradation, which are then utilized to measure inter-patch similarity. Finally, a distortion-aware interaction module is used to aggregate patch-wise quality-aware features, whose output is used to calculate patch-sequence similarity, i.e., the global quality of the OI. Through comprehensive experiments on a large-scale OIQA database, we demonstrate the superiority of the proposed IPSS and the effectiveness of each module. The source code is available at https://github.com/18liu/IPSS

Abstract:
Integrating 3D Gaussian Splatting (3DGS) for dense scene reconstruction has recently gained significant attention in the field of Visual Simultaneous Localization and Mapping (V-SLAM). However, the static scene assumption underlying both V-SLAM and 3DGS limits their effectiveness in real-world environments populated with dynamic objects. Dynamic objects not only degrade SLAM tracking performance but also compromise the spatial-temporal consistency of the reconstructed map, leading to severe system failures. In this work, we propose DGS-SLAM, a novel 3DGS-based V-SLAM system capable of robust self-localization and dense mapping in dynamic environments. To address dynamic scenes, DGS-SLAM integrates several key strategies: 1) object association that fuses visual and geometric information to match objects between adjacent frames; 2) motion check that extends object association to a long-term sliding window to accurately perceive the movement of objects; 3) local fine-tuning that repairs the 3DGS model and updates keyframe poses after eliminating dynamic objects; 4) keyframe selection that promptly selects static keyframes to optimize the static regions in the 3DGS model. Extensive experiments on the TUM RGB-D and BONN RGB-D Dynamic datasets demonstrate that DGS-SLAM significantly improves localization accuracy in dynamic scenes while generating high-quality static maps, compared to other existing state-of-the-art 3DGS-based methods.

Abstract:
The Vision-language navigation task requires agents to efficiently interpret visual cues in the environment and accurately follow long-range instructions, posing significant challenges to their scene memory and spatial reasoning capabilities. Existing methods typically construct memory systems directly from raw visual observations. However, task-irrelevant cues commonly present in the environment can continuously introduce localization errors during navigation, severely limiting the agent’s performance in complex scenes. Meanwhile, due to the lack of transferable general knowledge priors, existing agents exhibit notable limitations in spatial perception, which undermines the reliability of their decision-making in unseen environments. To address these issues, this paper proposes the dynamic object filtering with Spatial Perception Enhancement for Vision-Language Navigation (SPENav), which aggregates open-vocabulary perception with multi-level information modeling. At the local level, the Hierarchical Semantic Prior Extractor and Room-Information-Guided Filtering construct task-oriented semantic priors to capture critical objects and suppress irrelevant features. At the global level, the Spatial-Instructional Guided Dual Attention module leverages spatial information and instruction guidance to enable the agent to develop selective memory that is goal- and task-oriented. On the unseen test split of R2R, SPENav achieves a 76% Success Rate (SR) and a 65% Success weighted by Path Length (SPL). These results demonstrate the effectiveness of task-oriented feature selection and multi-level semantic modeling in enhancing cross-modal understanding and adaptive navigation performance.
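
For readers unfamiliar with the two navigation metrics reported above, the following sketch shows how Success Rate and Success weighted by Path Length are conventionally computed. It follows the widely used definition in which each successful episode is weighted by the ratio of the shortest-path length to the longer of the shortest and actual path lengths; the episode tuples and function names are hypothetical, and this is not the SPENav or R2R evaluation code.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for ok, _, _ in episodes if ok) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_length, agent_path_length).
    Assumes shortest_path_length > 0 for every episode.
    """
    total = 0.0
    for ok, shortest, taken in episodes:
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: an optimal success, a success with a detour twice
# the optimal length, and a failure.
episodes = [
    (True, 10.0, 10.0),
    (True, 10.0, 20.0),
    (False, 10.0, 5.0),
]

print(success_rate(episodes))
print(spl(episodes))
```

Because detours are penalized, SPL can never exceed SR, which is consistent with the 65% SPL sitting below the 76% SR.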

Abstract:
Visual place recognition is a fundamental task essential for applications like visual localization and loop closure detection. Existing methods perform well under controlled environments, but often fail in scenarios with significant domain shifts, such as drastic day-to-night transitions and severe occlusions. This limitation arises because existing approaches are globally optimized without explicit supervision for out-of-distribution (OOD) adaptation and overlook semantics as a complementary modality for improving OOD robustness via local context refinement. To address this, we propose a dual-branch network that jointly optimizes feature attention and feature description under semantic guidance, achieving improved OOD adaptation with overhead comparable to existing methods. The feature attention branch is guided by semantically-informed context richness, while the feature description branch is supervised through inter-class repelling and intra-class re-ranking. Additionally, we introduce a simple yet effective query rejection module that leverages the learned attention to assess an image’s informativeness, allowing it to exclude queries that lack place-representative context. Extensive experiments demonstrate that our method raises the average Recall@1 and Recall@5 by 3.5 and 3.9 percentage points over its state-of-the-art counterpart, and accelerates feature matching by 28% for downstream visual localization without performance degradation.
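
As a reminder of what the Recall@1 and Recall@5 gains refer to, here is a minimal sketch of the usual retrieval metric: a query counts as correct if at least one true match appears among its top K retrieved database images. The data layout and names are assumptions made for illustration and do not reflect the authors' evaluation pipeline.

```python
def recall_at_k(rankings, positives, k):
    """Share of queries with at least one true match in the top-k results.

    rankings:  dict mapping query id -> database ids, best match first
    positives: dict mapping query id -> set of correct database ids
    """
    hits = 0
    for query, ranked in rankings.items():
        if any(candidate in positives[query] for candidate in ranked[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical retrieval results for two queries.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
}
positives = {
    "q1": {"a"},  # correct match is ranked first
    "q2": {"f"},  # correct match is ranked third
}

print(recall_at_k(rankings, positives, k=1))
print(recall_at_k(rankings, positives, k=3))
```

A gain stated in percentage points is an absolute difference between two such values (for example 0.800 to 0.835), not a relative change.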

Abstract:
Vision-language models have the potential to enrich purely visual tasks by utilizing the combined representation of images/videos and corresponding textual descriptions. Recent advances in video anomaly detection have also integrated textual information to enhance the understanding of abnormal events. However, existing approaches often merge visual and textual modalities in a straightforward, bottom-up manner, failing to fully explore their interconnections. Moreover, textual captions themselves do not inherently convey “abnormal” attributes. Consequently, these joint representations tend to highlight all salient input features without adequately focusing on high-level tasks such as video anomaly detection. To direct the model’s attention towards anomalies more effectively, we propose incorporating a top-down mechanism into weakly supervised video anomaly detection tasks. A new Knowledge Sharing and Feedback (KSF) framework is designed to unify the representation of anomalies across both video and text. Specifically, we develop a category pattern sharing module that performs knowledge matching, acting as an alignment bridge between abnormal events and their corresponding descriptions. This ensures consistent representations for identical anomalies while maintaining distinct representations for different ones. Following this alignment process, matched high-level semantic priors are fed back into the forward path to enhance differentiation between abnormal and normal patterns. Comprehensive experiments on three benchmark datasets demonstrate the superiority of our proposed method in learning the implicit definition of anomaly patterns. The code is available at https://github.com/XJ-Cai/KSF

Abstract:
Recently, image tampering localization techniques for scientific publications have attracted increasing attention due to the prevalence of data manipulation and the integrity issue of image content. However, existing methods remain ineffective at exposing tampering traces in scientific images because of their unique properties, such as acquisition noise and ambiguous edges. To address these limitations, we propose a Dynamically Perceived Forgery Conditional Diffusion Model, which formulates the prediction of the localization mask as a noise-state aware denoising process. This process progressively localizes the tampered regions by involving time-step guidance to dynamically perceive tampering traces under the variation of diffusion noise, which is jointly controlled by two conditions: a forgery condition with hierarchically aggregated forensic clues and an enhanced edge condition with multilevel spatial attention. To apply dynamic control efficiently, the two conditions are fused and then applied to the denoising process via a channel-cross attention module. Furthermore, in the inference stage, a salient element ensemble-based sampling strategy is developed to further improve reliability against undesired factors in scientific images. Extensive experiments on several scientific image tampering datasets, with comparisons against state-of-the-art methods, demonstrate the superiority of our method in intra-/cross-dataset evaluations and in robustness against post-processing operations.

Abstract:
Skeleton-based temporal action segmentation aims to capture key information in long skeleton motion sequences to temporally segment and identify actions at a fine-grained level. Existing approaches have achieved promising results by improving the modeling of topological spatial relationships and long-term temporal dependencies. However, current methods often overlook the distinct nature of motion and topological information, applying a monolithic modeling paradigm to both. This approach fails to fully exploit their differential contributions to precise boundary localization and effective class discrimination. To address these limitations, we propose a novel Topology-Motion Decoupling Framework (TMD). Our framework incorporates three key designs. First, an auxiliary Differential Motion Perception Branch explicitly models the temporal gradients of skeletal trajectory to decouple boundary-sensitive motion features. Second, we introduce two effective fusion modules that integrate the complementary features from both branches for mutual enhancement. Finally, a Boundary-Aware Textual Regularization scheme leverages a dual set of semantic prompts for boundary/non-boundary to differentially guide the feature learning process. By design, TMD explicitly mitigates semantic and temporal confusion between actions, thereby enhancing inter-class discriminability and boundary awareness. Extensive experiments on five challenging public datasets demonstrate that our TMD achieves state-of-the-art performance.

Abstract:
The anchor-based multi-view clustering method has recently attracted considerable attention due to its superior efficiency. However, most existing methods construct a consensus anchor graph based solely on view-level contributions, overlooking the varying importance of individual samples across different views. Moreover, these methods fail to ensure that anchors are evenly distributed across clusters. Thus, we propose a novel and scalable multi-view clustering method, called Sample-Level Weighted and Structure-Enhanced Anchor Graph Learning for Scalable Multi-View Clustering (SLWSE-AGL). Specifically, we introduce a sample-level weighting mechanism based on anchor self-representation learning, enabling the constructed consensus anchor graph to capture the varying importance of samples in different views. Additionally, we incorporate a structure-enhancement constraint to encourage the learned anchors to be more evenly distributed among clusters, leading to more balanced and meaningful cluster partitions. Furthermore, we employ an anchor-to-sample label propagation mechanism that directly yields the final clustering results, thereby avoiding the information loss associated with the two-stage clustering processes. Extensive experiments demonstrate the superior performance of our method compared to state-of-the-art multi-view clustering approaches. The code of SLWSE-AGL is publicly available at https://github.com/tangchuan2000/SLWSE-AGL

Abstract:
Enabling robots to perform everyday tasks has become increasingly important. Task planning, which decomposes task instructions into executable action sequences, is crucial for equipping robots with the ability to handle daily activities. Currently, there are two main effective methods for task planning: one relies on the reasoning capabilities of Large Language Models (LLMs), but it struggles with handling the underlying motion. The other is based on the generative capabilities of Vision-Language-Action (VLA) models, which often lack essential semantic details. To overcome these limitations, this paper introduces a novel Semantically Supervised Vision-Language-Action (SS-VLA) model. This model addresses the constraints of previous methods that relied solely on single-frame images by designing an adaptive visual sequence encoder that integrates continuous visual streams. This encoder efficiently captures and integrates multi-scale spatial and temporal features from the robot’s first-person visual perspective. Furthermore, the model utilizes LLMs to decompose task instructions into subtasks and organize them into a graph structure, using a Graph Attention Network (GAT) to extract features from subtask sequences and supervise the generation of action sequences. This method not only enhances the alignment of actions with task instructions but also ensures the contextual and semantic accuracy of the robot’s activities, significantly enhancing the task execution capabilities of robots in complex environments. We evaluated our model on the ALFRED and TEACh benchmarks, achieving higher performance compared to existing methods, especially in unseen scenes. Additionally, we successfully deployed our model in the AI2-THOR virtual environment and on the TIAGo real robot, demonstrating the effectiveness of our method. Our code is available at: https://github.com/Li-XD-Pro/SS-VLA

Abstract:
Learning robust and generalizable feature extractors to generate discriminative prototypes is crucial for few-shot action recognition. However, most existing methods rely on fine-tuning large pre-trained image models, easily leading to transferability and overfitting issues. In this paper, we propose a novel vision-language enhancement network based on decoupling-joint adaptation (VEDA) for few-shot action recognition, which decouples visual features into temporal and spatial branches, followed by a joint operation that integrates these two branches using an adapter-tuning paradigm. VEDA can gradually equip the model with spatio-temporal reasoning capabilities. Since relying exclusively on local frame feature matching results in inaccurate performance, we design a video-level relation module (VLR) to enhance video context awareness through global feature matching. In addition, we design a vision-language fusion module (VLF) that introduces multimodal information to alleviate the data scarcity issue. Simultaneously, we apply adapter-tuning to both visual and textual branches to enhance the generalization ability. Based on the proposed components above, our network can extract both informative and discriminative prototypes, resulting in excellent recognition performance. Experimental results on five challenging benchmarks demonstrate the effectiveness of the proposed VEDA. The code will be released soon at https://github.com/ReverseSuzhou/VEDA

Abstract:
Online Continual Learning (OCL) enables machine learning models to learn from a stream of non-stationary tasks, making it more aligned with real-world scenarios. However, OCL faces a significant challenge: catastrophic forgetting, wherein the knowledge learned in previous tasks is substantially overwritten upon encountering new tasks, leading to a biased forgetting of prior knowledge. Among various OCL strategies, replay-based methods have proven particularly effective in mitigating catastrophic forgetting by maintaining a small buffer of past samples and retraining on them alongside new data. However, due to strict memory constraints, these replay buffers often fail to adequately represent the true data distribution of previous tasks. This leads to distributional shifts in the feature space, amplifying forgetting and degrading model performance. To address this problem, we propose a novel replay strategy, termed Dual-Margin Contrastive Replay (DMCR), to anchor the distribution of old tasks and reduce negative transfer effects. First, we select more representative samples for the memory, guided by centroids constructed from the data stream. Then, to keep the feature distribution from becoming disordered under biased replay, a two-level angular cross-task Contrastive Margin Loss (CML) is proposed to encourage intra-class and intra-task compactness and to increase inter-class and inter-task discrepancy. Finally, to further suppress the distributional drift, we present an optional Centroid Distillation Loss (CDL) on the replay memory to anchor the knowledge in feature space for each previous task. Extensive experimental results on five benchmark datasets validate that the proposed DMCR can effectively mitigate catastrophic forgetting and achieve state-of-the-art (SOTA) performance in OCL.

Abstract:
Video frame interpolation (VFI) is significant for generating a high-frame-rate coronary sequence without additional radiation exposure. Because coronary motion follows a reciprocating pattern alternating between systolic and diastolic phases, existing methods based on a linear motion assumption fail to capture the complex motion, especially during the transitions between the two phases. In contrast to these linear methods, a Non-linear Motion Estimation Network (NLME-Net) is proposed to effectively capture the periodic reciprocating motion pattern by accurately estimating both bidirectional flows and long-distance motion. Specifically, the specialized motion estimation decoder is guided not only by target frame reconstruction loss but also by direct supervision through a self-supervised flow loss. This enhanced modeling of reciprocating motion enables accurate intermediate flow estimation in scenarios involving variable directional movement, thereby improving the accuracy and robustness of frame interpolation. Additionally, the interpolation decoder fully exploits the inherent mutual dependency between intermediate flow and target frame features to refine the final interpolation result. In experiments comparing the proposed method with twelve state-of-the-art methods on a coronary dataset of 6486 groups of angiographic images from 399 sequences, the proposed method improves the PSNR by an average of 0.59 dB.
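
To put the PSNR gain in context, the sketch below computes PSNR from the mean squared error between a reference frame and an interpolated frame, using the standard formula 10 * log10(MAX^2 / MSE). Frames are represented as flat lists of pixel values purely for illustration; the function name and data are hypothetical and unrelated to the authors' code.

```python
import math


def psnr(reference, estimate, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Intensities normalized to [0, 1]; every pixel is off by 0.1, so MSE = 0.01.
reference = [0.0, 0.0, 0.0, 0.0]
estimate = [0.1, 0.1, 0.1, 0.1]

print(psnr(reference, estimate, max_val=1.0))
```

Since PSNR is logarithmic in MSE, a 0.59 dB gain at a fixed peak value corresponds to an MSE ratio of 10^(-0.059), i.e. roughly 13% lower mean squared error.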

Abstract:
This paper introduces a robot active task cognition framework for Situation-Aware Task Planning (SATP), leveraging visual scene understanding to generate action sequences. By integrating object knowledge, user preferences, and Large Language Models (LLMs), SATP interprets the robot’s current visual perception, and creates procedural actions that align with what the robot “sees”. Diverging from conventional methods requiring explicit verbal commands, our SATP framework autonomously performs task cognition, actively formulating robot-executable action sequences directly from visual input. Initially, a novel approach for describing the visual scene is presented, enabling the robot to grasp detailed object-level properties and inter-object relationships based on its observations. Building on this, a knowledge base for active task cognition is constructed using ontology technology. Furthermore, we develop a two-stage dual-feedback task planner, ReProg+, powered by LLMs, specifically designed for situation-aware task planning grounded in visual data. The efficacy, reliability, and advantages of our solution are thoroughly validated in real-world visual scenarios. Additionally, SATP has been tested with a real robot, with results confirming the feasibility and effectiveness of our approach.

Abstract:
Traditional few-shot action recognition (FSAR) aims to address the problem of the scarcity of action videos, enabling the recognition of action categories with just a few labeled samples. It is generally believed that the samples in the meta-training phase and the meta-testing phase are all drawn from the same domain. However, in practical applications, they often come from different domains, which may lead to significant differences in the distribution of spatiotemporal features. Researchers have started to study the problem of cross-domain few-shot action recognition (CDFSAR). The current solution is to train the model by combining source domain video and unlabeled target domain video to improve the model’s generalization ability. In this paper, we follow this paradigm but make a more refined use of the unlabeled target domain videos to better extract transferable features. First, we decouple the source and target domain videos along the temporal dimension and extract the domain-irrelevant features in both the source and target domains. Second, in each episode, we calculate the centroid of the domain-irrelevant features of the target domain and perform a momentum update on this feature centroid. We use Cross-Attention to align the domain-irrelevant features of the source domain toward this dynamic centroid. Finally, we use these aligned source domain features for few-shot classification. Experimental results demonstrate that our approach significantly improves few-shot classification performance across diverse domain shifts, validating the effectiveness of our refined use of unlabeled target video. Our code has been published at the URL: https://github.com/cofly2014/MCA-TRD.git
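
The momentum update of the target-domain feature centroid can be sketched as follows. The abstract does not give the exact rule, so this assumes the common exponential-moving-average form, new = m * old + (1 - m) * episode_mean; the momentum value, feature dimensions, and function names are all hypothetical, and the cross-attention alignment step is not shown.

```python
def mean_vector(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def momentum_update(centroid, episode_features, momentum=0.9):
    """Move the running centroid toward the current episode's mean feature.

    Assumed EMA form: new = momentum * old + (1 - momentum) * episode_mean.
    """
    episode_mean = mean_vector(episode_features)
    return [
        momentum * old + (1.0 - momentum) * new
        for old, new in zip(centroid, episode_mean)
    ]


# Hypothetical 2-D domain-irrelevant features from one episode's target videos.
centroid = [0.0, 0.0]
episode_features = [[2.0, 4.0], [4.0, 8.0]]  # mean is [3.0, 6.0]

centroid = momentum_update(centroid, episode_features, momentum=0.9)
print(centroid)
```

A high momentum keeps the centroid stable across episodes, so that source features are aligned toward a slowly moving target rather than a noisy per-episode estimate.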

Abstract:
Temporal action localization in long-term untrimmed videos remains a critical yet challenging task in video understanding, with existing methods often relying on anchor-based or fully-supervised frameworks that incur heavy computation and require labor-intensive frame-level annotations. This paper presents a novel weakly-supervised approach, CL-WTAL, which leverages multi-scale contrastive learning and graph convolution for accurate action localization and recognition. The method comprises three key components: 1) a multi-scale sliding window mechanism (long/normal/short sequences) to segment sub-actions from complex videos, adapting to diverse action durations; 2) a spatio-temporal graph convolution network (ST-RGCN) to extract skeletal feature vectors, integrating human motion dynamics and environmental context; 3) a contrastive learning-based similarity evaluation framework that combines cosine similarity and Dynamic Time Warping (DTW) distance to measure feature vector relationships, enabling precise action boundary detection without extensive fine-tuning. Experiments on daily-life video datasets demonstrate that CL-WTAL effectively localizes action intervals and classifies actions with high accuracy, outperforming state-of-the-art weakly-supervised methods.
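
The two similarity measures named in the third component can be illustrated with textbook implementations. How CL-WTAL weights or combines them is not stated in the abstract, so the sketch keeps them separate; the function names and toy sequences are hypothetical, and the DTW shown is the classic dynamic-programming version with a Euclidean frame cost and no window constraint.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j],      # seq_b[j-1] also matches an earlier frame of seq_a
                cost[i][j - 1],      # seq_a[i-1] also matches an earlier frame of seq_b
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# The same motion performed at two speeds: DTW absorbs the repeated frame.
slow = [(0.0,), (1.0,), (1.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]

print(dtw_distance(slow, fast))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

DTW tolerates differences in execution speed between windows of different lengths, while cosine similarity compares the direction of fixed-length feature vectors, which is one plausible reason to use the two together.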

Abstract:
For space missions such as deep space exploration and on-orbit operations, a high-precision 3D model of the target is a prerequisite for achieving autonomous navigation and precise manipulation. However, natural uncontrolled orbits impose strong geometry constraints and require long observation periods, while active orbital maneuvering accelerates data acquisition but increases fuel consumption and reduces mission endurance. This trade-off between maneuvering efficiency and observation completeness has become a bottleneck limiting spacecraft operations on-orbit. To address these challenges, this paper proposes a sensing-planning framework that integrates active observation with orbital maneuvering. First, a 3D reconstruction scheme based on 2D Gaussian splatting (2DGS) is designed, taking uncertainty into account. Next, the optimal observation views are estimated using Bayesian theory, followed by orbit selection combined with fuel consumption and observation time derived from orbital mechanics. Simultaneously, discrete point filtering is applied to improve the reconstruction quality of the 3D mesh in the space environment. Finally, the effectiveness of the proposed method is validated through simulations and experimental comparisons with state-of-the-art (SOTA) in a newly constructed multi-orbital observation space environment darkroom. Code and data are available at: https://github.com/YD-96/Active-2DGS and https://bhpan.buaa.edu.cn/link/AAA6508AF1B8714EF0B91A992489F2228F

Abstract:
Monocular 3D object detection offers significant potential for autonomous systems due to its inherent cost-effectiveness and scalability. While DETR-based architectures excel in 2D vision tasks, critical limitations persist in extending them effectively to monocular 3D detection, as evidenced in existing frameworks like MonoDETR and MonoDGP. These methods typically suffer from inefficient serial fusion of multimodal features and lack iterative refinement mechanisms, limiting their performance, especially for mid-to-long range targets. To overcome these shortcomings, we propose Iter3DDet, a novel depth-guided iterative refinement framework that integrates fine-grained feature fusion to significantly enhance detection performance. The core novelty of our approach lies in two key innovations: 1) A hybrid feature encoder combining MonoDGP’s region segmentation head with MonoDETR’s visual backbone, augmented by a multi-scale context attention module that dynamically aggregates structural and semantic cues across pyramid levels, eliminating heuristic fusion rules; 2) A depth-guided adaptive cross-modal decoder that iteratively fuses depth and context features through prioritized attention mechanisms, coupled with a novel iterative refinement training strategy that progressively refines 3D detection hypotheses, substantially improving accuracy across targets of varying difficulty levels. Extensive experiments on the KITTI, nuScenes, and Waymo benchmarks demonstrate Iter3DDet’s state-of-the-art performance, validating the effectiveness of our iterative refinement paradigm. The code will be open-sourced at https://github.com/PCwenyue

Abstract:
Few-shot action recognition seeks to recognize novel actions with limited labeled examples. While dual-modal approaches incorporating video and textual modalities offer enhanced semantic context, existing methods often rely on naive feature fusion strategies, failing to capture deep semantic correlations across modalities and limiting generalization. We propose DPCA-Net, a dual-modal metric learning framework that constructs a unified dual-prototype consistency alignment space. DPCA-Net explicitly models distributional, structural, and metric consistency across modalities to enhance prototype quality and similarity estimation. It integrates three core components: 1) Frame-wise Text-guided Modeling (FTM), which uses conditional prompt learning to embed video frame-level visual features into the textual space, achieving structural consistency; 2) Dual-Modal Metric Learning via dual-path Dynamic Time Warping (Dual-DTW), jointly aligning visual and cross-modal prototypes to ensure metric consistency; and 3) Distribution Consistency Mapping (DCM), which leverages Maximum Mean Discrepancy and cosine similarity to align support-query distributions and reinforce representation robustness. Extensive experiments on three benchmark datasets show that DPCA-Net consistently outperforms prior methods. It surpasses CLIP-FSAR by 1.3%–2.7%, achieving 89.7% (1-shot) on Kinetics and 99.12% (5-shot) on UCF-101. These results highlight the effectiveness of consistency-driven prototype alignment for robust and generalizable cross-modal few-shot action recognition.
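The temporal alignment idea behind the DTW-based metric above can be sketched in a few lines. This is a generic single-path Dynamic Time Warping over cosine distances, not the paper's Dual-DTW; the function names `cosine_distance_matrix` and `dtw_distance` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_distance(cost):
    """Classic DTW: minimum cumulative cost of a monotonic alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]

# Toy usage: 8 query frames against 6 prototype frames, 16-dim features.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 16))
prototype = rng.normal(size=(6, 16))
score = dtw_distance(cosine_distance_matrix(query, prototype))
```

A lower `score` means the two frame sequences align more cheaply; a few-shot classifier would typically assign the query to the class whose prototype gives the lowest score.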

Abstract:
Unnatural motion artifacts, such as implausible object dynamics or discontinuous scene transitions, are a critical challenge in AI-generated content (AIGC) videos. Assessing these temporal inconsistencies is essential for benchmarking the performance of text-to-video (T2V) models. Current assessment methods derive motion features from either action recognition models or optical flow estimators. However, these motion features cannot faithfully reflect, in a human-interpretable way, how objects or scenes transform between frames. For instance, a model might detect a “running” action but fail to penalize implausible leg movements. To address this gap, we propose Transformation Consistency-based Video Quality Assessment (TCVQA), a novel framework that quantifies transformation consistency by measuring the recognizability of semantic transformations across frames. The core module of TCVQA is the TC-branch, which includes three components: 1) a Feature Extractor that captures high-level, fine-grained, and low-level motion features; 2) a Flow-Driven Transformation module that warps extracted features from the source frame to the target frame using predicted optical flow; and 3) a Differential Perceiver that computes discrepancies between warped source features and actual target features, yielding a consistency score that reflects deviations from natural motion patterns. In addition, TCVQA integrates three other branches, the TV-branch, the V-branch, and the F-branch, to perceive multiple aspects of distortion. Extensive experiments on AIGC-VQA benchmarks, including T2VQA-DB, LGVQ, FETV, and MQT, demonstrate TCVQA’s superiority, achieving consistent improvement in correlation with human judgments over state-of-the-art methods. Our work establishes transformation consistency as a pivotal axis, enabling more reliable AIGC video quality assessment.
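The warp-then-compare idea in the TC-branch can be illustrated with a small NumPy sketch. It assumes a backward flow defined on the target grid, nearest-neighbor sampling, and a mean absolute difference as the discrepancy; all three are simplifications made here for illustration, and the names `backward_warp` and `warp_discrepancy` are hypothetical rather than taken from the paper.

```python
import numpy as np

def backward_warp(features, flow):
    """Sample source features at target-grid positions displaced by flow.

    features: (H, W, C) source feature map.
    flow:     (H, W, 2) per-target-pixel displacement (dx, dy) into the source.
    """
    h, w = features.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor sampling; out-of-range positions are clamped to the border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return features[src_y, src_x]

def warp_discrepancy(src_feat, tgt_feat, flow):
    """Mean absolute difference between warped source and actual target."""
    return float(np.mean(np.abs(backward_warp(src_feat, flow) - tgt_feat)))

# Toy usage: the target is the source shifted one pixel to the right.
rng = np.random.default_rng(0)
src = rng.random((4, 6, 3))
tgt = np.roll(src, 1, axis=1)
flow = np.zeros((4, 6, 2))
flow[..., 0] = -1.0  # each target pixel looks one pixel to the left
warped = backward_warp(src, flow)
```

When the flow explains the motion, the warped features match the target away from the image border and the discrepancy is small; motion that the flow cannot explain leaves a larger residual.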

Abstract:
Diffusion models (DMs) have been successfully applied to real image editing. These models typically invert images into latent noise vectors during the inversion process, and then edit them during the inference process. However, DMs often rely on the local linearization assumption, which assumes that the noise injected during the inversion process approximates the noise removed during the inference process. Although this assumption lets DMs generate images efficiently, it also causes errors to accumulate during the diffusion process, ultimately degrading the quality of real image reconstruction and editing. To address this issue, we propose ERDDCI (Exact Reversible Diffusion via Dual-Chain Inversion). ERDDCI uses the new Dual-Chain Inversion (DCI) for joint inference to derive an exact reversible diffusion process. Using DCI, our method avoids the cumbersome optimization process in existing inversion approaches and achieves high-quality image editing. Additionally, to accommodate image operations under high guidance scales, we introduce a dynamic control strategy that enables more refined image reconstruction and editing. Our experiments demonstrate that ERDDCI significantly outperforms state-of-the-art methods in a 50-step diffusion process. It achieves rapid and precise image reconstruction with SSIM of 0.999 and LPIPS of 0.001, and delivers competitive results in image editing. The source code is available at: https://github.com/daii-y/ERDDCI

Abstract:
Human animation strives to bring static characters to life. Existing methods produce high-quality outcomes for single-frame animation; however, they often fail to maintain satisfactory temporal consistency, especially in facial and hand movements. This limitation arises from commonly used motion modules that do not explicitly model inter-entity relationships. In this work, we introduce MoAnimate, a Motion-oriented Human Animation framework designed to improve inter-entity consistency. Specifically, we extract motion flows from driving videos and transfer them to align with the character’s shape. During initialization, we propose a motion-oriented latent refinement that optimizes low-frequency subbands to regulate the layout of visual objects along flow trajectories, while preserving random high-frequency subbands to accommodate appearance variations. During denoising, we further introduce a motion-oriented entity attention module to enable direct and efficient interaction among entities within a coordinated subspace. Extensive experiments demonstrate that our method significantly enhances temporal consistency, particularly the visual consistency of the entities.

Abstract:
In recent years, three-dimensional (3D) point clouds, which are applicable in various fields such as the metaverse and immersive communication, are attracting increasing attention. Under constrained storage and bandwidth conditions, efficient point cloud compression (PCC) plays a crucial role. To address these challenges, the Moving Picture Experts Group has been actively developing the geometry-based point cloud compression (G-PCC) standard and has recently proposed a test model for dynamic solid point clouds called Solid G-PCC. However, several issues still hinder the coding efficiency of attributes, such as inaccurate prediction, redundant coding bits, and accumulated distortion due to dependencies between frames. To tackle these challenges, we propose a layer-based rate-distortion optimized (RDO) attribute coding (L-RDOAC) method. This approach incorporates a layer-based RDO prediction (L-RDOP) to enhance prediction accuracy, a layer-based RDO quantization (L-RDOQ) to minimize redundant coding bits, and a layer-based RDO Wiener filter (L-RDOWF) to reduce distortion. Experimental results demonstrate that the coding efficiency of the proposed method significantly outperforms the state-of-the-art G-PCC reference software, as assessed through both objective and subjective evaluations. Specifically, compared to the state-of-the-art GeS-TM version 7.0, the proposed L-RDOAC achieves average Bjøntegaard-delta (BD) rates of -7.94%, -10.95%, and -8.16% for Luma, Cr, and Cb, respectively, under the C1 configuration (lossless geometry with lossy attributes), while under the C2 configuration (lossy geometry with lossy attributes), the average BD-rates are -7.88%, -8.09%, and -4.87%, respectively, when octree-based geometry coding is used.

Abstract:
This work investigates a Field of View (FoV) prediction-based Adaptive Bitrate Streaming (ABS) strategy aimed at enhancing the Quality of Experience (QoE) for users watching 360-degree videos on wireless mobile devices. This raises two challenges: 1) how to achieve accurate FoV prediction, and 2) how to provide a flexible bitrate adaptation service given the limited storage space of the video server. To address these issues, we propose a novel adaptive streaming strategy for 360-degree videos that integrates a Transformer-based FoV prediction model with on-demand video transcoding under dynamic wireless network conditions. To strike a balance between user QoE and transcoding overhead, we formulate a QoE-driven system utility maximization problem that jointly optimizes computing resource allocation and bitrate adaptation. The dynamic, multi-slot nature of the problem makes it inherently complex. To overcome this, we transform the original problem into a Markov decision process (MDP) and solve it using the residual learning and deep deterministic policy gradient (DDPG) method. Simulation results demonstrate that the proposed methods outperform the state-of-the-art baselines in terms of FoV prediction accuracy and utility improvement.

Abstract:
In today’s digital landscape, high-efficiency video coding (H.265/HEVC) has emerged as the most widely used video coding standard, employing selective encryption schemes to protect the privacy of video content while maintaining efficient compression performance. However, existing coefficient scrambling methods impose a significant computational load, leading to increased bit rate overhead due to encryption, longer execution times, and insufficient safety measures. To address these issues, a new coefficient scrambling scheme based on chaotic maps is proposed. This approach leverages the pseudorandomness, ergodicity, and sensitivity to initial conditions inherent in chaotic maps to generate highly unpredictable coefficient distributions, thereby strengthening security while preserving low complexity. Unlike conventional scrambling, chaotic maps ensure minimal correlation between encrypted coefficients, enhancing resistance against statistical and differential attacks. Additionally, the scrambling conditions are specifically designed to minimize the impact on the bit rate overhead. Furthermore, when combined with syntax element encryption (SEC), which includes motion vector difference (MVD), quantized transform coefficients (QTC), and luma intraprediction mode (Luma IPM), this method effectively distorts video content. The proposed scheme operates synchronously with slices, ensuring that the decryption of video content remains intact even if some slices are lost. Additionally, a random sequence generated by AES-CTR is incorporated with the H.265 encoded stream to protect against chosen-plaintext attacks. The experimental results indicate that this scheme features high security, compliance with format standards, fast execution times, synchronous updates with slices, and resilience against common attacks, all while achieving a reduced bit rate overhead of 45.13% with a lowered average execution time overhead of 1.91%.
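How a chaotic map can drive a reversible coefficient permutation is sketched below. The abstract does not name the map or the scrambling conditions, so the logistic map, the parameter `r=3.99`, the burn-in length, and the function names are all assumptions made for illustration; this is a toy, not the proposed scheme and not a vetted cipher.

```python
import numpy as np

def logistic_sequence(x0, n, r=3.99, burn_in=100):
    """Iterate the logistic map x <- r*x*(1-x); x0 in (0, 1) acts as the key."""
    x = x0
    for _ in range(burn_in):  # discard the transient so output depends strongly on x0
        x = r * x * (1.0 - x)
    out = np.empty(n)
    for i in range(n):
        x = r * x * (1.0 - x)
        out[i] = x
    return out

def scramble(coeffs, key):
    """Permute coefficients by the sort order of the chaotic sequence."""
    perm = np.argsort(logistic_sequence(key, len(coeffs)), kind="stable")
    return coeffs[perm]

def descramble(scrambled, key):
    """Regenerate the same permutation from the key and invert it."""
    perm = np.argsort(logistic_sequence(key, len(scrambled)), kind="stable")
    out = np.empty_like(scrambled)
    out[perm] = scrambled
    return out

# Toy usage on a hypothetical run of 16 quantized coefficients.
coeffs = np.arange(-8, 8)
hidden = scramble(coeffs, key=0.3141)
restored = descramble(hidden, key=0.3141)
```

Because the permutation only reorders values, the multiset of coefficients is unchanged, which is one reason scrambling of this kind can be designed to have a limited effect on bit rate.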

Abstract:
In recent years, with the rapid advancement of deep neural networks (DNNs), researchers have explored steganography techniques that use DNN models as carriers for secret information hiding. However, existing methods generally suffer from limited imperceptibility and fidelity. To address these limitations, this paper proposes a steganography method that achieves high imperceptibility and fidelity while providing substantial embedding capacity and robustness. Specifically, we introduce a dual-branch encoder that embeds secret information into the initialization parameters of the cover model with almost no degradation of the model’s functionality. In addition, a new SMSE loss is employed to constrain the encoder output, which enhances the imperceptibility of the stego model. After training and transmission, the receiver can utilize a decoder to accurately extract secret information from the stego model. Experimental results demonstrate that the proposed method achieves a Kullback–Leibler (KL) divergence more than an order of magnitude lower than existing methods, with values ranging from 0.0003 to 0.007. The stego model preserves high fidelity to the original model, with classification accuracy differences within 0.005 on benchmark datasets including MNIST, CIFAR-10, and SST-2. In terms of embedding capacity, it achieves 319,312 bits on ResNet-18 and 1,757,952 bits on ViT, which exceeds the performance of baseline methods across most models. Furthermore, the proposed method exhibits strong robustness, as the embedded information can still be accurately recovered with BCH coding even under noise attacks at an SNR as low as -6 dB. It also demonstrates strong generalization, as it performs effectively on both classification networks and generative or reconstruction models such as GANs, VAEs, and U-Nets.

Abstract:
Collaborative perception seeks to mitigate the limitations of single-vehicle perception, such as occlusions, by facilitating communication and information sharing among connected vehicles. However, most existing works assume a homogeneous scenario where all vehicles share identical sensor types and perception model architectures. In contrast, real-world systems often involve heterogeneous agents with diverse sensor configurations and independently developed models. In such settings, directly exchanging features without proper alignment can significantly degrade performance and hinder effective collaboration. While some methods have been proposed to address heterogeneity, they typically require retraining or access to internal model parameters, making them impractical for scalable deployment. To address these challenges, we propose DiffAlign, a plug-and-play adapter that enables feature alignment across heterogeneous agents in a training-free and model-agnostic manner. DiffAlign treats received BEV features as noisy latent representations and progressively refines them through a pretrained diffusion process. This alignment strategy does not require access to model internals or any retraining, which makes it both scalable and privacy-preserving while supporting diverse sensor modalities and perception backbones. Extensive experiments on simulated OPV2V and real-world V2V4Real datasets demonstrate that DiffAlign consistently improves detection performance in heterogeneous settings, improving CoBEVT by 132.01% and 91.95%, respectively. Our method provides a practical path toward scalable, generalizable, and deployment-ready collaborative perception.

Abstract:
Cross-view geo-localization (CVGL) aims to match images of the same location captured from different viewpoints, such as those captured by Uncrewed Aerial Vehicles (UAVs) and satellite platforms. The task is particularly challenging due to significant variations in scale, viewpoint, and illumination. Most existing methods employ a symmetric sampling strategy to construct drone–satellite image pairs for deep metric learning, but neglect the potential of incorporating multi-view drone images to enhance the viewpoint robustness of features. To address this, we propose leveraging multi-view images to learn Domain-Invariant Discriminative Embeddings (DIDE) for CVGL. DIDE introduces an Inter-view Feature Aggregation Module (IFAM), which dynamically integrates multi-view drone information into robust embeddings. These are used in contrastive learning with satellite embeddings within batches to learn view-invariant discriminative features, while representation learning further improves scene discrimination across batches. To reduce the domain gap, DIDE constructs and aligns drone and satellite prototypes for effective cross-domain feature alignment. Furthermore, we adopt a parameter-efficient transfer learning strategy that leverages the capabilities of pre-trained foundation models while fine-tuning only dual adapters, significantly reducing the trainable parameters. DIDE achieves state-of-the-art results on University-1652 and University-160k, competitive results on SUES-200, and demonstrates strong cross-dataset transferability, with fewer training parameters and lower computational cost.

Abstract:
RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning the precise boundaries and complete objects remains challenging due to the intrinsic insufficient feature fusion and the extrinsic limitations of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet) which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field. The code and results of our method are available at: https://github.com/milotic233/HyPSAM.

Abstract:
Hyperspectral anomaly detection (HAD) refers to the identification of anomalies that significantly differ from the background without relying on anomalous spectral priors. By utilizing effective deep learning networks, reconstruction-based methods have become one of the primary solutions for HAD tasks. However, most existing reconstruction-based methods fail to fully leverage the local affinity between a central pixel and its surrounding pixels. To address this issue, a novel Dual Visual Spectral Affinity Monitoring Network (Dual-Net) is proposed. Specifically, the dual network separately extracts the unique features of each pixel and its spectral similarity with surrounding pixels. The Spectral Affinity Monitoring Strategy (SAMS) uses this similarity as an affinity matrix to monitor the impact of anomalous pixels on the network reconstruction process. To further enhance the effectiveness of SAMS, a Visual Monitoring Focused Attention (VMFA) mechanism inspired by the human visual system is introduced. This mechanism effectively captures anomalous features at both the pixel and semantic levels while suppressing background information. Extensive experiments on nine public datasets validate the superiority and effectiveness of the Dual-Net.

Abstract:
In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at https://github.com/SunJ000/WDMamba
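The low-frequency intuition above can be checked on a toy case with a one-level 2D Haar decomposition. The sketch below treats haze very crudely as a uniform additive brightness offset, which is an assumption made here and not the paper's degradation prior; the function name `haar_dwt2` and the subband names are likewise illustrative.

```python
import numpy as np

def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of an (H, W) array, H and W even."""
    a = img[0::2, 0::2]  # top-left sample of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # low-frequency approximation
    col_diff = (a - b + c - d) / 2.0  # detail across columns
    row_diff = (a + b - c - d) / 2.0  # detail across rows
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, col_diff, row_diff, diag

# Toy usage: add a uniform offset as a crude stand-in for haze.
rng = np.random.default_rng(0)
clean = rng.random((8, 8))
hazy = clean + 0.3
clean_bands = haar_dwt2(clean)
hazy_bands = haar_dwt2(hazy)
```

In this toy case the offset cancels in all three detail subbands and appears only in the low-frequency one, where each coefficient rises by 0.6 (four samples times 0.3, divided by 2).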

Abstract:
Unsupervised person re-identification (Re-ID) requires learning semantic representation without identity labels. Existing methods entangle identity-related person features with camera-related background features, hindering discriminative feature learning. Also, these methods often disrupt the semantic structure of the person, weakening the semantic representation. In this paper, we propose the Semantic-Aware Disentanglement Representation Learning (SDRL) framework with diffusion models for unsupervised person Re-ID. Firstly, to enhance feature learning, we propose the Disentanglement Aggregation Model (DAM). This model disentangles identity-related features from camera-related features to generate multi-view features. Secondly, to promote the consistency of multi-view features, we design the multi-view similarity consistency (MSC) loss to constrain intra-camera and cross-camera similarity distributions. Thirdly, to generate semantically meaningful patches, we propose the Semantic Spatial Diffusion Model (SSDM). This model operates on identity-related features to perform the denoising diffusion process over spatial transformer parameters. Finally, to further enhance the semantic representation of generated patches, we design the Semantic Decoupled Contrastive (SDC) loss to perceive the inherent semantic structure. Numerous experiments on three demanding datasets prove that our approach is superior to the current unsupervised Re-ID approaches. The source code will be publicly available at https://github.com/taoxuefong/SDRL-reid

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful scene representation, offering geometrically dense and photometrically accurate modeling capabilities that present a promising new paradigm for accurate targetless sensor calibration. Current 3DGS-based LiDAR-camera calibration methods usually rely heavily on the joint optimization of the Gaussian model and extrinsics, and thus suffer from two critical limitations. On the one hand, the global optimization is usually sensitive to the accumulated localization error of LiDAR. On the other hand, inaccurate extrinsics may cause oscillations during the joint optimization. Specifically, there is a fundamental dilemma: accurate extrinsic calibration requires accurate scene models, while constructing accurate models in turn depends on accurate extrinsics. This dilemma frequently triggers oscillatory optimization trajectories, significantly increasing vulnerability to premature convergence at suboptimal states. To address these challenges, we propose HiGS-Calib, a novel 3DGS calibration pipeline featuring the integration of our proposed Local-Consistent Photometric-Geometric (LCPG) error model and the hierarchical architecture. The LCPG error leverages the spatial consistency within local windows to quantify pose misalignment using only geometric attributes of the 3DGS model, bypassing color-pose reliance constraints. Besides, diverging from joint optimization paradigms, HiGS-Calib implements coarse-to-fine iterative optimization, decoupling scene modeling from extrinsic refinement and thereby achieving stable and accurate calibration. Extensive evaluation demonstrates the significantly improved calibration accuracy and stability of our HiGS-Calib over other state-of-the-art methods. To make our results reproducible, the source code has been released at https://github.com/IRMVLab/HiGS-Calib

Abstract:
Various deep learning-based methods have greatly improved hyperspectral image (HSI) classification performance, but these models are sensitive to noisy training labels. Human annotation of remote sensing images inevitably introduces label noise, which degrades model prediction confidence. Understanding the spatial characteristics and distribution of such annotation errors is crucial for both diagnosing dataset annotation failures and guiding effective robust learning strategies. Current noisy label learning methods pay limited attention to visualizing noisy-label distributions, and these approaches often exhibit poor compatibility with noise-free models. Leveraging the relationship between prediction uncertainty and label noise, we propose a Local Bayesian Framework (LBF) for noisy HSI classification and noisy-label awareness. LBF adapts standard CNN, GCN, or Transformer backbones via local Bayesian adaptation (LBA) to evaluate prediction uncertainty and employs an uncertainty-monitoring optimization strategy (U-MOS) for training. Without major architectural changes, LBF delivers accurate uncertainty maps that highlight noisy regions, suppresses overfitting to corrupted labels, and consistently improves classification robustness across four benchmark HSI datasets.

Abstract:
The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model’s awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics. This research contributes to the advancement of self-awareness capabilities in multimodal large language models, particularly for low-level visual perception and understanding tasks.

Abstract:
Video prediction is a critical task in video processing and generation, with far-reaching implications for various downstream applications. However, existing methods often produce blurred predicted frames and fail to maintain structural continuity in objects. To address these challenges, we propose a Multi-Scale Hybrid Mamba Voxel Flow framework that employs a progressive refinement strategy in combination with adaptive feature extraction modules. The framework begins by generating coarse optical flow estimates and predicted frames, which are progressively refined at lower resolutions to enhance detail and ensure temporal coherence. Specifically, Mamba Blocks are designed to capture complex global motion patterns, while Spatial Aggregation Blocks aggregate spatial context across different scales. Simam Modules further enhance feature representation by selectively focusing on significant spatial regions. Additionally, multi-level residual connections and depthwise channel separations are incorporated to reduce computational complexity. Experimental results show that the proposed method significantly improves the clarity and spatial consistency of predicted frames, outperforming state-of-the-art techniques.

Abstract:
Dyadic social relationships, which refer to relationships between two individuals who know each other through repeated interactions (or not), are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: 1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance, 2) the disruption of continuous interactions by discrete frame sampling, which segments the temporal continuity of interaction in real-world scenarios, and 3) the failure to consider periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modelling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: https://github.com/tw-repository/AsyReC
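Projecting time indices onto sine/cosine waveforms, as the periodic temporal encoder above is described as doing, can be sketched as follows. The choice of periods and the function name `periodic_time_encoding` are assumptions for illustration; the paper's encoder may use learned or differently scaled frequencies.

```python
import numpy as np

def periodic_time_encoding(time_indices, periods):
    """Map each time index t to [sin(2*pi*t/p), ...] followed by [cos(2*pi*t/p), ...]."""
    t = np.asarray(time_indices, dtype=float)[:, None]  # shape (T, 1)
    p = np.asarray(periods, dtype=float)[None, :]       # shape (1, P)
    phase = 2.0 * np.pi * t / p
    return np.concatenate([np.sin(phase), np.cos(phase)], axis=1)  # shape (T, 2P)

# Toy usage: 32 clip indices encoded with three hypothetical periods.
encoding = periodic_time_encoding(np.arange(32), periods=[4, 8, 16])
```

Two time steps that are a whole number of periods apart receive the same value in the corresponding sine and cosine channels, which is what lets a downstream model pick up recurring behavior such as rhythmic gestures.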

Abstract:
Natural language tracking aims to locate the target in a video based on a language description. The rich contextual information inside the language and video sequence is essential to describe the target movements and appearance variations. However, existing natural language trackers design a fixed-length memory to store historical target information, which merely uses limited context information and necessitates manually designed modules, resulting in sub-optimal localization performance and high computational cost. Inspired by the success of the state space model, we propose a novel Context-adaptive Mamba Tracker (CMTrack). It offers several merits. First, we propose a novel context-aware state space model that enables language features to serve as hidden states to interact with relevant image features adaptively. Second, CMTrack transfers the hidden states frame-by-frame to continuously incorporate contextual target information into language features, enabling context-adaptive language cues. Third, the proposed context-adaptive language cues can effectively capture the long-range behavior of the target and guide the tracker in locating the target accurately without any extra design. Finally, CMTrack provides a neat pipeline for training and tracking with linear complexity. Experimental results demonstrate that CMTrack achieves new state-of-the-art performance.

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; Department of Computer Science and Technology, Shandong Jianzhu University, Jinan, China; College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China

Abstract:
Text-to-image diffusion models have demonstrated remarkable progress in synthesizing high-quality images from text prompts, which has spurred research on prompt-based image editing, i.e., editing a source image according to a target prompt. Despite their advances, existing methods still encounter three key issues: 1) limited capacity of the text prompt in guiding target image generation, 2) insufficient mining of word-to-patch and patch-to-patch relationships for grounding editing areas, and 3) uniform editing strength for all regions during each denoising step. To address these issues, we present a Vision-guided and Mask-enhanced Adaptive Editing (ViMAEdit) method with three key novel designs. First, we propose to leverage image embeddings as explicit guidance to enhance the conventional textual prompt-based denoising process, where a CLIP-based target image embedding estimation strategy is introduced. Second, we devise a self-attention-guided iterative editing area grounding strategy, which iteratively exploits patch-to-patch relationships conveyed by self-attention maps to refine the word-to-patch relationships contained in cross-attention maps. Last, we present a spatially adaptive variance-guided sampling, which highlights sampling variances for critical image regions to promote the editing capability. Experimental results demonstrate the superior editing capacity of ViMAEdit over all existing methods. Source code is available at https://github.com/Null-0000/ViMAEdit

Abstract:
The Versatile Video Coding (VVC) standard was introduced to meet the increasing demand for high-resolution and high-quality video. However, its novel partitioning method, Quad-Tree Plus Multi-Type Tree (QTMTT), significantly increases computational complexity and data dependency, resulting in more pipeline bubbles, lower hardware efficiency, and degraded throughput. To address this issue, we analyze the hardware data dependencies, categorize four different types of pipeline bubbles, and propose a bubble-removing strategy for the Rate Distortion Optimization (RDO) module with an MTT depth of 1. To be more specific, we first propose an efficient partition scheduling scheme based on the characteristics of QTMTT partitions. Then we redesign the transpose memory used in 2D transformation to efficiently handle the blocks of different sizes introduced by QTMTT. These two strategies achieve a 42.3% reduction in hardware cycles. In addition, for I frames, we further propose a hardware-oriented partition pruning algorithm that cooperates with the proposed architecture, achieving a 42.3%~77.9% reduction in hardware cycles with only 0%~1.21% BD-Rate loss compared to VTM-23.4. The proposed hardware architecture is implemented in GF 28nm technology, supporting up to 4K@40fps throughput at 500MHz with a hardware cost of only 3259 K gates and 63.47 KB on-chip memory, demonstrating competitive compression performance, high hardware efficiency, and outstanding throughput.

Abstract:
Vision-language models (VLMs), such as BLIP-2 and LLaVA, have significantly advanced multimodal understanding but exhibit critical vulnerabilities to visual adversarial perturbations. The high efficacy of untargeted attacks, in particular, poses significant concerns for their operational robustness. Conventional attack methods generate adversarial examples by backpropagating the language modeling loss from the final output to the input image. However, the deep architecture of the integrated large language models (LLMs) often diminishes this gradient flow, limiting the attack’s effectiveness in perturbing the visual domain. To address this limitation, we introduce a novel untargeted attack method based on Maximizing Information Entropy (MIE). Our approach enhances attack efficacy not only by maximizing the information entropy of the model’s final output but also by directly inducing uncertainty within its internal feature representations. This layer-wise perturbation strategy disrupts the model’s cognitive process more comprehensively than relying on the final output layer alone. We provide a theoretical analysis demonstrating that the uncertainty induced by MIE is greater than or equal to that of conventional output-only attacks. Comprehensive quantitative evaluations across multiple VLM architectures and datasets confirm that our method consistently outperforms existing techniques, thereby establishing a new, more rigorous benchmark for assessing adversarial robustness in vision-language models.
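
To make the quantity being maximized concrete, the sketch below computes the Shannon entropy of a softmax output distribution. It only illustrates the entropy objective on a toy logit vector. The helper names softmax and entropy are hypothetical, and the layer-wise perturbation of internal features described in the abstract is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    confident = softmax([10.0, 0.0, 0.0, 0.0])  # peaked prediction
    uncertain = softmax([0.0, 0.0, 0.0, 0.0])   # uniform prediction
    print(entropy(confident), entropy(uncertain))
```

An untargeted attack of this kind would push the model's output from the peaked case toward the uniform case, whose entropy is the maximum, log of the number of classes.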

Abstract:
Significant advancements in deep learning have been made possible by the utilization of large datasets, underscoring the critical importance of copyright protection. Adding meticulously designed perturbations to examples to make them unlearnable has become a crucial approach for safeguarding data copyright. Existing methods for creating unlearnable examples overlook the risk of data leakage, which can threaten data ownership. Thus, copyright protection in deep learning faces two main threats: illegal model training and malicious data leakage. We find that these two threats cannot be addressed by straightforwardly combining existing availability attacks and watermarking techniques, because the two interact negatively. Therefore, in this paper, we propose a novel copyright protection mechanism for the aforementioned security concerns. Considering that the prevention of unauthorized model training requires unlearnable perturbations with strong generalizability, we generate perturbations that induce the model to learn features uncorrelated with the input images. This is done by minimizing the mutual information between the input and output of the model. On the other hand, to eliminate the side effects of unlearnable perturbations on watermark extraction, we design a dual extraction strategy that uses two distinct watermark extractors. Extensive experiments on the image datasets ImageNet, CIFAR10, and Pets show that our proposed method can provide comprehensive copyright protection for images. The code is available at https://github.com/Yeah21/ReversibleUnlearnableExamples

Abstract:
Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of body parts in terms of motion amplitude, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in videos, resulting in more noticeable artifacts and distortions. Existing approaches typically address this issue by adding extra prior inputs, but this can limit the practical application of the task. Instead, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, along with motion masks and pose videos generated from the audio signal, to jointly generate synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio2Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, addressing limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This ensures high-quality, detailed upper-body videos with accurate textures and motion. Evaluations demonstrate improvements in video quality, lip-sync, and hand gestures. The model and code are available at https://github.com/SIA-IDE/MMGT

Abstract:
Light field (LF) imaging holds immense promise for applications such as post-capture refocusing and virtual reality. However, its inherent spatial-angular trade-off significantly limits both spatial and angular resolution, restricting its practicality in real-world scenarios. To address these limitations, spatial-angular super-resolution methods have been proposed to simultaneously enhance both dimensions. Yet, existing methods struggle to fully exploit the intertwined spatial-angular correlations and fail to effectively handle sparsely sampled LFs with low spatial resolution, often leading to cumulative errors during reconstruction. In this paper, we propose an Implicit and Detail-Enhanced Network (IDNet) to overcome these challenges. Our IDNet employs 3D convolution for the joint extraction of spatial and angular information, leveraging their interdependencies for more effective LF reconstruction. Additionally, we introduce an implicit detail restoration module that enhances features while encoding positional information to refine fine details. To overcome the limitations of sparse spatial and angular information on high-detail reconstruction and angular consistency in low-resolution LFs, we design a multi-representation enhancement block. This block enhances features by learning pixel differences across multiple directions in diverse representations, effectively capturing intricate details and complex correlations. Thanks to these designs, our IDNet reconstructs novel views with finer details, effectively learns occlusion relationships, and ensures geometric consistency. Experimental results on benchmark datasets demonstrate its superior quantitative and qualitative performance. The code is publicly available at https://github.com/ldyorchid/IDNet

Abstract:
In recent years, numerous hyperspectral image (HSI) reconstruction methods have been proposed to enhance the imaging quality of coded aperture snapshot spectral compressive imaging (CASSI) systems. Among these methods, self-supervised Deep Image Prior (DIP)-based approaches have gained attention for their ability to reconstruct three-dimensional (3D) HSIs without the need for external training data. However, DIP methods often suffer from overfitting to high-frequency noise during the optimization process, leading to artifacts and loss of fine details. To address these challenges, we propose a Mutual-Regularized Dual Deep Image Prior (DDIP) framework that employs implicit mutual regularization between two DIP networks. By encouraging mutual constraints, DDIP effectively mitigates high-frequency learning bias and suppresses noise amplification. Additionally, we employ a Half Quadratic Splitting (HQS) optimization strategy to ensure stable and efficient convergence, progressively integrating complementary information from the dual networks. We provide a comprehensive convergence analysis of the DDIP framework and establish theoretical conditions to guide the progressive fusion of the dual networks, ensuring robust and reliable reconstruction. Based on the insights from the convergence analysis, we introduce an Adaptive Deep Image Prior inner-loop strategy that dynamically adjusts the inner-loop updates, ensuring balanced learning of low- and high-frequency components. Moreover, a Residual Spectral-Spatial Feature Attention Network (SSFAN) is designed to enhance spectral-spatial feature extraction, further improving reconstruction accuracy. Extensive experiments on benchmark datasets demonstrate that DDIP achieves competitive HSI reconstruction quality compared to state-of-the-art unsupervised and self-supervised methods.

Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China; Department of Precision Instrument, Tsinghua University, Beijing, China; School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology, Jinan, China

Abstract:
Label Distribution Learning (LDL) is a novel machine learning paradigm that addresses the problem of label ambiguity and has found widespread applications. However, obtaining complete label distributions in real-world scenarios is challenging, which has led to the emergence of Incomplete Label Distribution Learning (InLDL). Existing InLDL methods attempt to utilize low-rank label correlations to recover the complete label distribution. However, we find that real-world LDL datasets have an imbalanced nature; that is, the sum of the description degrees for normal labels is significantly larger than that for tail labels, which disrupts the low-rank assumption underlying the recovery of the label distribution. To solve the above problem, we propose Incomplete and Imbalance Label Distribution Learning (I2LDL), which makes the use of low-rank label correlations more reasonable for InLDL. Our method decomposes the recovered label distribution matrix into a low-rank component for frequent labels and a sparse component for tail labels, effectively capturing the structure of both head and tail labels. We further require that the entries in the observed positions of the recovered label distribution matrix be close to the observed values, and that the recovered label distribution for every instance lies on the probability simplex (i.e., nonnegative entries summing to unity). Finally, the proposed model is optimized via the Alternating Direction Method of Multipliers (ADMM). We provide a theoretical analysis of its exact recovery guarantee under standard assumptions of incoherence, sparsity, and sufficient sampling. Furthermore, we establish a generalization error bound based on Rademacher complexity, offering theoretical insights into the learning performance of our method. Extensive experiments on 16 real-world datasets demonstrate the effectiveness and robustness of our framework compared to existing InLDL methods.
The code is available at https://anonymous.4open.science/r/IncomLDL-tailaware-C021
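
The simplex constraint mentioned above (nonnegative entries summing to unity) is commonly enforced with a Euclidean projection. The sketch below shows the standard sort-based projection of a vector onto the probability simplex. Whether I2LDL's ADMM solver uses this exact step is not stated in the abstract, so treat the function project_to_simplex as an illustrative assumption rather than the authors' update rule.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1}.

    Sort-based algorithm: find the threshold theta such that
    clipping (v - theta) at zero yields entries that sum to one.
    """
    if not v:
        raise ValueError("v must be non-empty")
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of k; the last hit sets theta.
        if u_k - candidate > 0.0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


if __name__ == "__main__":
    row = [0.7, 0.6, -0.1]  # a recovered row that violates the constraint
    print(project_to_simplex(row))
```

Applied row by row, this turns each recovered label vector into a valid label distribution while changing it as little as possible in the least-squares sense.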

Abstract:
The Variational Autoencoder (VAE) combines the ideas of autoencoders and variational inference, introducing a latent space that enables autoencoders to generate new images. VAE typically assumes that data follows a Gaussian distribution, but real data may follow other distributions. This inconsistency between the assumption and the true distribution can impair the modeling and reconstruction capabilities of VAE, making it difficult for traditional models to accurately capture the true distribution. To address these issues, we propose a Prior Distribution Guided Gaussian Mixture Variational Autoencoder (PDGM-VAE). Specifically, we construct a Gaussian Mixture Prior Learner (GMPL) to capture complex features of the data distribution, enabling the model to learn a Gaussian mixture distribution that is close to the real data distribution, which is then used as the prior distribution in the network. Furthermore, we build a Semantic-Aware Module with Embedded Prior Distribution (SAMEPD), integrating data and label information to learn the distribution parameters, enabling the network to learn and utilize the semantic knowledge contained in the labels. During training, by pulling the posterior distribution toward the prior distribution, we enhance the model’s modeling and reconstruction capabilities, improving the quality of generated images. We evaluated the image generation task on five public datasets, and based on the FID metric, our proposed method outperformed other VAE methods.

Abstract:
In the present environment where privacy protection is increasingly emphasized, source-free unsupervised domain adaptation (SFUDA) has garnered more attention than standard unsupervised domain adaptation (UDA). It concentrates on transferring knowledge directly from well-trained source models to unlabeled target domains without requiring access to source-domain data as UDA does, greatly enhancing data protection. Many existing methods employ pseudo-labeling to guide this process, but due to domain shift, pseudo-labels often introduce significant noise. Although there are methods to filter out this noise and mitigate its impact, they may also result in the loss of crucial sample knowledge, leading to performance deterioration. In contrast, we propose a novel approach called Progressive Curriculum Learning with Teacher-Student Collaboration (PCTSC) to mitigate the adverse influence of noisy labels in SFUDA. Inspired by curriculum learning, PCTSC assesses samples’ learning difficulty and trains models incrementally from easy to hard, thereby enhancing the model’s robustness to noise. Furthermore, PCTSC employs a two-stage learning approach: initially, a teacher model directs the student model, and later, the student model transitions to independent learning. We assess the effectiveness of PCTSC by conducting extensive experiments across three benchmark datasets, demonstrating its robustness against pseudo-label noise in the SFUDA setting.

Abstract:
Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, most existing approaches depend on high-resolution images and complex networks to achieve top performance, hindering their deployment in practical scenarios. Moreover, current multi-sensor fusion approaches mainly focus on improving feature fusion while largely neglecting effective supervision strategies for those features. To address these issues, we propose DAOcc, a novel multi-modal occupancy prediction framework that leverages 3D object detection supervision to assist in achieving superior performance, while using a deployment-friendly image backbone and practical input resolution. In addition, we introduce a BEV View Range Extension strategy to mitigate performance degradation caused by lower image resolution. Extensive experiments demonstrate that DAOcc achieves new state-of-the-art results on both the Occ3D-nuScenes and Occ3D-Waymo benchmarks, and outperforms previous state-of-the-art methods by a significant margin using only a ResNet-50 backbone and 256×704 input resolution. With TensorRT optimization, DAOcc reaches 104.9 FPS while maintaining 54.2 mIoU on an NVIDIA RTX 4090 GPU. Code is available at https://github.com/AlphaPlusTT/DAOcc

Abstract:
Human pose estimation (HPE) is an invaluable task in computer vision with various practical applications. This paper proposes a novel Hierarchical Contrastive Consistency constraint (HICCON) to improve HPE in both images and videos. It represents the input at multiple granularities in the spatial and temporal domains and enforces multi-level feature consistency by exploiting the characteristics of human structure and time sequence. The hierarchical contrast is conducted at four levels: keypoint-level, part-level, instance-level and clip-level. In the spatial domain, we consider keypoint-level and part-level consistency across instances within a frame to enhance fine-grained keypoint robustness. The former contrasts single keypoint features across instances to improve the category-specific keypoint features. The latter explores the specific pair-wise features for preserving the instructive relation. In the temporal domain, we develop instance-level and clip-level feature consistency across frames to capture more discriminative temporal representations. The former discriminates instance features across frames within the same video, whereas the clip-level constraint aims to discriminate consistent features from different videos in order to capture more distinctive temporal features. Extensive experiments on various architectures across datasets, i.e., PoseTrack2017, PoseTrack2018 and PoseTrack2021, show that HICCON achieves an improvement of about 1.5% over the baseline. Besides, the proposed method unleashes the potential of contrastive learning in the HPE field.

Abstract:
Single underwater images often suffer from severe quality degradation and a limited field of view due to the characteristics of underwater light propagation and the viewing range of camera equipment. To address these challenges, we propose an underwater scene clarity reconstruction framework called USCR, which comprises a multilayer information fusion (MIF) method for underwater image enhancement (UIE) and a self-organized stitching (SOS) method for image stitching. First, MIF corrects color distortion, enhances contrast, and highlights image details through a minimally attenuated channel guided color correction strategy and a gradient weight fusion strategy. Subsequently, SOS is applied to stitch the enhanced underwater images: it uses a homography matrix to initially stitch the image sequence, and then applies a pixel blending strategy based on boundary distance weighting to fuse the boundary pixels of the initial stitched image, aiming to ensure a homogeneous transition across the stitched region. Our reconstructed underwater scenes are characterized by visual clarity and a wide field of view. Extensive qualitative and quantitative experimental validations show that USCR outperforms state-of-the-art methods in the underwater visual reconstruction task.
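
The idea of blending by boundary distance weighting can be illustrated on a single row of an overlap region. In the sketch below each image's weight grows with the distance from its own edge, so the result fades smoothly from one image to the other. The linear weights and the function name blend_overlap are assumptions for illustration, since the abstract does not give USCR's exact weighting formula.

```python
def blend_overlap(left_row, right_row):
    """Blend two pixel rows that cover the same overlap region.

    Assumes the left image ends at the right edge of the overlap and the
    right image starts at its left edge. Each pixel is weighted by its
    distance to its own image's boundary (linear weights, hypothetical).
    """
    n = len(left_row)
    if n != len(right_row):
        raise ValueError("rows must have the same length")
    blended = []
    for i in range(n):
        dist_left = n - i    # distance from the left image's boundary
        dist_right = i + 1   # distance from the right image's boundary
        w_left = dist_left / (dist_left + dist_right)
        blended.append(w_left * left_row[i] + (1.0 - w_left) * right_row[i])
    return blended


if __name__ == "__main__":
    print(blend_overlap([10.0, 10.0, 10.0], [20.0, 20.0, 20.0]))
```

With a constant brightness offset between the two images, the blended row ramps linearly across the overlap instead of showing a visible seam.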

Abstract:
Image outpainting, a challenging generative task, has advanced significantly with the introduction of text-to-image diffusion models (DM). Despite these advances, DM-based methods frequently encounter the phenomenon in which one modality takes precedence over the other, causing the image to be over-guided. Current research relies on manually tuned hyperparameters to achieve bimodal balance. To reduce this reliance, Prompt Libra is proposed to automatically balance bimodal prompts during inference and enhance extrapolated images. Given the variation of bimodal cross-attention during DM denoising, we create an adaptive bimodal attention module via attention maps. Furthermore, we design a classifier-free guidance computation based on masked images to improve the semantic control of the masked part and enhance image quality. Finally, we propose a semantic transformer to address the problem of quality degradation caused by incomplete prompts. It extracts limited semantics from the source images, which makes it suitable for scenarios lacking text prompts. Experimental results demonstrate that our method achieves state-of-the-art results on several image quality evaluation metrics while keeping the image and text prompts in balance.

Abstract:
Ultra-High-Definition (UHD) image restoration has attracted remarkable attention due to its practical demand. In this paper, we construct UHD snow and rain benchmarks, named UHD-Snow and UHD-Rain, to remedy the deficiency in this field. UHD-Snow/UHD-Rain is established by simulating the physical process of snow/rain, and each benchmark contains 3200 degraded/clear image pairs of 4K resolution. Furthermore, we propose an effective UHD image restoration solution that incorporates gradient and normal priors into model design, motivated by these priors’ spatial and detail contributions. Specifically, our method contains two branches: (a) a feature fusion and reconstruction branch in high-resolution space and (b) a prior feature interaction branch in low-resolution space. The former learns high-resolution features and fuses prior-guided low-resolution features to reconstruct clear images, while the latter utilizes normal and gradient priors to mine useful spatial and detail features to better guide high-resolution recovery. To better utilize these priors, we introduce single prior feature interaction and dual prior feature interaction, where the former fuses the normal and gradient priors separately with high-resolution features to enhance the prior features, while the latter calculates the similarity between the enhanced prior features and further exploits dual guided filtering to boost the feature interaction of the dual priors. We conduct experiments on both new and existing public datasets and demonstrate the state-of-the-art performance of our method on UHD image low-light enhancement, dehazing, deblurring, desnowing, and deraining. The source codes and benchmarks are available at https://github.com/wlydlut/UHDDIP

Abstract:
Multi-modal methods based on camera and LiDAR sensors have garnered significant attention in the field of 3D detection. However, many prevalent works focus on single or partial stage fusion, leading to insufficient feature extraction and suboptimal performance. In this paper, we introduce a multi-stage cross-modal fusion 3D detection framework, termed CMF-IOU, to effectively address the challenge of aligning 3D spatial and 2D semantic information. Specifically, we first project the pixel information into 3D space via a depth completion network to obtain pseudo points, which unifies the representation of the LiDAR and camera information. Then, a bilateral cross-view enhancement 3D backbone is designed to encode LiDAR points and pseudo points. The first sparse-to-distant (S2D) branch utilizes an encoder-decoder structure to reinforce the representation of sparse LiDAR points. The second residual view consistency (ResVC) branch is proposed to mitigate the influence of inaccurate pseudo points via both 3D and 2D convolution processes. Subsequently, we introduce an iterative voxel-point aware fine-grained pooling module, which captures the spatial information from LiDAR points and textural information from pseudo points in the proposal refinement stage. To achieve more precise refinement during iteration, an intersection over union (IoU) joint prediction branch integrated with a novel proposal generation technique is designed to preserve the bounding boxes with both high IoU and classification scores. Extensive experiments show the superior performance of our method on the KITTI, nuScenes and Waymo datasets. The code is available at https://github.com/pami-zwning/CMF-IOU

Abstract:
Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to that of AVTrack while reducing model complexity and boosting average tracking speed by over 17%. Code is available at https://github.com/wuyou3474/AVTrack

Abstract:
Weakly supervised object detection has emerged as a cost-effective and promising solution in remote sensing, as it requires only image-level labels and alleviates the burden of labor-intensive instance-level annotations. Existing approaches tend to assign top-scoring proposals and their highly overlapping counterparts as positive samples, thereby overlooking the inherent gap between high classification confidence and precise localization, which in turn introduces the risk of part domination and instance missing. In order to address these concerns, this paper introduces an Instance-aware Label Assignment scheme for weakly supervised object detection in remote sensing images, termed ILA. Specifically, we propose a context-aware learning network that aims to prioritize regions fully covering the object over top-scoring yet incomplete candidates. This is empowered by the proposed context classification loss, which dynamically responds to the degree of object visibility, thereby driving the model toward representative proposals and mitigating the optimization dilemma caused by partial coverage. Additionally, an instance excavation module is implemented to reduce the risk of misclassifying object instances as negatives. At its core lies the proposed pseudo ground truth mining (PGM) algorithm, which constructs reliable pseudo boxes from the outputs of the basic multiple instance learning network to excavate potential object instances. Comprehensive evaluations on the challenging NWPU VHR-10.v2 and DIOR datasets underscore the efficacy of our approach, with achieved mean average precision (mAP) scores of 76.56% and 31.73%, respectively.

Abstract:
With the growing prevalence of screen content images in multimedia communication, efficient compression has become increasingly crucial. Unlike natural scene images, screen content typically contains rich text regions that exhibit unique characteristics and low correlation with surrounding non-text elements. The intricate mixture of text and non-text within images poses significant challenges for existing learned compression networks, as the text and non-text features are severely entangled in the latent domain along the channel dimension, leading to compromised reconstruction quality and suboptimal entropy estimation. In this paper, we propose a novel Disentangled Image Compression Architecture (DICA) that enhances the analysis module and the entropy model of existing compression architectures to address these limitations. First, we introduce a Disentangled Analysis Module (DAM) by augmenting original analysis modules with an additional text approximation branch and a disentangling network. They work in concert to disentangle latent features into text and non-text classes along the channel dimension, resulting in a more structured feature distribution that better aligns with compression requirements. Second, we propose a Disentangled Channel-Conditional Entropy Model (DCEM) that efficiently leverages the feature distribution bias introduced by DAM, thereby further improving compression performance. Experimental results demonstrate that the proposed DICA, along with DAM and DCEM, can be integrated into various channel-conditional compression backbones, significantly improving their performance in screen content compression, particularly in hard-to-compress text regions. When integrated with an advanced WACNN backbone, our method achieves a 13% overall BD-Rate gain and a 16% BD-Rate gain in text regions on the SIQAD dataset.

Abstract:
Existing image-text retrieval methods mainly rely on region and word features to measure cross-modal similarities. Thus, dense cross-modal semantic alignment, which matches regions and words, becomes crucial. However, this is non-trivial due to the heterogeneity gap, and the cross-modal attention used to achieve this alignment is inefficient. To solve this problem, we propose a novel framework that goes beyond the previous one-tower and two-tower frameworks to learn cross-modal consensus efficiently. Unlike existing methods, the proposed framework does not align regions and words directly. Instead, it uses semantic prototypes as a bridge: semantic decoders attend to specific contents with the same semantics in different modalities, through which cross-modal semantic alignment is naturally achieved. Furthermore, we design a novel plug-and-play self-correction method based on optimal transport to alleviate the drawbacks of incomplete pairwise labels in existing multimodal datasets. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority and generalization of our method.

Abstract:
Unsupervised visible-infrared person re-identification (USL-VI-ReID) has garnered widespread attention due to its surveillance application value in complex environments. However, it faces four key challenges: modality discrepancy, batch training limitations, pseudo-label noise, and camera view bias. This paper proposes the CMAG (Cross-Modal Attention and Graph-enhanced Memory) framework, which innovatively combines a circular topology structure with cross-modal attention mechanisms to address these challenges. CMAG introduces four core innovations: (1) applying a circular topology structure to provide pseudo-label verification through detecting circular paths in feature space, effectively addressing the pseudo-label noise problem; (2) designing a cross-modal attention mechanism for Vision Transformers with residual fusion to balance modality-specific and shared information, solving the modality discrepancy issue; (3) constructing a graph-structured memory enhancement module with adaptive graph construction and multi-layer feature propagation to overcome batch training limitations; and (4) integrating camera-specific clustering with circular structure constraints to reduce camera background bias. Extensive experiments on SYSU-MM01 and RegDB datasets demonstrate the effectiveness of CMAG, achieving approximately 3.5% improvement in Rank-1 accuracy and 2.8% in mAP on average compared to state-of-the-art methods, validating our approach’s advantages in addressing key challenges in unsupervised cross-modal person re-identification. Code is available at https://github.com/hurryup186/CMAG

Abstract:
Incremental learning aims to continuously acquire new knowledge from data streams while maintaining previously learned knowledge. Existing incremental learning methods typically assume that the training (source domain) and testing (target domain) data are identically distributed. However, differences in sensor parameters and imaging conditions inevitably lead to distribution gaps between data collected from different satellites (domains). The ensuing domain shift problem substantially impairs the generalization of continuously learned knowledge from source domains to unseen ones. To tackle this problem, we propose adaptive mixture-of-experts distillation (AMoED) for cross-satellite generalizable incremental remote sensing scene classification (CS-GIRSSC). Specifically, AMoED adopts a high-level semantic learning pipeline, in which new knowledge is acquired through the coordinated guidance of multiple domain-specific experts, rather than directly from raw data. This pipeline prevents the model from being exposed to large volumes of newly emerging data, thereby alleviating the erasure of previous knowledge when adapting to new data distributions. Besides, the adaptive mixture of domain-specific experts facilitates the formation of universal class concepts, which exhibit strong generalizability across different domains. During the learning process, an equi-partite subset is constructed for knowledge acquisition and consolidation, accompanied by a shallow style-mixing operation to mitigate the interference of domain discrepancies. Extensive experiments are conducted on four remote sensing scene classification datasets, and the proposed method consistently achieves state-of-the-art performance across various scenarios and settings. The code is released at https://github.com/fuyimin96/AMoED

Abstract:
Local smoothness is a widely used prior in hyperspectral image (HSI) denoising tasks. Existing work mainly realizes this prior through total variation (TV) regularization. However, TV regularization applies a uniform penalty to each entry in the image and therefore cannot effectively balance noise removal and texture preservation. To address this problem: 1) We propose a novel regularization for HSI denoising called Spatial-Spectral Texture-Preserved Total Variation (SSTPTV). This regularization naturally expresses the physical phenomenon of the difference in sparsity between textured regions and smooth regions. Specifically, the regularization relaxes the sparsity penalty in textured regions by a weight learning strategy and sparsity measurement method for gradient maps, thereby preserving the spatial-spectral textures of HSIs. 2) An HSI denoising model based on the SSTPTV regularization constraint is given. We propose a tensor alternating subspace representation method that can capture the overall spatial texture features across all bands and the overall spectral texture features across all spectral curves. By applying the SSTPTV regularization constraint to these subspaces, the spatial-spectral texture structures are effectively preserved. An efficient ADMM-based algorithm for solving the model is designed. Simulated and real noise removal experiments on HSIs show that the proposed method offers significant advantages and can serve as a framework to optimize other TV-based denoising methods. The code is available at https://github.com/zth-code/SSTPTV
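
To make the uniform-penalty point above concrete, here is a minimal sketch of anisotropic TV on a single band, next to a weighted variant in which smaller weights relax the penalty at chosen pixels. The weights below are hand-set placeholders; they stand in for, and do not reproduce, the learned weights of SSTPTV. The names `gradients` and `total_variation` are assumptions made for this illustration.

```python
def gradients(img):
    """Forward differences along columns (gx) and rows (gy); zero at the far border."""
    h, w = len(img), len(img[0])
    gx = [[img[i][j + 1] - img[i][j] if j + 1 < w else 0.0 for j in range(w)]
          for i in range(h)]
    gy = [[img[i + 1][j] - img[i][j] if i + 1 < h else 0.0 for j in range(w)]
          for i in range(h)]
    return gx, gy

def total_variation(img, weights=None):
    """Anisotropic TV: sum of |gx| + |gy|, optionally scaled by per-pixel weights."""
    gx, gy = gradients(img)
    total = 0.0
    for i in range(len(img)):
        for j in range(len(img[0])):
            wt = 1.0 if weights is None else weights[i][j]
            total += wt * (abs(gx[i][j]) + abs(gy[i][j]))
    return total

# A vertical edge between the second and third columns.
band = [[0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
# Hypothetical weights: halve the penalty where the edge gradient is located.
weights = [[1.0, 0.5, 1.0] for _ in range(3)]

uniform_tv = total_variation(band)
weighted_tv = total_variation(band, weights)
```

With uniform weights the edge is penalized as strongly as noise of the same magnitude would be; lowering the weight at edge pixels reduces that cost, which is the intuition behind texture-preserving variants of TV.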

Abstract:
With the prevalence of high dynamic range (HDR) imaging, tone mapping techniques, which convert HDR images to high-quality standard dynamic range (SDR) images for display, have become increasingly important. However, obtaining paired HDR and high-quality SDR images is almost impossible, posing challenges to learning-based tone mapping methods. To address this issue, we propose a zero-shot tone mapping framework that requires no HDR training samples. Our approach decomposes images into two components: structural information and tonal information. A diffusion-based mapping model taking the structural information as input is first trained in the high-quality SDR domain, then transferred to the HDR domain, which has less readily available training data, for inference, leveraging the equivalent distribution of the structural information across both domains. To preserve the original image’s structure, we modify the reverse sampling process and explicitly incorporate the original structural information into the intermediate results. To improve the image details, we introduce a dual-control network, enabling different conditional inputs to control different scales of the output. Additionally, we devise a flexible tone adjustment strategy, with a set of novel loss functions that modify the trained score function dynamically during reverse sampling, allowing users to customize the style of the generated image according to their preference during testing. Initially designed for tone mapping, our model can be applied to various tasks including image fusion, exposure correction, dehazing, etc., without retraining. Experimental results demonstrate that our approach surpasses previous state-of-the-art methods, indicating that it can serve as an effective, flexible and versatile solution to various tone-mapping tasks. Source code is available at https://github.com/ZSDM-HDR/Zero-Shot-Diffusion-HDR

Abstract:
Fine-grained bird image classification (FBIC) for distinguishing bird subspecies is challenging because of several issues, including camouflaged appearance, body occlusion, and arbitrary bird posture. To address these challenges, we propose a novel heterogeneous plumage cues-aware texton correlation representation for FBIC, which leverages texton correlation in various functional plumage regions for effective learning. Two key findings are revealed: 1) texton structural discrepancies of heterogeneous plumage; and 2) abstract region information for specific birds. On this basis, the model introduces a texton coherence extraction module (TCEM) and abstract representation selection (ARS). Specifically, considering bird characteristics, TCEM is introduced to exploit the spatial statistical properties of local textons in heterogeneous plumage. To the best of our knowledge, this study is the first to introduce heterogeneous plumage cues for mining texton correlation representations in FBIC tasks. In addition, a Multiscale Information Cross-Attention Transformer (MICAformer) is proposed to better model texton correlation representations. The experimental results on the CUB-200-2011 and NABirds datasets show the effectiveness of the proposed HPCTrans model over the state-of-the-art methods.

Affiliations: College of Electronics and Information Engineering, Tongji University, Shanghai, China; Department of Control Science and Engineering, Harbin Institute of Technology, Harbin, Heilongjiang, China; Department of Computer Science and Technology, Tongji University, Shanghai, China; College of Electronics and Information Engineering, the School of Computer Science and Technology, and the Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University, Shanghai, China; Multimedia Technology and Telecom Department, Telecommunications Center, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow, Russia; College of Electronics and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems, the State Key Laboratory of Autonomous Intelligent Unmanned Systems, and the Frontiers Science Center for Intelligent Autonomous Systems (Ministry of Education), Tongji University, Shanghai, China

Abstract:
Unsupervised stereo matching has garnered significant attention for its independence from costly disparity annotations. Typical unsupervised methods rely on the multi-view consistency assumption for training networks, which suffer considerably from stereo matching ambiguities, such as repetitive patterns and texture-less regions. A feasible solution lies in transferring 3D geometric knowledge from a relative depth map to the stereo matching networks. However, existing knowledge transfer methods learn depth ranking information from randomly built sparse correspondences, which makes inefficient utilization of 3D geometric knowledge and introduces noise from mistaken disparity estimates. This work proposes a novel unsupervised learning framework to address these challenges, which comprises a plug-and-play disparity confidence estimation algorithm and two depth prior-guided loss functions. Specifically, the local coherence consistency between neighboring disparities and their corresponding relative depths is first checked to obtain disparity confidence. Afterwards, quasi-dense correspondences are built using only confident disparity estimates to facilitate efficient depth ranking learning. Finally, a dual disparity smoothness loss is proposed to boost stereo matching performance at disparity discontinuities. Experimental results demonstrate that our method achieves state-of-the-art stereo matching accuracy on the KITTI Stereo benchmarks among all unsupervised stereo matching methods.

Affiliations: State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Institute of Physical Science and Information Technology, Anhui University, Hefei, China; Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence, Anhui University, Hefei, China; Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, and the School of Computer Science and Technology, Anhui University, Hefei, China

Abstract:
Low-Rank Adaptation (LoRA) is a widely used technique in Parameter-Efficient Transfer Learning (PETL) that uses a limited number of trainable parameters to adapt a model to various downstream tasks. However, the locations and low-rank sizes in traditional LoRA rely heavily on fixed, empirical values, which may hinder adaptability and lead to sharply decreasing performance, especially on some self-supervised pre-trained models. To alleviate this problem, we introduce a feature responsive LoRA (ResLoRA) method, a resource-efficient algorithm that automatically determines the LoRA modules’ required size based on the downstream task’s response. First, we propose a Feature Decomposition loss (FD-loss) which leverages the feature singular values to mine the corresponding features of different downstream tasks, enabling the model parameters to adequately represent downstream tasks. Subsequently, we use a Taylor expansion to measure the salience of the model parameters, and the highly efficient parameters with high significance are then used to design a dynamically responsive LoRA. Specifically, the location and low-rank sizes of LoRA are determined based on the response parameters of the features for downstream tasks. Extensive experiments show that our ResLoRA achieves state-of-the-art performance, especially in the transfer capability of self-supervised models based on MoCo v3 and MAE. Our code is available at: https://github.com/wildboarman/ResLoRA
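
As background for the abstract above, the following sketch shows the standard LoRA forward pass with a fixed rank: a frozen weight matrix W is augmented by a low-rank product B·A scaled by alpha / r. It illustrates plain LoRA only, not the rank or location selection of ResLoRA; the helper names and toy matrices are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(w, a, b, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(a)                       # rank = number of rows of A
    base = matvec(w, x)              # frozen pre-trained path
    update = matvec(b, matvec(a, x)) # low-rank adaptation path
    scale = alpha / r
    return [y + scale * u for y, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

w = [[1.0, 0.0],
     [0.0, 1.0]]   # frozen 2 x 2 weight
a = [[1.0, 1.0]]   # A: rank 1, shape 1 x 2
b = [[2.0],
     [0.0]]        # B: shape 2 x 1
y = lora_forward(w, a, b, [3.0, 4.0])
```

For a 768 x 768 layer, rank 8 gives `trainable_params(768, 768, 8)` = 12288 trainable values against 589824 in the full matrix, which is why the choice of rank and of which layers receive a LoRA module matters for the accuracy-cost balance.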

Abstract:
Reversible data hiding (RDH) in shared images is an effective technique for securely storing and managing confidential images. However, most existing methods suffer from noticeable data expansion and cannot achieve a good trade-off between data expansion and embedding rate. To address this issue, we propose a novel RDH in shared images (RDHSI) using overlapped coefficients in polynomials. In the proposed method, an original image is compressed losslessly to reduce its size before sharing, and then the compressed image is divided into a series of groups, where any two adjacent groups have an overlapped part. Next, each group is shared by our proposed (k, n)-threshold sharing technique, which uses polynomials over the Galois field GF(2^8) with overlapped coefficients. Finally, data embedding is performed on each shared image by bit replacement according to a constructed 0-1 matrix. Experimental results demonstrate that the proposed method can effectively reduce the sizes of the shared images and achieve a high embedding capacity.
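
As background, the sketch below shows classic Shamir-style (k, n)-threshold sharing of a single byte over GF(2^8), where the secret is the constant term of a random polynomial of degree k - 1. It does not implement the overlapped-coefficient scheme of the abstract above. The reduction polynomial 0x11B and all function names are assumptions chosen for illustration.

```python
import secrets

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B, an assumed choice)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Inverse of a nonzero element as a^254, since a^255 = 1."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def eval_poly(coeffs, x):
    """Horner evaluation; coeffs[0] is the constant term."""
    y = 0
    for c in reversed(coeffs):
        y = gf_mul(y, x) ^ c
    return y

def share_byte(secret, k, n):
    """Split one byte into n shares (n <= 255); any k of them recover it."""
    coeffs = [secret] + [secrets.randbelow(256) for _ in range(k - 1)]
    return [(x, eval_poly(coeffs, x)) for x in range(1, n + 1)]

def recover_byte(shares):
    """Lagrange interpolation at x = 0; subtraction is XOR in GF(2^8)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = gf_mul(num, xj)
                den = gf_mul(den, xi ^ xj)
        secret ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return secret

shares = share_byte(0xA7, k=3, n=5)
recovered = recover_byte(shares[:3])
```

In this basic form every shared image is as large as the original, because each polynomial carries only one secret byte. Placing image data in several coefficients of each polynomial is one way to reduce that expansion.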

Abstract:
Event cameras are renowned for their high efficiency because they output a sparse, asynchronous stream of events. However, they are plagued by noisy events, especially in low light conditions. Denoising is an essential task for event cameras, but evaluating denoising performance is challenging. Label-dependent denoising metrics involve artificially adding noise to clean sequences, complicating evaluations. Moreover, the majority of these metrics are monotonic, which can inflate scores by removing substantial noise and valid events. To overcome these limitations, we propose the first label-free and non-monotonic evaluation metric, the area of the continuous contrast curve (AOCC), which utilizes the area enclosed by event frame contrast curves across different time intervals. This metric is inspired by how events capture the edge contours of scenes or objects with high temporal resolution. An effective denoising method removes noise without eliminating these edge-contour events, thus preserving the contrast of event frames. Consequently, contrast across various time ranges serves as a metric to assess denoising effectiveness. As the time interval lengthens, the curve will initially rise and then fall. The proposed metric is validated through both theoretical and experimental evidence. The code is available at https://github.com/shicy17/AOCC

Abstract:
Recent advancements in 3D reconstruction technologies have paved the way for high-quality and real-time rendering of complex 3D scenes. Despite these achievements, a notable challenge persists: it is difficult to precisely reconstruct specific objects from large scenes. Current scene reconstruction techniques frequently result in the loss of object detail textures and are unable to reconstruct object portions that are occluded or unseen in views. To address this challenge, we delve into the meticulous 3D reconstruction of specific objects within large scenes and propose a framework termed OMEGAS: Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. Specifically, we propose a novel 3D target segmentation technique based on 2D Gaussian Splatting, which segments 3D consistent target masks in multi-view scene images and generates a preliminary target model. Moreover, to reconstruct the unseen portions of the target, we propose a novel target replenishment technique driven by large-scale generative diffusion priors. We demonstrate that our method can accurately reconstruct specific targets from large scenes, both quantitatively and qualitatively. Our experiments show that OMEGAS significantly outperforms existing reconstruction methods across various scenarios.

Abstract:
Accurate multi-view 3D object detection is essential for applications such as autonomous driving. Researchers have consistently aimed to leverage LiDAR’s precise spatial information to enhance camera-based detectors through methods like depth supervision and bird’s-eye-view (BEV) feature distillation. However, existing approaches often face challenges due to the inherent differences between LiDAR and camera data representations. In this paper, we introduce TiGDistill-BEV, a novel approach that effectively bridges this gap by leveraging the strengths of both sensors. Our method distills knowledge from a teacher model built on other modalities (e.g., LiDAR) to a camera-based student detector, using the Target Inner-Geometry learning scheme to enhance camera-based BEV detectors through both depth and BEV features. Specifically, we propose two key modules: an inner-depth supervision module that learns the low-level relative depth relations within objects, equipping detectors with a deeper understanding of object-level spatial structures, and an inner-feature BEV distillation module that transfers high-level semantics of different keypoints within foreground targets. To further alleviate the domain gap, we incorporate both inter-channel and inter-keypoint distillation to model feature similarity. Extensive experiments on the nuScenes benchmark demonstrate that TiGDistill-BEV significantly boosts camera-only detectors, achieving state-of-the-art performance with 62.8% NDS and surpassing previous methods by a significant margin. The code is available at: https://github.com/Public-BOTs/TiGDistill-BEV.git

Abstract:
Existing depth-based 3D hand pose estimation methods typically estimate hand joints from either 2D depth images or 3D point clouds, whereas the approaches that fuse multimodal data remain underexplored. Furthermore, previous methods often struggle to learn geometric-facilitated features and precise joint correlations, especially for occluded hands, due to the lack of explicit prior guidance and insufficient cross-dimensional interaction. By taking advantage of multi-modal fusion, cross-dimensional interaction, and prior guidance, we propose a novel joint-guided keypoint denoising Transformer (named HandJoKe) to achieve more precise hand pose estimation, which can iteratively estimate hand poses based on keypoint features from both 2D depth images and 3D point clouds under explicit joint guidance within only several denoising steps. Rather than directly applying existing multi-modal fusion to perform redundant interactions among many background pixels and irrelevant points, HandJoKe focuses on modeling correlations and capturing dependencies among local informative hand regions (i.e., keypoints), thus attaining higher learning capability with lower computation redundancy. Moreover, a novel joint-guided denoising estimation strategy is introduced to adequately fuse cross-modal keypoint features under explicit joint guidance, achieving geometric-facilitated cross-modal keypoint interaction in both 2D and 3D spaces. The effectiveness of joint guidance can be further strengthened through iterative denoising, since it can subsequently update cross-modal keypoint features based on previous denoised hand poses and thus can help better locate confused joints, especially for occluded hands. Extensive experiments show that HandJoKe has achieved state-of-the-art performance on four public challenging benchmarks, including single-hand datasets NYU and ICVL, and hand-object datasets DexYCB and HO3D.

Abstract:
Multi-object tracking (MOT) in water surface scenes is crucial for the autonomous navigation of Uncrewed Surface Vehicles (USVs). However, existing MOT datasets rarely focus on these scenes. Moreover, the few available water surface MOT datasets contain limited data shot onboard and concentrate narrowly on specific marine scenes, creating a significant gap from real-world USV navigation applications. To promote research on USV autonomous navigation, we introduce USVTrack, a fully onboard-shot MOT benchmark that covers diverse and complex water surface scenes, characterized by a high proportion of small objects and varied backgrounds. We then propose an innovative end-to-end method specifically designed for MOT in complex water surface scenes, termed USVMOT. It improves tracking performance through four key contributions: 1) integrating mask information via knowledge distillation to boost feature discriminability; 2) deploying task-specific auxiliary pathways to alleviate the competition between detection and re-identification (ReID) in end-to-end MOT methods; 3) employing an adaptive high-quality mask generation strategy based on the Segment Anything Model (SAM) that obviates extensive manual annotation; and 4) introducing an object-aware association method that dynamically tailors the tracking strategy according to object size and motion speed. Extensive experiments on the USVTrack benchmark demonstrate that USVMOT outperforms existing methods. Our analysis reveals that MOT in complex water surface scenes remains challenging, highlighting the need for further advancements.

Abstract:
Adaptive bitrate (ABR) streaming is a popular technique for improving the quality of experience (QoE) of users who watch videos online; for example, it can provide smoother video playback by dynamically adjusting the requested video quality and its associated bitrate according to constrained yet diverse network conditions. Recently, learning-based ABR algorithms have achieved a notable performance gain with lower inference overhead than the conventional heuristic or model-based baselines. However, their performance may degrade significantly in an unseen network environment with time-varying and heterogeneous throughput dynamics. For better generalization, in this paper, we propose a meta-reinforcement learning (meta-RL)-based neural ABR algorithm that is able to quickly adapt its policy to these unseen throughput dynamics. Specifically, we propose a model-free system framework comprising an inference network and a policy network. The inference network infers the distribution of the latent representation for the underlying dynamics based on the recent throughput context, while the policy network is trained to quickly adapt to the changing throughput dynamics with the sampled latent representation. To effectively learn the inference network and meta-policy on mixed dynamics of the practical ABR scenarios, we further design a variational information bottleneck theory-based loss function for training the inference and policy networks, whose objective is to strike a trade-off between brevity of the latent representation and expressiveness of the meta-policy. We also derive a theoretically necessary condition for the bitrate versions that yield higher long-term QoE, based on which a dynamic action pruning strategy is further developed for practical implementation. This pruning strategy can not only prevent unsafe policy outputs in the midst of unseen throughput dynamics, but may also reduce the computational complexity of model-based ABR algorithms. Finally, the meta-training and meta-adaptation procedures of our proposed algorithm are implemented across a range of throughput dynamics. The empirical evaluations on various datasets containing real-world network traces verify that our algorithm surpasses the state-of-the-art ABR algorithms, particularly in terms of the average chunk QoE and fast adaptation across out-of-distribution throughput traces.

Abstract:
Existing image steganography schemes always introduce obvious modification traces to the cover image, resulting in the risk of secret information leakage. To address this issue, an end-to-end framework for joint makeup style transfer and image steganography is proposed in this paper to achieve imperceptible higher-capacity data hiding. In the scheme, a Parsing-guided Semantic Feature Alignment (PSFA) module is designed to transfer the style of a makeup image to an object non-makeup image, thereby generating a content-style integrated feature matrix. Meanwhile, a Multi-Scale Feature Fusion and Data Embedding (MFFDE) module is devised to encode the secret image into its latent features and fuse them with the generated content-style integrated feature matrix, as well as the non-makeup image features across multiple scales, to produce the makeup-stego image. As a result, the style of the makeup image is well transferred and the secret image is imperceptibly embedded simultaneously without directly modifying the pixels of the original non-makeup image. Additionally, a Residual-aware Information Compensation Network (RICN) is developed to compensate for the loss of the secret image arising from the multilevel data embedding, thereby further enhancing the quality of the reconstructed secret image. Experimental results show that the proposed scheme achieves superior steganalysis resistance capability and visual quality in both makeup-stego images and recovered secret images, compared with other state-of-the-art schemes.

Abstract:
NeRF-like methods learn implicit 3D neural representations from 2D multiview images, enabling the synthesis of compelling novel views. However, to capture high-fidelity geometry, prior methods often rely on large-scale networks. This dependency hampers the potential applications of neural implicit representations, such as MR visualization. To address this, we introduce LODNeuS, an implicit surface representation based on feature voxel grids. LODNeuS captures multiple levels of detail (LODs) of implicit geometry by maintaining voxel grids paired with a set of corresponding lightweight decoders. This allows for high-quality rendering with the ability to dynamically switch between detail levels. Another challenge is that existing methods, both volumetric and surface-based, tend to train and render their representations within a confined space without explicitly restricting the sampling points. This lack of constraints can result in ambiguity, artifacts, and inefficient use of computational resources. We study this effect during free viewpoint rendering using conventional methods and develop an adaptive sampling scheme that emphasizes a valid geometric space for sampling point allocation. Our experimental results show that LODNeuS can match the visual quality of existing methods while offering flexible and lightweight inference. The benefits of adaptive sampling are also demonstrated in the free viewpoint rendering subsection. Our work extends the capabilities of neural implicit representations beyond previously defined limitations, broadening the scope of potential applications.

Abstract:
Reflections pose a significant challenge to the quality of light field images. Existing reflection removal methods typically treat the task as pixel classification on 2D images, overlooking the fundamental principle that reflections arise from the 3D spatial superposition of background and reflection spaces. In this work, we propose a hierarchical interactive Multi-Plane Image (MPI) construction approach, which separates the mixed 3D space containing reflections and constructs two independent MPIs. We leverage the layered structure of MPIs to introduce three hierarchical interaction mechanisms: the inter interaction is designed to separate and recover background and reflection components, the intra interaction aims to reduce errors in information distribution within each plane, and the inner interaction focuses on optimizing the MPI structure itself. Ultimately, we successfully separate the mixed images and reconstruct two independent spatial representations. Compared to existing reflection removal methods, our approach not only achieves superior separation performance but also supports novel view synthesis of the separated results. In challenging scenarios with severe overlap between background and reflection, our method demonstrates a remarkable ability to improve separation quality.

Abstract:
Recent studies have highlighted the importance of contextual information for small object detection. However, existing methods rely solely on visual features and lack additional semantic guidance, which limits their ability to model key scene-level context in semantically rich, globally complex environments and to suppress irrelevant local context in densely cluttered scenes. These limitations hinder their effectiveness in uncrewed aerial vehicle (UAV) and similar complex scenes. To address these challenges, we propose TGCADNet (Text-Guided Context-Aware Detection Network). TGCADNet is a small object detection framework that leverages the CLIP (Contrastive Language–Image Pretraining) model’s global semantic understanding and image-text alignment capabilities to enhance context-aware detection. TGCADNet mainly consists of Text-Guided Scene-level Context-Aware (TG-SCA) and Text-Guided Local-Context Filtering (TG-LCF). Specifically, TG-SCA uses CLIP-generated text features to guide the model in accurately extracting key scene-level context from globally complex environments. Meanwhile, TG-LCF performs interactive computation between text and image features to filter high-quality local context, thereby reducing the impact of dense and cluttered local regions in UAV scenes. We validate the effectiveness of TGCADNet on the VisDrone, UAVDT, and AI-TOD-v2 datasets. Compared to the baseline, TGCADNet achieves an improvement of 1.8 in mAP@50 and 1.3 in mAP@50:95 on the VisDrone dataset. On the UAVDT and AI-TOD-v2 datasets, TGCADNet achieves improvements of 2.5 and 2.3 in mAP@50, respectively. Furthermore, TGCADNet surpasses recent SOTA methods in both accuracy and efficiency, demonstrating its effectiveness in detecting small objects in UAV and similar remote sensing scenes.

Abstract:
In recent years, video prediction has gained significant attention particularly in weather forecasting. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolutional operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection patterns in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.

Abstract:
Depression recognition based on facial movements has garnered widespread attention in recent years. One key challenge faced by existing studies is that individual heterogeneity in facial movements severely affects model performance. This paper, for the first time, proposes an innovative solution targeting two critical aspects: data acquisition and model design. In data acquisition, we replace the traditional interview task with a video-watching paradigm to eliminate mouth movements induced by verbal expression to obtain pure facial movement flows. In model design, we treat facial action units (AUs) as covariates to align discriminative depression cues to mitigate the impact of individual heterogeneity. Specifically, our proposed FDSNet statistically selects AUs showing significant differences between depressed and healthy groups and combines the AUs-based scoring module and graph attention network to guide the model in dynamically focusing on discriminative key video segments. Experimental results show that FDSNet achieves superior accuracy and generalization performance compared to SOTA methods. Our work suggests that controlling data acquisition to ensure pure facial movement data, along with employing a personalized modeling strategy, are both critical to mitigating individual heterogeneity and enhancing model performance.

Affiliations: School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, China; Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Normal University, Xining, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Electromechanical Engineering, State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China; Key Laboratory of System Software (Chinese Academy of Sciences) and the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China

Abstract:
Video object segmentation (VOS) requires robust tracking and segmentation of objects over long video sequences. Existing methods often suffer from error accumulation in memory updates and interference from cluttered backgrounds, which degrade performance in complex or long-term scenarios. To address these challenges, we propose a novel segmentation framework with two key innovations that distinguish it from existing methods. First, we introduce Discriminative Semantic-Positional Integration (DSPI), a selective memory update strategy that stores only high-confidence and semantically consistent features. Unlike traditional methods that store features at the frame level, DSPI suppresses the propagation of low-confidence information, effectively preventing error accumulation and ensuring stable, long-term object representations. Second, we propose a Query-Adaptive Discriminative Enhancement (QADE) mechanism, which adaptively locates potential target positions based on semantic cues, allowing the model to focus on relevant target regions while minimizing background interference. Unlike global attention mechanisms, which often absorb irrelevant background information, QADE dynamically adjusts the sampling locations, leading to more precise and robust target localization. Our method achieves state-of-the-art performance across multiple challenging benchmarks, demonstrating its effectiveness in handling occlusion, appearance changes, and background clutter. The proposed framework shows great potential for real-world video analysis applications. The code is available at https://github.com/tomato233144/FECNet
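The selective memory update described above can be illustrated with a minimal sketch. It is not the authors' DSPI implementation: the function names, the cosine-similarity test against an object prototype, and both thresholds are assumptions chosen only to show the idea of storing features per location rather than per frame.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_for_memory(features, confidences, prototype,
                      conf_thresh=0.8, sim_thresh=0.5):
    """Keep only features that are both confident and semantically consistent.

    features:    per-location feature vectors from the current frame
    confidences: per-location segmentation confidence in [0, 1]
    prototype:   reference feature for the tracked object (hypothetical)
    The two thresholds are illustrative values, not taken from the paper.
    """
    return [
        f for f, c in zip(features, confidences)
        if c >= conf_thresh and cosine(f, prototype) >= sim_thresh
    ]


# Only the first location is both confident and close to the prototype.
memory = select_for_memory(
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]],
    confidences=[0.95, 0.90, 0.30],
    prototype=[1.0, 0.0],
)
```

Filtering at the level of individual locations means a frame with a partly wrong mask can still contribute its reliable regions, while its low-confidence regions never enter the memory.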

Abstract:
Multimodal remote sensing combines optical and synthetic aperture radar (SAR) imagery to improve perception, yet real deployments face spatially varying degradations (e.g., clouds, low light, sensor interference) that can corrupt fusion. To make robustness measurable, we introduce a controlled mixed-severity setting in which only the optical stream is synthetically cloud-degraded while SAR remains intact, providing a standardized testbed for evaluating multimodal detection under modality imbalance. We further present CAIR-Net, a reliability–aware information routing network that follows a denoise-then-fuse principle: a Local Reliability Modulation (LRM) module learns soft, spatial reliability maps to suppress degraded regions before cross-modal interaction, and a Global Information Selection Mechanism (GISM) performs confidence-aware expert routing across optical, fused, and SAR experts. On the mixed-severity benchmark, CAIR-Net consistently outperforms strong unimodal and fusion baselines and exhibits a substantially smaller performance drop under severe clouds. These results indicate that explicit reliability modeling and quality-guided routing provide a practical path toward robust multimodal detection when one modality is partially or nearly completely occluded.

Abstract:
Image quality assessment (IQA) algorithms have significantly advanced over the past two decades, primarily focusing on natural images. However, applying these methods directly to medical imaging often yields suboptimal performance due to inherent differences such as the structural complexity of medical images and the limited availability of annotated databases. In this study, we conduct a comprehensive evaluation of state-of-the-art IQA methods, including 29 traditional full-reference (FR), 4 traditional no-reference (NR), and 9 deep learning-based approaches, to assess their effectiveness in the context of medical imaging. Our evaluation is performed on a recently developed MRI image quality assessment benchmark, revealing critical performance gaps in existing methods. Building on these findings, we propose a novel dual-branch deep learning framework specifically designed for medical IQA (MIQANet). The proposed approach effectively combines global contextual information with local structural details, enhancing the model’s ability to capture subtle degradations and structural inconsistencies in MRI scans. Experimental results demonstrate the superiority of our approach over existing methods, providing valuable theoretical and practical insights for enhancing quality assessment of medical images.

Abstract:
Compressed video quality enhancement (CVQE) is crucial for mitigating compression artifacts and improving perceptual visual quality, especially under diverse quantization parameters (QPs) and motion patterns. However, many existing approaches insufficiently exploit long-range temporal dependencies, and their reliance on QP-specific training often leads to limited robustness when compression conditions change. In this work, we propose a just noticeable difference (JND)-aware and perception-driven learning framework for CVQE, termed the Pairwise Spatio-Temporal Alignment Network (PSTAN). PSTAN incorporates perceptual priors primarily through a JND-guided training paradigm rather than relying solely on architectural modifications, where learning is driven by perceptually poor video segments identified in the VideoSet dataset. This strategy alleviates the reliance on QP-specific supervision and promotes more stable enhancement behavior across varying compression conditions. To effectively capture temporal dependencies, PSTAN employs a pairwise spatio-temporal interaction mechanism that models each reference-target frame pair independently, enabling adaptive utilization of both nearby and distant frames. In addition, a transformer-based alignment module combining temporal mutual attention with cascaded deformable convolution is introduced to handle complex and large motions. Extensive experiments on VideoSet, MFQE 2.0 and our constructed HEVC-compressed dataset show that PSTAN achieves consistent improvements over state-of-the-art CVQE methods in both objective and perceptual quality metrics. The code of this work is available at https://github.com/leryong/PSTAN.git

Abstract:
Test-Time Adaptation (TTA) has recently emerged as a promising research direction, enabling vision-language models (VLMs) to adapt to unlabeled test data in zero-shot settings. Among TTA approaches, test-time prompt tuning has shown great potential for enhancing the practical applicability of VLMs. However, existing methods typically either focus on adapting a single modality or apply uniform optimization to both modalities, without explicitly defining modality-specific optimization objectives. Such a one-size-fits-all strategy often results in suboptimal performance under test-time conditions. To address this limitation, we propose Dual-modality Heterogeneous Prompt Tuning (DHPT), a novel framework designed to simultaneously capture fine-grained textual semantics and alleviate domain shift noise in the visual modality. Specifically, we leverage a large language model to provide textual cognition guidance for the text encoder, while on the vision side, we develop a lightweight calibration module that adaptively mitigates domain shift noise across different scales. Furthermore, we introduce a cluster-tight optimization objective that enhances the stability and generalizability of prompt tuning under distribution shifts. Extensive experiments conducted on 11 benchmark datasets demonstrate that DHPT consistently and significantly outperforms existing TTA methods for VLMs.

Abstract:
Multi-view clustering (MVC) aims to uncover the consistent structure and latent distribution of data by integrating information from diverse perspectives. A series of MVC models have been developed, but most of them are limited. They either naively fuse multi-view data into a single view or process each view in isolation, thereby neglecting the complex relationships between views. Moreover, multi-stage approaches depend heavily on pre-learning and post-processing steps. To overcome these limitations, a one-stage MVC model, namely the embedded multi-view clustering approach with tensor self-representation and adaptive graph learning (EMCTGL), is proposed. In the proposed model, a step-structured data tensor is constructed and then decomposed to learn a cross-view self-representation tensor, effectively capturing the global topological relationships across all views. To achieve a clearer global subspace structure, a novel TLog-induced non-convex low-rank regularization is imposed on the rotated representation tensor. Under relaxed symmetry constraints, the self-representation tensor guides the learning of view-specific affinity graphs, and a σ-norm penalty is applied to promote approximate symmetry of the affinity graphs. Subsequently, the normalized view-specific graphs are adaptively fused and factorized into the final clustering indicator matrix by embedding the semi-non-negative decomposition within a one-stage framework. To reduce the computational complexity, EMCTGL is extended to an anchor-driven MVC model by determining anchors with an adaptive density-peak strategy. Effective optimization schemes are devised to solve these non-convex models. Extensive experiments on various real-world datasets demonstrate that EMCTGL outperforms current state-of-the-art techniques.

Abstract:
New generators for producing fake images continually emerge, which poses a serious challenge for developing a detector with strong generalization ability. To address this challenge, this paper presents a dual-branch cross-stage interactive detector, called DCNet, which comprises preprocessing, two parallel three-stage branches, and two cross-branch agent attention fusion (CBAF) modules placed between the two branches. The primary purpose of DCNet is to enhance generalization ability by using the two branches to grasp and refine, respectively, the periodic patterns exhibited by upsampling artifacts and the local details of upsampling artifacts. A multi-dilation similarity extraction module (MSEM) is placed at the beginning of the upper branch to capture the periodic grid-like patterns of different sizes exhibited by upsampling artifacts by incorporating the cosine similarity at varying dilation rates. Simultaneously, under the guidance of MSEM, a similarity-guided spatial artifact attention module (SSAM) is deployed at the beginning of the bottom branch to extract local contextual features related to upsampling artifacts using standard convolutions. Each CBAF is tailored to collect global contextual information from both branches at low computational cost, guide them to learn from each other to filter out the discriminative features, and provide feedback to each branch to facilitate their respective feature refinement. The local-global-guided convolution (LGConv) in the second and third stages is designed to generate adaptive convolutional kernel weights and position weights for each spatial position of the input feature map by means of local multi-scale features from the previous stage and the global contextual features from CBAF, thereby enhancing key information related to artifacts in deep layers of DCNet. Extensive experimental results demonstrate that the proposed DCNet significantly outperforms existing detectors in terms of generalization capability across various unseen generation models.

Abstract:
Video inpainting aims to reconstruct missing or corrupted regions in video frames, with applications in video editing, restoration, and special effects. Current deep video inpainting methods rely on optical flow to guide the propagation of effective features and spatiotemporal attention mechanisms to model relationships between frames. However, as an explicit motion representation, the optical flow extracted offline in preceding steps often suffers from instability and errors during estimation. These errors accumulate during subsequent content hallucination, resulting in artifacts and blurring. Meanwhile, although traditional spatiotemporal attention effectively captures frame relationships, its dense computational nature introduces redundant information, disrupting inpainting tasks and reducing efficiency. To address these issues, we propose an implicit motion-guided approach for efficient video inpainting. Instead of relying on optical flow, our method uses implicit motion in the latent feature space to guide the dual-domain propagation of images and features end-to-end, avoiding error accumulation from the independent optical flow estimation process. Additionally, we introduce a self-correcting module that enables feedback between image and feature propagation, reducing errors during propagation. Furthermore, we design an adaptive sparse video attention mechanism to focus on highly relevant regions, minimizing the impact of irrelevant information. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches both qualitatively and quantitatively, while also delivering superior efficiency.

Abstract:
Human body dynamics, as a temporal variation pattern of pose sequences in 3D skeleton-based human motion prediction, has been extensively studied in spatial-temporal dependency modeling with deep learning. However, designing an effective modeling approach that fully harnesses physical principles to enhance algorithmic performance remains a challenge. Existing approaches prioritize displacement information, processing deterministic physical parameters via standard neural networks while modeling rotational motion through simplified angular constraints. Such physical approximation methods neglect the high-dimensional and dynamic characteristics of dynamics variables, undermining the integrity and diversity of human motion feature representations. To alleviate these limitations, we propose an Adaptive Multi-scale Lagrange Dynamics Spatial-Temporal Network (AMLD-STNet), which directly embeds learnable neural network modules within physical equations to activate multi-scale dynamic physical feature modeling of human motion. Specifically, a Lagrange Dynamics Network (LD-Net) is constructed, which designs a set of joint force adjacency matrices to analyze the mechanical correlation between the velocity and acceleration of each joint motion through the Lagrange dynamics equation. Subsequently, the Lagrange Dynamics Spatial-Temporal Network (LD-STNet) is established, which utilizes LD-Net to extract multi-perspective high-dimensional features of human displacement and rotational motion represented by dynamics pose variables. To capture the mechanical correlation of joint node groups, we design a multi-scale-stream LD-STNet, which realizes adaptive scale transformation according to the joint force adjacency. Additionally, an Euler angle loss is employed to enforce rotational consistency constraints, thereby enhancing physical realism during network training. Finally, extensive experiments are conducted on three popular benchmarks (Human3.6M, AMASS, and 3DPW), on which AMLD-STNet achieves state-of-the-art results with a smaller model size.

Abstract:
Multimodal medical image fusion integrates complementary information from different modalities in terms of structure and function, playing a crucial role in disease diagnosis, surgical planning, and treatment evaluation. However, existing methods face challenges such as structural misalignment, semantic inconsistency, and noise interference during cross-modal alignment and fusion. To address these issues, this paper proposes a multimodal image fusion method based on manifold structure modeling and information geometric enhancement. In the feature alignment stage, a cross-modal manifold diffusion mechanism is designed to model fine-grained relationships between modalities, while Fisher information metric is incorporated to improve the consistency of global semantic space representation. During the feature fusion stage, a gated hybrid attention mechanism is developed to dynamically regulate the contributions of each modality, alongside a post-fusion reconstruction module to strengthen the model’s ability to preserve key features from the original modalities. Experiments on multiple fusion tasks using publicly available medical imaging datasets—including CT-MRI, PET-MRI, and SPECT-MRI—demonstrate that the proposed method outperforms existing mainstream approaches in terms of image clarity, structural consistency, and semantic fidelity.

Abstract:
Most existing deep learning-based trajectory prediction algorithms heavily rely on human expertise, involving iterative manual tuning of their architectures and parameters to tailor prediction models for specific tasks or scenarios. This approach is not only complex to implement and inefficient, but also struggles to balance inference speed with prediction accuracy. To address this challenge, this paper proposes an improved heterogeneous multi-agent trajectory prediction algorithm utilizing graph neural architecture search. This method automatically conducts an end-to-end graph architecture search to obtain an optimal trajectory prediction model. To enhance model interpretability and its heterogeneous awareness of diverse scenarios, we design a physics- and risk-interaction-based guidance mechanism to steer the architecture search process. Furthermore, we construct a novel neural architecture search loss function, SocialMI-Loss, which comprehensively considers multiple factors such as prediction accuracy, driving region semantic constraints, and model complexity. This function is intended to guide the learning of the trajectory predictor, achieving a harmonious balance between accuracy and computational complexity. A comprehensive series of comparative experiments conducted on three large-scale autonomous driving datasets (nuScenes, Argoverse, and ApolloScape) consistently demonstrates the superior performance of our proposed method. Experimental results indicate that our framework achieves performance comparable to current state-of-the-art methods, while its automatically searched architecture remains remarkably lightweight. Our code is available at: https://github.com/Tu5tra/TrajGNAS

Abstract:
Recent advancements in vision-centric multi-task learning have greatly impacted autonomous driving, with a focus on constructing efficient and rich Bird’s Eye View (BEV) representations. While these methods achieve impressive performance, they often suffer from structural complexity and high computational costs due to the need for dense BEV representations. To address these challenges, we propose UniSparseBEV, a simple and efficient vision-based multi-task learning framework based on sparse queries. We introduce a set of learnable shared queries to facilitate information exchange across tasks. Additionally, we propose the Z-axis Deformable Cross-Attention (Z-DCA) module, which enables BEV segmentation task queries to directly extract information from image features without requiring dense BEV representations. To further enhance training efficiency, we incorporate 2D supervision into the network. Extensive experiments on the nuScenes dataset demonstrate that UniSparseBEV outperforms existing single-task methods in 3D object detection and BEV segmentation. A detailed robustness analysis is also conducted on the UniSparseBEV framework. We hope UniSparseBEV can serve as a strong baseline for multi-tasking in autonomous driving.

Abstract:
Automated neonatal pain recognition and assessment based on deep learning is an emerging interdisciplinary topic that combines clinical pediatric medicine and affective computing. To improve the recognition accuracy of neonatal pain facial expressions, this paper proposes a Cross-hierarchical Multi-head Sparse Vision Transformer Network (CMS-ViT). Based on the Transformer architecture, the network presents a multi-head dynamic token sparsification fusion module, which performs dynamic feature selection and information fusion through three stages: selection, pairing, and fusion. This module sparsifies the tokens involved in computation to reduce redundancy. By inserting the sparsification module into different hierarchical layers, the network gradually reduces computational complexity. A cross-hierarchical feature fusion module is then embedded into the backbone to integrate semantic information from different depths, mitigating information loss caused by sparsification and fusion, and ultimately generating more discriminative feature representations by leveraging high-level semantic cues. In addition, the model utilizes pretrained parameters obtained via masked autoencoding on large-scale facial expression datasets, enhancing performance on the neonatal pain facial expression recognition task. Experimental results show that CMS-ViT achieves state-of-the-art (SOTA) performance on the Facial Expression of Neonatal Pain (FENP) dataset and demonstrates good generalization on the AffectNet and RAF-DB datasets.
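The three stages named above (selection, pairing, fusion) can be sketched generically as follows. This is only an illustration under stated assumptions, not the CMS-ViT module: the importance scores, the dot-product pairing rule, the averaging fusion, and the function name `sparsify_tokens` are all hypothetical, and the multi-head aspect is omitted.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparsify_tokens(tokens, scores, keep):
    """Reduce a token list to `keep` tokens via selection, pairing, and fusion.

    tokens: list of feature vectors
    scores: one importance score per token (assumed given, e.g. from attention)
    keep:   number of tokens that survive
    """
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Selection: rank tokens by score and keep the top `keep` (original order preserved).
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])
    dropped_idx = order[keep:]

    # Pairing: assign each dropped token to its most similar kept token.
    groups = {i: [tokens[i]] for i in kept_idx}
    for j in dropped_idx:
        best = max(kept_idx, key=lambda i: dot(tokens[i], tokens[j]))
        groups[best].append(tokens[j])

    # Fusion: average every kept token with the dropped tokens paired to it.
    return [
        [sum(channel) / len(groups[i]) for channel in zip(*groups[i])]
        for i in kept_idx
    ]


fused = sparsify_tokens(
    tokens=[[1.0, 0.0], [0.0, 1.0], [0.8, 0.0], [0.0, 0.6]],
    scores=[0.9, 0.8, 0.1, 0.2],
    keep=2,
)
```

Fusing dropped tokens into their nearest kept token, rather than discarding them, is one way to shorten the sequence while retaining some of the information the dropped tokens carried.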

Abstract:
Omnidirectional depth estimation predicts 360-degree depth information using multiple fisheye cameras arranged in a surround-view configuration. However, due to the lack of a reference panorama and the differences between the predicted depth viewpoint and the input cameras, it is challenging to construct and utilize semantic information to improve depth accuracy, resulting in limited accuracy in complex regions such as non-overlapping areas, weak textures, object boundaries, and occlusions. This paper proposes a novel model architecture that effectively extracts and leverages semantic information to enhance the accuracy of omnidirectional depth estimation. Specifically, the proposed algorithm combines the variance and mean of multi-view image features to construct the fused matching cost and utilizes both geometric and semantic constraints. The model extracts 360-degree semantic context during matching cost aggregation and predicts the corresponding panoramas jointly with omnidirectional depth maps. A semantic-aware spatial propagation module is then employed to further refine the depth estimation. We leverage a multi-scale multi-task learning strategy to supervise the prediction of omnidirectional depth maps and panoramas jointly. The proposed approach achieves state-of-the-art performance on public datasets and also demonstrates high-precision results on real-world data. The experiments with varying camera configurations validate the generalization ability and flexibility of the algorithm.
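As a rough illustration of combining the variance and mean of multi-view features into one matching cost, the sketch below handles a single spatial location and depth hypothesis. The function name and the choice to concatenate the two statistics are assumptions; the abstract does not state how the two are combined.

```python
from statistics import fmean, pvariance


def fused_matching_cost(view_features):
    """Concatenate per-channel variance and mean across camera views.

    view_features: one feature vector per camera, all of equal length,
    sampled at the same depth hypothesis (hypothetical input layout).
    Low variance means the views agree (a geometric cue); the mean keeps
    the shared appearance content (usable as a semantic cue).
    """
    if not view_features:
        raise ValueError("at least one view is required")
    channels = list(zip(*view_features))  # regroup values channel by channel
    variance = [pvariance(c) for c in channels]
    mean = [fmean(c) for c in channels]
    return variance + mean


# Two views, two channels: the views disagree on channel 0 and agree on channel 1.
cost = fused_matching_cost([[1.0, 2.0], [3.0, 2.0]])
```

In a full cost volume this computation would be repeated for every pixel and every depth hypothesis, typically with tensor operations rather than Python lists.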

Abstract:
Existing Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) methods mainly follow a semantic segmentation paradigm that relies on pixel-wise probabilities, leading to overconfident mispredictions. In contrast, the random sampling process of the diffusion model allows multiple possible predictions to be drawn from the mask distribution, effectively alleviating this problem. However, existing diffusion models mainly use Transformers as conditional feature extraction networks. Although they are good at global modeling, they have limited ability to handle long-range dependencies due to computational complexity. To overcome these challenges, we introduce MambaDif, an innovative diffusion model architecture based on Mamba. Specifically, we regard ORSI-SOD as a conditional mask generation task: noise is added to the mask, and the diffusion model iteratively denoises it to match the target distribution. Then, we adopt Mamba to extract global features, efficiently process long sequences, and capture global contextual information with linear complexity. In addition, we introduce the global-local feature collaborative completion module (GLM), which combines the ability of convolutional layers to extract local features with the advantage of Mamba in capturing long-range dependencies, thereby achieving excellent denoising performance. Extensive experiments show that MambaDif outperforms SOTA methods in eight evaluation metrics on two standard datasets (EORSSD and ORSSD). We also report the generalization performance of the model on the challenging ORSI-4199 dataset to evaluate its robustness.
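The noising half of such a mask diffusion process can be sketched with the standard closed-form forward step used in denoising diffusion models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, its default values, and the function names below are assumptions; the abstract does not specify MambaDif's schedule or mask encoding.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bar.append(product)
    return alpha_bar


def noise_mask(mask, alpha_bar_t, noise=None, rng=random):
    """Forward step: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

    mask:  flattened saliency mask values (x_0)
    noise: optional pre-drawn Gaussian noise; sampled from `rng` if omitted
    """
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in mask]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * m + noise_scale * e for m, e in zip(mask, noise)]


schedule = linear_alpha_bar(1000)
noisy = noise_mask([1.0, -1.0, 1.0], schedule[500])
```

At sampling time the process runs in reverse: starting from pure noise, a network conditioned on image features removes noise step by step, and different noise draws yield different plausible masks.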

Affiliations: The Hong Kong University of Science and Technology (Guangzhou), Guangdong, China; School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China; Qingdao Institute of Software, China University of Petroleum (East China), Qingdao, China; School of Cyberspace Security, University of Science and Technology of China, Hefei, China; School of Information Engineering, Yancheng Institute of Technology, Yancheng, China; School of Navigation, Wuhan University of Technology, Wuhan, China

Abstract:
Automated waterway environment perception is crucial for enabling autonomous surface vessels (ASVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models, we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it comprises 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for ASVs, which features a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model’s ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks. The project is available at https://github.com/GuanRunwei/WaterCaption

Abstract:
Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) that works only at a fixed scale (e.g., ×4), enabling a single model to achieve arbitrary-scale SR. Most ASSR methods explicitly incorporate implicit neural representation (INR) to achieve ASSR, but the inherently regression-driven nature of INR’s feature extraction and aggregation restricts their capacity to synthesize meticulous details, leading to low realism. Recently, diffusion-based realistic image super-resolution (Real-ISR) methods have leveraged the pre-trained diffusion prior and shown promising results at the ×4 scale. We find that they can also achieve ASSR because the powerful pre-trained diffusion prior implicitly performs SR scale adaptation by encouraging the model to always generate high-realism images. However, due to the lack of explicit SR scale controls, the model fails to effectively manage the diffusion behavior according to different SR scales, causing either excessive hallucination or blurry results, especially at ultra-high magnification. To address these limitations, we propose OmniScaleSR, a novel diffusion-based realistic arbitrary-scale super-resolution (Real-ASSR) method that achieves both high-fidelity and high-realism ASSR. We introduce explicit diffusion-native SR scale controls, which can be elegantly coupled with the implicit scale adaptation, unleashing a scale-controlled diffusion prior to dynamically manage the diffusion behavior in a content- and scale-aware manner. Furthermore, we incorporate multi-domain fidelity enhancement designs to achieve more faithful reconstruction. Extensive experiments on both bicubic degradation benchmarks and real-world datasets demonstrate that OmniScaleSR consistently outperforms state-of-the-art methods in terms of both fidelity and perceptual realism, with especially strong performance under high-magnification scenarios. Code will be available at https://github.com/chaixinning/OmniScaleSR

Abstract:
Sufficient embodied scene understanding serves as the foundation for embodied agents to perceive, interpret, and solve scene-related questions in a scene. Such understanding is often constrained by limited perception, which can be summarized into two aspects: 1) the agent's lack of perceptual abilities and 2) the incompleteness of the scene itself. Existing large-model-based methods attempt to leverage implicit knowledge to overcome these limitations, but this lacks interpretability and controllability. Inspired by human explainable associative thinking, we propose the SceneReasoner framework, which imposes explicit functional associative rules on LLMs to guide the process of scene understanding. This framework mines deeper functional relationships between objects, enabling the agent to gain sufficient scene understanding from limited perception in a controllable manner. Specifically, SceneReasoner employs an associative knowledge base to provide such rules from two aspects: 1) functional complementarity of objects in a scene. For instance, in a computer workspace, if the agent perceives a monitor and other unclear objects, it can first analyze the function of the monitor in this area (content display), and then infer the presence of other related objects (e.g., a mouse and keyboard for content input); and 2) commonality of objects in a scene. For example, in a picture-hanging task, when the hammer in the scene is missing, the agent needs to first identify the hammer’s attributes (hard and applying force), and then associate a suitable substitute in the scene (e.g., a hard wrench). Due to such explicit functional association, the agent can rapidly form a sufficient scene understanding and effectively solve scene-related questions. Experimental results demonstrate that such explicit association augmented with functional reasoning can significantly enhance agents’ scene understanding under limited perception. It improves perceptual quality by 9.75% and scene reasoning ability by 21.42% compared with other methods.

Affiliations: School of Computer Science, Nanjing University of Information Science and Technology and Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing, China; School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, China; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.; Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland

Abstract:
Inspired by the dual-stream (dorsal and ventral streams) theory of the human visual system (HVS), recent Video Quality Assessment (VQA) methods have integrated Contrastive Language-Image Pretraining (CLIP) to enhance semantic understanding. However, as CLIP is originally designed for images, it lacks the ability to adequately capture the temporal dynamics and motion perception (dorsal stream) inherent in videos. To address this limitation, we propose DVLTA-VQA (Decoupled Vision-Language Modeling with Text-Guided Adaptation), which decouples CLIP’s visual and textual components to better align with the NR-VQA pipeline. Specifically, we introduce a Video-Based Temporal CLIP module and a Temporal Context Module to explicitly model motion dynamics, effectively enhancing the dorsal stream representation. Complementing this, a Basic Visual Feature Extraction Module is employed to strengthen spatial detail analysis in the ventral stream. Furthermore, we propose a text-guided adaptive fusion strategy that leverages textual semantics to dynamically weight visual features, facilitating effective spatiotemporal integration. Extensive experiments on multiple public datasets demonstrate that the proposed method achieves state-of-the-art performance, significantly improving prediction accuracy and generalization capability.

Abstract:
Recent advances in single-view 3D scene reconstruction have highlighted the challenges in capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that can faithfully reconstruct challenging outdoor scenes from a single input image for the first time. Our approach takes monocular depth and normal estimates as input, which substantially improves its ability to capture fine details and mitigates common issues such as geometric detail loss and deformation. Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometric constraints, combining the structural properties of explicit geometry with the adaptability of implicit feature fields and striking a balance between efficient rendering and high-fidelity reconstruction. Finally, our framework adopts a specialized encoder-decoder architecture in which a depth-based 3D Gaussian decoder predicts 3D Gaussian parameters that can be used for novel view synthesis. Extensive results and analyses suggest that Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly enhancing geometric accuracy and visual fidelity, especially in outdoor scenes. Webpage: https://ai-kunkun.github.io/Niagara_page.

Abstract:
Dense focal stack images inherently encode depth cues and are crucial for various 3D vision applications. However, existing generation methods are susceptible to misalignment and introduce a domain gap between synthetic and real-world data due to off-axis aberrations. To address these challenges, we introduce DFS-Net, an aberration-aware dense focal stack image generation network. DFS-Net consists of two core modules: all-in-focus image synthesis and aberration-aware point spread function (PSF) generation. The all-in-focus image synthesis is achieved through a densely connected fusion network based on multi-scale focus migration and focus property detection. This fusion network can effectively fuse misaligned multi-focus images into an all-in-focus image. The aberration-aware PSF generation is realized through a multi-layer perceptron (MLP) network. Supervised by ray-tracing-based PSFs, the MLP network can generate spatially varying PSFs for arbitrary spatial positions and focus distances. By selecting a set of focus distances, the generated PSF maps are locally convolved with the all-in-focus image to produce an aberration-aware dense focal stack. We conduct extensive comparative experiments on all-in-focus image fusion and focal stack generation against state-of-the-art methods. The experimental results demonstrate that DFS-Net can synthesize all-in-focus images with high subjective and objective quality, as well as generate dense focal stacks that closely approximate ray-tracing results. In addition, we conduct comparative experiments on the depth-from-focus and salient object detection tasks using the generated focal stacks. The experimental results demonstrate that our DFS-Net can significantly enhance the performance of existing depth-from-focus and salient object detection models. The code and dataset will be publicly available at https://github.com/North-Li/DFS-Net

Abstract:
Most recent image segmentation methods exhibit an extreme trade-off between performance and efficiency: high-performing approaches typically have low computational efficiency, while efficient methods compromise on segmentation accuracy. To address this dual challenge, this study introduces a simple yet efficient segmentation framework based on Multi-scale Prototype matching and visual Sparse Attention mechanisms (MPSA), a transformer-based architecture designed to optimize the balance between performance and efficiency. The proposed MPSA integrates a novel lightweight cross-attention mechanism and a prototype selection and filtering strategy to accurately correlate category queries with corresponding visual objects using a multi-scale Feature Pyramid Network (FPN). Within the pixel decoder, our Axial Convolution Enhanced (ACE) module mitigates the loss of global context by combining depth-wise separable convolutions with deformable convolutions, thereby recovering global semantics while preserving fine-grained spatial details. Through this design, MPSA demonstrates outstanding performance in both semantic and panoptic segmentation tasks across multiple datasets. Remarkably, compared with some state-of-the-art architectures, MPSA achieves 83.9% mIoU with only 114M parameters on the Cityscapes dataset, highlighting its ability to deliver exceptional results with significantly reduced resource consumption. Our code is released at https://github.com/zxqing01/MPSA
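
For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how mean Intersection-over-Union can be computed for a single flattened label map. It is not taken from the MPSA code; the function name `mean_iou` and the flat-list input format are assumptions for illustration, and benchmark implementations typically accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: flat lists of integer class labels of equal length
    (hypothetical input format chosen for this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.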

Abstract:
Deep learning methods, especially Convolutional Neural Networks (CNNs) and Transformers, have achieved great success in high-resolution remote sensing analysis. However, CNNs struggle to model long-range dependencies because of their fixed receptive fields, and Transformers suffer from quadratic computational complexity relative to image resolution. The RWKV model achieves breakthroughs in natural language processing (NLP) through its linear-complexity sequence modeling; however, it exhibits anisotropic limitations in vision tasks due to the constraints of its one-dimensional scanning mechanism. To address these challenges, we adapt the RWKV architecture to high-resolution remote sensing and propose the Remote Sensing RWKV (RSRWKV) model, which incorporates a Linear-Complexity 2D Attention Mechanism. Specifically, RSRWKV employs a novel 2D-WKV scanning mechanism that bridges sequential processing with two-dimensional spatial reasoning while maintaining linear computational complexity. This design facilitates the aggregation of isotropic contexts in multiple spatial directions. In addition, the MVC-Shift module further optimizes multiscale receptive field coverage, whereas the Efficient Channel Attention (ECA) module improves cross-channel feature interaction and semantic saliency modeling. Experimental evaluations on the NWPU RESISC45, VHR-10 v2, SSDD and GLHWater datasets demonstrate that RSRWKV surpasses CNN and Transformer baselines in classification, detection and segmentation tasks, establishing a scalable framework for high-resolution remote sensing analysis. Code available at https://github.com/Ling-yunchi/RSRWKV

Abstract:
Learning-based approaches have achieved promising progress in High Dynamic Range (HDR) image reconstruction, particularly in ghost removal. However, they often struggle in ill-posed regions, such as areas with occlusion or saturation, where insufficient or unreliable information leads to persistent residual ghosting artifacts and structural distortions. In this paper, we present UA-Diff, an uncertainty-aware diffusion framework designed to generate visually coherent, ghost-free HDR images. Specifically, our approach introduces an Uncertainty Generation Module (UGM) that estimates pixel-wise reconstruction confidence via a probabilistic Laplacian loss, producing an uncertainty map that explicitly highlights challenging ill-posed regions. To address these regions effectively, we develop an Uncertainty-Aware Diffusion Module (UADM) that operates selectively on the average-coefficient component of a 2D discrete wavelet transform, where dominant artifacts tend to concentrate. This enables reduced computational overhead while preserving high-quality details. Moreover, we propose an Uncertainty-Guided Sampling (UGS) strategy that leverages the uncertainty map to guide the denoising process, ensuring faithful reconstruction in reliable regions and targeted refinement in uncertain areas. Extensive experiments on three public HDR benchmarks demonstrate that UA-Diff surpasses state-of-the-art methods both quantitatively and perceptually, especially in challenging ill-posed scenarios.
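
The "probabilistic Laplacian loss" mentioned above is not specified further in the abstract. A common form of such a loss is the negative log-likelihood of a Laplace distribution, where the network predicts both a value and a per-pixel scale that acts as the uncertainty. The sketch below assumes that form; the function name `laplacian_nll` and the choice to predict the log-scale are illustrative assumptions, not details confirmed by the paper.

```python
import math

def laplacian_nll(pred, target, log_b):
    """Mean negative log-likelihood under a Laplace distribution.

    pred, target: lists of predicted and ground-truth pixel values.
    log_b: list of predicted log-scales (assumed parameterization);
           a larger scale means higher predicted uncertainty.
    """
    total = 0.0
    for p, t, lb in zip(pred, target, log_b):
        b = math.exp(lb)  # exponentiate so the scale is always positive
        # -log( 1/(2b) * exp(-|t - p| / b) ) = |t - p| / b + log(b) + log(2)
        total += abs(t - p) / b + lb + math.log(2.0)
    return total / len(pred)
```

Under this form, a pixel with a large error is penalized less if the model also predicts a large scale, but the `log(b)` term discourages inflating the scale everywhere. That is why the learned scale map can serve as an uncertainty map.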

Abstract:
Human action recognition in low-light environments is crucial for various real-world applications. However, existing methods do not fully utilize brightness information throughout the training phase, leading to suboptimal performance. To address this issue, we propose OwlSight, a biomimetic framework in which whole-stage illumination enhancement interacts with action classification for accurate dark video human action recognition. Specifically, OwlSight incorporates a Time-Consistency Module (TCM) to capture shallow spatiotemporal features while maintaining temporal coherence; these features are then processed by a Luminance Adaptation Module (LAM) that dynamically adjusts the brightness based on the input luminance distribution. Furthermore, a Reflect Augmentation Module (RAM) is presented to maximize illumination utilization and simultaneously enhance action recognition via two interactive paths. Additionally, we build a large-scale dataset, Dark-101, which comprises 21,030 dark videos across 101 action categories, significantly surpassing the existing datasets (e.g., ARID1.5 and Dark-48) in scale and diversity. Our method establishes new state-of-the-art (SOTA) performance across all benchmarks, achieving Top-1 accuracies of 99.27% on ARID (a 2.0% improvement), 94.85% on ARID1.5 (a 5.36% improvement), 48.24% on Dark-48 (a 1.56% improvement), and 53.85% on Dark-101 (a 1.72% improvement), demonstrating its superior effectiveness in challenging dark video environments.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a promising representation for immersive media. Its explicit splat-based structure offers high visual quality and real-time rendering, making it particularly suitable for six degrees of freedom streaming applications. However, its deployment in practical streaming scenarios is still limited by several key challenges, such as large data volume and insufficient support for dynamic bitrate adaptation under fluctuating network conditions. This paper presents an efficient 3DGS streaming framework that operates directly on pre-generated 3DGS models without retraining or fine-tuning. First, we introduce a training-free perceptual pruning method that removes visually redundant Gaussians according to human visual system metrics. The resulting 3DGS is then encoded into a compact representation using the extended 3D codecs, exploiting its point-based structure. Next, we build a scene-specific bitrate ladder by analyzing the trade-off between resolution, bitrate, and perceptual quality. This enables efficient and fine-grained representation selection. Finally, a progressive streaming mechanism is developed. It is driven by a reinforcement learning scheduler that adaptively decides whether to download new content or enhance previously buffered content based on real-time network feedback. Experiments on real-world 3DGS datasets and bandwidth traces show that the proposed method clearly improves the quality of experience and streaming efficiency in various network scenarios.

Abstract:
Video Synthetic Aperture Radar enables high-resolution, continuous imaging of observed scenes under all-weather and day-night conditions. Nevertheless, video SAR image reconstruction remains challenged by substantial data volumes and high computational complexity. This study addresses these limitations by exploiting temporal redundancy in sequential frame data. Through systematic analysis of video SAR data characteristics, we formulate video SAR imaging as a sparse tensor recovery problem by introducing a tailored correlation function to leverage inter-frame dependencies. An iterative solution is derived by integrating the alternating direction method of multipliers (ADMM) and proximal-alternating inexact minimization (P-AIM) frameworks. Based on this formulation, we propose an imaging network (ViSAR-UTNet) by unfolding the iterative process into a Transformer architecture. ViSAR-UTNet comprises two core modules: a weighted self-attention (WSA) mechanism that learns inter-frame correlations and a linearized ADMM (LADMM) operator for sparse tensor recovery. By leveraging the unfolded Transformer structure, ViSAR-UTNet effectively exploits data redundancy, thereby enabling high-quality video reconstruction from reduced measurements. Experiments on synthetic and real datasets are conducted to validate ViSAR-UTNet. The results demonstrate enhanced reconstruction accuracy and computational efficiency of the proposed method.
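
The abstract does not state which sparsity penalty the ADMM-based recovery uses. As a hedged illustration of the kind of step such iterations typically contain, the sketch below shows element-wise soft-thresholding, which is the proximal operator of the L1 norm and a standard building block of ADMM-style sparse recovery. The function name `soft_threshold` and the assumption of an L1 penalty are illustrative and not taken from the paper.

```python
def soft_threshold(values, tau):
    """Element-wise soft-thresholding: sign(v) * max(|v| - tau, 0).

    values: list of real coefficients (assumed real-valued for simplicity;
            SAR data is complex in practice).
    tau: non-negative threshold controlling how strongly sparsity is enforced.
    """
    out = []
    for v in values:
        mag = abs(v) - tau
        if mag <= 0.0:
            out.append(0.0)      # small coefficients are set exactly to zero
        elif v > 0.0:
            out.append(mag)      # shrink positive coefficients toward zero
        else:
            out.append(-mag)     # shrink negative coefficients toward zero
    return out
```

In an unfolded network, a threshold such as `tau` is usually one of the quantities learned per stage instead of being hand-tuned.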

Abstract:
3D single object tracking (3D SOT) remains a challenging task due to the sparsity of point clouds, appearance variations caused by occlusions, and the difficulty of modeling long-term temporal context. Although recent Transformer-based approaches leverage memory mechanisms to propagate temporal information, their quadratic complexity and reliance on discrete historical snapshots limit both efficiency and temporal coherence. To address these limitations, we propose SSMTrack, a novel 3D SOT framework built upon state space models (SSMs), which efficiently models long-term temporal dependencies through a continuously evolving hidden state with linear complexity. Specifically, we introduce a serialization and bidirectional scanning (SBS) strategy to enhance intra-frame feature interactions and design a Target-Aware Encoder (TAE) to extract target cues while maintaining stable temporal representations. Furthermore, we propose a Temporal Causal Shape Learning (TCSL) mechanism that preserves critical historical information while adaptively integrating current inputs, progressively enriching target feature representations over time. Extensive experiments on three benchmark datasets demonstrate that SSMTrack achieves state-of-the-art performance with strong temporal coherence and high efficiency. The code will be released upon publication.

Abstract:
Surface Defect Detection (SDD) aims to accurately localize defects based on predefined category labels in industrial manufacturing. Different from generic object detection, the industrial environment introduces significant challenges due to interference and unrelated background textures, leading to increased confusion between defect and non-defect features. In this work, we identify and analyze the structural characteristics and relations inherent in defect features. This analysis enables effectively distinguishing defects from non-defect areas, thereby enhancing the discriminative power for surface defect detection. Based on this insight, we propose a novel surface defect detection framework, named UnfoldDet. This framework focuses on separating defect and non-defect features and reasoning about the relations among defects. Specifically, we formulate the feature separation as an optimization problem with structural constraints. By expressing its iterations as network stages, we introduce an unfolding fusion module (UFM) to progressively separate and fuse multi-scale features. At the instance level, we propose a hierarchical relation encoder (HRE) to capture the inherent relations among defect instances. Through reasoning on positional and categorical relations, only highly related defect features are enhanced, while unrelated non-defect features are suppressed. Through extensive quantitative and qualitative experiments, as well as ablation studies on real-world datasets including ESD, CSD, and NEU-DET, we demonstrate the effectiveness of the proposed UnfoldDet in terms of both performance and computational efficiency. The code is available at https://github.com/xiuqhou/UnfoldDet

Abstract:
Industrial defect detection is crucial in the field of industrial production. Recently, industrial defect detection methods based on deep learning have been proven to effectively address most industrial defect detection tasks. However, these methods are unable to detect and identify defects of both known and new categories simultaneously. In industrial settings, the emergence of new types of defects is a common occurrence, rendering traditional closed set defect detection approaches inadequate for identifying these new defects. Therefore, to tackle this limitation, we introduce the concept of open set defect detection in industrial contexts, specifically formulated to recognize new types of defects. Furthermore, to address the challenges of limited training samples and small differences between defect classes in industrial defect open set detection, we propose a novel two-stage end-to-end methodology, termed Open Set Industrial Defect Detection (Open-IndDet). The core idea of Open-IndDet is to increase the discrimination boundary between known and new classes of defects by enhancing defect structural features and extracting class-unique robust features. Specifically, Open-IndDet adopts a class-unique robust feature extraction strategy for open set recognition. This strategy fuses the structural features of defects in the defect confidence image with the defect features in the original image, thereby enhancing the expression of the structural features of defects and achieving full learning of defect features under limited training samples. In addition, this strategy extracts robust class-unique features by constraining the uniqueness of features within defect classes and enhancing the entropy of inter-class distribution differences. Such a strategy can increase the differences between the features of defects of different classes, achieving accurate classification of known classes and identification of unknown classes. The experimental results show that on our constructed standard open set detection dataset ID-OSD and the public dataset MVTec, the proposed Open-IndDet achieves the best performance compared to existing advanced open set recognition methods.

Abstract:
The increasing frequency of global wildfires has led to the destruction of vast forests and wetlands. Non-contact remote sensing technologies provide an effective means for accurate burned area segmentation (BAS). However, existing BAS methods often treat each image independently, focusing primarily on local pixel contexts while neglecting the broader semantic consistency of burned regions across different scenes. The lack of global context modeling limits their robustness, as burned areas typically exhibit distinctive and consistent visual characteristics such as color and texture across diverse environments. To address this limitation, we propose a Self-image and Cross-image Consistency Learning (SCCL) framework, which captures both local pixel-level relationships within a single image and global semantic dependencies across multiple images. By enforcing consistent and compact representations of burned regions within and across images, SCCL enhances segmentation robustness under varying weather and terrain conditions. Additionally, to refine boundary delineation between burned and unburned areas, we introduce a Burned Edge Injector (BEI) and an Edge-Injected Decoder (EID). We further construct two large-scale BAS benchmark datasets, BAS-AUS and BAS-EUR, for comprehensive evaluation. Experiments on these benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming previous approaches, with MAE reduced to 0.017 and 0.016, respectively. The new BAS benchmarks and code are available at https://github.com/VisionVerse/SCCL

Abstract:
The landscape of video recognition has undergone a significant transformation, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures in order to achieve better accuracy. While CNNs, especially 3D variants, have excelled in capturing spatiotemporal dynamics for action recognition, recent developments in Transformer models, with their self-attention mechanisms, have proven highly effective in modeling long-range dependencies across space and time. Despite their state-of-the-art performance on prominent video recognition benchmarks, the computational demands of Transformers, particularly in processing dense video data, remain a significant hurdle. To address these challenges, we introduce a lightweight Video Focal Modulation Network named DVFL-Net, which distills the spatio-temporal knowledge from a large pre-trained teacher to a nano student model, making it well-suited for on-device applications. By leveraging knowledge distillation and spatial-temporal feature extraction, our model significantly reduces computational overhead (approximately 7×) while maintaining high performance on video recognition tasks. We combine the forward Kullback–Leibler (KL) divergence and spatio-temporal focal modulation to distill the local and global spatio-temporal context from the Video-FocalNet Base (teacher) to our proposed nano VFL-Net (student) model. We extensively evaluate our DVFL-Net, both with and without forward KL divergence, against recent state-of-the-art HAR approaches on the UCF50, UCF101, HMDB51, SSV2 and Kinetics-400 datasets. Further, we conduct a detailed ablation study in forward KL divergence settings and report the resulting observations. The results confirm the superiority of the distilled VFL-Net (i.e., DVFL-Net) over existing methods, highlighting its optimal tradeoff between performance and computational efficiency, including reduced memory usage and lower GFLOPs, making it a highly efficient solution for HAR tasks. Code: https://github.com/iscaas/DVFL-Net
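
As a rough illustration of the forward KL divergence used for distillation, the sketch below computes KL(teacher || student) between softened class distributions for one sample. The function names and the `temperature` parameter are assumptions for this example; the abstract does not specify the temperature, loss weighting, or how this term is combined with focal modulation.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(p_teacher || p_student) for one sample.

    The expectation is taken under the teacher distribution, so the student
    is penalized wherever the teacher places mass and the student does not.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The divergence is zero when the two distributions match and positive otherwise. It is also asymmetric, which is why the "forward" direction (teacher first) is named explicitly.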

Abstract:
Image restoration aims to remove degradation factors (e.g., blur and snow) from a damaged image and reconstruct a clean image. Although some methods seek solutions in the frequency domain and have proven effective, they still face two challenges: i) degradation blurs cannot be removed well, and ii) the inverse transform in the frequency domain is computationally expensive. To this end, we propose a lightweight and efficient Discrete Cosine Channel Modulation Network (DC2MNet) for recovering images under multiple degradation conditions from the frequency and spatial perspectives. Specifically, we propose a Discrete Cosine Channel Modulation (DCCM) module to extract the most informative lowest-frequency components of features, and subsequently utilize channel modulation to reconstruct the global structure of the corresponding feature, avoiding the inverse transform in high-dimensional spaces. Furthermore, to effectively remove degradation, we propose a Spatial Mask Modulation (SMM) module to suppress degradation blurs in high-frequency features and emphasize local details that are beneficial to image restoration via pixel-level spatial attention. Finally, we embed the DCCM module and SMM module into the Channel Spatial Modulation Block (CSMB) to form the basic component of DC2MNet. Extensive experiments show that DC2MNet achieves SOTA performance on various restoration tasks, including image dehazing, deraining, desnowing and multi-weather restoration. The code and pre-trained models will be open source in this repository.

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai, China; School of Communication and Electronic Engineering, East China Normal University, Shanghai, China; State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; School of Computer Science and Engineering, Nanyang Technological University, Jurong West, Singapore

Abstract:
As multimedia data flourishes on the Internet, quality assessment (QA) of multimedia data becomes paramount for digital media applications. Since multimedia data spans multiple modalities, including audio, image, video, and audio-visual (A/V) content, researchers have developed a range of QA methods to evaluate the quality of data in each modality. However, these methods focus exclusively on single-modality QA, and a unified QA model that can handle diverse media across multiple modalities is still missing, even though such a model would better resemble human perception behaviour and have a wider range of applications. In this paper, we propose the Unified No-reference Quality Assessment model (UNQA) for audio, image, video, and A/V content, which trains a single QA model across different media modalities. To tackle the issue of inconsistent quality scales among different QA databases, we develop a multi-modality strategy to jointly train UNQA on multiple QA databases. Based on the input modality, UNQA selectively extracts the spatial features, motion features, and audio features, and calculates a final quality score via the four corresponding modality regression modules. Compared with existing QA methods, UNQA has two advantages: 1) the multi-modality training strategy makes the QA model learn a more general and robust quality-aware feature representation, as evidenced by the superior performance of UNQA compared to state-of-the-art QA methods; 2) UNQA reduces the number of models required to assess multimedia data across different modalities and is easy to deploy in practical applications. Code is available at https://github.com/charlotte9524/UNQA.

Abstract:
Just Noticeable Difference (JND) refers to the maximum level of distortion in an image or video sequence that remains imperceptible to the Human Visual System (HVS). Current JND-based studies predominantly rely on existing datasets, developing models predicting JND levels in terms of Quantization Parameter (QP) or Quality Factor (QF). However, these solutions primarily focus on spatial-based Perceptual Video Coding (PVC) and neglect temporal-based optimization, which highly affects the video bitrate. This paper addresses this limitation by introducing Just Noticeable frame rate-based Temporal Difference (JNTD) to determine the optimal Frame Rate (FR) based on human perception. A novel dataset comprising 50 high frame rate video sequences is collected through subjective assessments. Subsequently, an ensemble method is proposed to predict the JNTD, by leveraging deep and hand-crafted features, for robust prediction. Experimental evaluations include the integration of the proposed method into several codecs (H.264, H.265, H.266, and a new learned codec), showcasing its ability to reduce bitrate without compromising visual quality.

Abstract:
Event cameras offer great potential for high-temporal-resolution (HTR) motion estimation in dynamic real-world scenarios. However, the lack of dense HTR ground truth in real-world datasets prevents fully leveraging the high temporal resolution potential of event cameras. Furthermore, the intrinsic sparsity of event data introduces additional challenges for optimization and supervision. To address these issues, we propose a residual-based paradigm that decomposes HTR optical flow into a global linear component and high-frequency residuals. The residual paradigm effectively mitigates the impact of event sparsity on optimization and is compatible with any low-temporal-resolution (LTR) algorithm. In addition, to bridge the supervision gap caused by the lack of HTR ground truth, we incorporate novel learning strategies. Specifically, we first employ a shared refiner to estimate the residual flows, enabling both LTR supervision and HTR inference. Subsequently, we introduce regional noise to simulate the residual patterns of intermediate flows, facilitating the adaptation from LTR supervision to HTR inference. Additionally, we show that the noise-based strategy supports in-domain self-supervised training. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art accuracy among existing HTR methods, highlighting its effectiveness and superiority. The source code will be publicly available at https://github.com/ZhouQianang/ResFlow
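
The decomposition of HTR flow into a global linear component plus residuals can be pictured for a single pixel as follows. This is a minimal sketch under the assumption that the linear component is the full-interval flow scaled by normalized time; the function names and the per-pixel tuple format are hypothetical, and the paper's exact formulation may differ.

```python
def decompose_flow(htr_flows, timestamps):
    """Split per-timestamp displacements into a linear part and residuals.

    htr_flows[k]: (dx, dy) displacement of one pixel from time 0 to
                  timestamps[k]; timestamps are normalized so the last is 1.0.
    Returns the full-interval flow and one residual per timestamp.
    """
    total = htr_flows[-1]  # flow over the whole interval (the LTR flow)
    residuals = []
    for (dx, dy), t in zip(htr_flows, timestamps):
        # Residual = actual displacement minus the linear interpolation t * total.
        residuals.append((dx - t * total[0], dy - t * total[1]))
    return total, residuals

def reconstruct_flow(total, residuals, timestamps):
    """Rebuild HTR displacements as linear component plus residual."""
    return [(t * total[0] + rx, t * total[1] + ry)
            for (rx, ry), t in zip(residuals, timestamps)]
```

With this split, the residual at the final timestamp is zero by construction, so only the intermediate deviations from constant-velocity motion need to be estimated.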

Abstract:
With the continuous development of flapping-wing flying robot (FWFR) technology, its applications in military reconnaissance and civil monitoring are becoming increasingly widespread. However, FWFRs experience severe image jitters in aerial videos due to their unique movement patterns. These jitters are caused by periodic wing motions that include unstable rotation along the roll axis, periodic oscillations related to wing flapping, and high-frequency mechanical vibrations. These effects significantly degrade video quality and impact subsequent visual perception tasks. To address these challenges, this paper proposes a digital video stabilization method customized for FWFRs. First, an image preprocessing module is employed to deal with the roll-axis jitters, which are caused by factors such as robot turning and crosswind disturbance during FWFR flight. Second, to remove periodic and high-frequency jitters and improve the computational performance of the digital video stabilization method, we design a lightweight motion smoothing network (one that learns to refine noisy motion trajectories into smooth ones) primarily composed of stacked one-dimensional convolutional layers. Leveraging this motion smoothing network, we smooth the original motion trajectories of the video and use image warping to obtain the stabilized video. Finally, extensive video stabilization experiments under various scenarios are conducted using our self-developed FWFR named USTB-Hawk, and the results demonstrate that the proposed method achieves a PSNR of 25.15 and an SSIM of 0.761, outperforming currently employed digital video stabilization methods.
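
To make the idea of smoothing a motion trajectory with one-dimensional convolutions concrete, the sketch below applies a single fixed kernel to a 1D trajectory. It is a non-learned stand-in: the paper's network stacks learned convolutional layers, whereas the kernel here is hand-chosen, and the function name `smooth_trajectory` and the edge-replication padding are assumptions for illustration.

```python
def smooth_trajectory(trajectory, kernel):
    """Apply one 1D convolution (odd-length kernel) with edge replication.

    trajectory: list of per-frame motion values (e.g., accumulated x-shift).
    kernel: list of weights that sum to 1, e.g. [0.25, 0.5, 0.25].
    """
    half = len(kernel) // 2
    n = len(trajectory)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Clamp the index so frames at the ends reuse the border sample.
            idx = min(max(i + j - half, 0), n - 1)
            acc += w * trajectory[idx]
        smoothed.append(acc)
    return smoothed
```

A stabilizer would then warp each frame by the difference between the smoothed and the original trajectory; a learned network replaces the fixed kernel so that periodic flapping jitter can be suppressed more selectively.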

Abstract:
Context-aware Multiple Instance Learning (MIL) is gaining popularity in Whole Slide Image (WSI) classification. Existing methods typically convert instances in a WSI into one-dimensional sequences and learn the long-range contextual dependencies among instances. However, due to the extremely large size of WSIs and the morphological similarities within tissue structures, the enormous number of redundant instances significantly increases computational overhead in the context learning paradigm. Additionally, the rearrangement of instances into one dimension loses the inherent spatial information involved in image patches, further compromising the classification performance of pathological images. Consequently, efficiently modeling contextual dependencies in WSIs remains a crucial challenge. In this paper, we propose a novel Semantic Anchor-based Context-aware Multiple Instance Learning (SeCoMIL) framework. This framework partitions the WSI into a series of regions and encodes the coordinates of instances within these regions to preserve their spatial relationships. Subsequently, SeCoMIL identifies the most representative instances from each region as semantic anchors. By capturing both the local context around these anchors and the global context across different anchors, the framework efficiently summarizes the critical pathological information of the WSI, enabling precise classification. Extensive experiments on four public datasets (CAMELYON16, CAMELYON17, TCGA-NSCLC, and TCGA-RCC) demonstrate the robustness of our method, with superior performance compared to state-of-the-art methods.

Abstract:
With the emergence of the diffusion model, its powerful regression capabilities have significantly boosted the performance for low-light image enhancement. However, the inherent information loss in low-light conditions calls for a deep understanding of scene semantics and structures to effectively recover missing content. Recent advances such as the Segment Anything Model (SAM) provide semantic priors for arbitrary regions through prompt-based object segmentation, which offers rich contextual cues to guide the restoration process. Motivated by this, we propose to incorporate such semantics-aware priors into a generative diffusion framework from three perspectives. Firstly, we propose a novel Context-Aware Understanding Guided Diffusion model (CUGD) for low-light image enhancement. This method utilizes the diffusion technique to model the distribution of images by incorporating contextually aware semantic and structural information for any region. Specifically, regional priors provided by SAM are integrated to guide the diffusion process with awareness of any object or region, enhancing the model’s capability to reason about scene content. Secondly, we design a Context Understanding Injection Encoder (CUIE) module that combines self-attention and cross-attention mechanisms to comprehensively integrate semantic and structural information into enhanced results, thus facilitating a fine-grained understanding and enhancement process. This module serves the diffusion model in generating normal-light images with richer and more semantically consistent details. Lastly, the semantic context regularization loss is introduced into the optimization process, ensuring that the recovered context better aligns with the normal-light semantic distribution. Extensive experiments on various datasets show that the proposed method attains state-of-the-art (SOTA) performance in both full-reference and no-reference evaluation measures. 
The code is released at https://github.com/lingyzhu0101/Diffusion_Image_Enhancement.git

Abstract:
Road defect detection, particularly for potholes and cracks, is a critical component of intelligent transportation systems. Deep learning methods have advanced in this field; however, existing single-network approaches face inherent challenges in addressing the differences between crack orientation sensitivity and pothole scale perception, resulting in either compromised detection accuracy or excessive architectural complexity. To address this limitation, we propose a resonant collaborative network (RCNet) framework with two lightweight specialized networks: Net$_1$, which focuses on orientation-sensitive feature extraction in the spatial domain using a Mamba-based multidirectional perception mechanism, and Net$_2$, which processes macro structural feature aggregation in the frequency domain using graph-wavelet transformation. To achieve effective knowledge transfer between the different networks, we introduce geometric resonance adversarial learning, which combines geometric moment constraints with conditional adversarial mechanisms to dynamically balance structural stability and discriminative capability. We further validate the generalization capability of our approach on four additional datasets. Experimental results demonstrate that the proposed RCNet outperforms state-of-the-art methods by 3.4% and 2.5% in accuracy, while requiring only 27.4% and 18.7% of the model parameters, respectively. The code is available [here].

Abstract:
3D Semantic Scene Completion (3D SSC), aiming to infer complete 3D scene geometry and semantics from partial observations, is crucial for autonomous driving and robotics. Recent progress in SSC has leveraged the Transformer architecture, while UNet has shown strong multi-scale feature aggregation capabilities. Building on these, we propose MVFormer, integrating the strengths of both architectures to predict the dense geometry and semantics of a scene from 2D images. MVFormer utilizes depth information to separate 2D image features, improving object representation at varying distances, and providing rich geometric and semantic information for voxel initialization. We also introduce a UNet-like decoder with a novel Mix-Voxel attention mechanism, seamlessly integrating Transformer into the UNet structure. This decoder uses multi-scale information to guide voxel feature updates, enhancing multi-scale feature capture. Experiments on SemanticKITTI, SSCBench-KITTI-360 and NYUv2 show that MVFormer achieves state-of-the-art performance, with mIoU scores of 15.81%, 18.94% and 32.96% respectively, while significantly reducing the computational complexity of the model. The code will be available at https://github.com/Ricardovvvv/MVFormer

Abstract:
In recent years, the rapid advancement of multimodal large language models has propelled the development of video-based conversation models. Due to their exceptional video understanding capabilities, there is often an expectation that these models can handle all video-related tasks, including action recognition. However, action recognition datasets typically lack semantic information, which limits the performance of dialogue models. Additionally, as these dialogue models are designed for video understanding, they frequently overlook the critical information required for action recognition, namely continuous motion, in their model architecture and training dataset configurations. To address these challenges, we first propose a novel two-step mapping framework based on large language models, termed “Vision-Semantics-Label” mapping, to better adapt video-based large language models for action recognition. In the first step, we propose a visual-skeletal collaborative learning large language model (VS-LLM), which utilizes human keypoints to compensate for the missing motion details without increasing the input token length of the large language model. In the second step, we design two mapping methods, verb noun match (VN-Match) and all text match (ALL-Match), which can effectively extract relevant action descriptions from the text. Finally, we construct semantic action recognition datasets to ensure that the training data inherently contains action details, enabling the model to better achieve action recognition. We evaluate our approach on five benchmark datasets, demonstrating the state-of-the-art performance of large language models in action recognition. The source code and dataset are publicly available at https://github.com/xiaoyu92568/VS-LLM.

Abstract:
Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions that are consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution; but this trade-off often leads to artifacts that introduce ambiguity in information-critical applications such as identifying digits or letters. On the other hand, diffusion models generate a diverse set of SR images; but now selecting the most trustworthy solution out of this set becomes a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to evaluate semantic correctness, visual quality, and the presence of artifacts. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS)–a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity using CLIP embeddings, structural integrity via SSIM on edge maps, and artifact sensitivity measured through a multi-level wavelet decomposition. We empirically demonstrate that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR, LPIPS, and DISTS–which fail to reflect information fidelity–our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning model outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR tasks.

Abstract:
Cross-View Geo-Localization is essential for drone visual localization and navigation, and aims at establishing correlation between images collected by uncrewed aerial vehicle (UAV) and satellite platforms in the same geographic area. Drastic changes in the drone’s viewpoints pose a significant challenge for methods based on image representation mining. Previous studies attempt to learn fine-grained image appearance features from various perspectives; however, they tend to underutilize the various state information of the UAV. This paper proposes a novel multimodal framework, CGSI (Context-Guided and UAV’s Status Informed), which leverages UAV state textual descriptions to mitigate scene bias caused by viewpoint differences. The following two issues are addressed to achieve more accurate and reliable multimodal geo-localization: 1) The domain gap across different datasets caused by the fixed UAV altitudes. We propose a Context-Guided Multimodal Tokenizer, which learns contextual vectors from multi-altitude visual features and utilizes them as adaptive text tokens. 2) Multimodal features are susceptible to state-feature ambiguity. We propose a Drone Group Graph Attention method to enhance the association between UAV visual features with the same location ID but different states and exploit the intrinsic relationships to extract discriminative multimodal features. Extensive experiments on the University-1652 and SUES benchmarks demonstrate that our CGSI significantly outperforms existing algorithms, achieving state-of-the-art performance. The substantial improvements observed in cross-region ablation experiments further showcase the superior domain generalization capability of our method.

Abstract:
With the popularity of image editing techniques, synthetic images may have inharmonious regions due to color/illumination differences between the manipulated area and the background. The inharmonious region localization task aims to find these regions, which is crucial for blind image harmonization. Existing methods rely on single-view images and do not fully explore multi-scale fusion, which limits their performance. To address these issues, in this paper, we propose a novel multi-view perception and shrinkage aggregation network (MSANet) for the inharmonious region localization task that fully utilizes multi-view images and multi-scale fusion information and can mine subtle cues between candidate objects and the background. Specifically, we first design a multi-view ensemble encoder to fully perceive the inharmonious regions by multi-view interactive learning and then aggregate the feature representations of inharmonious regions. Moreover, we propose a multi-scale shrinkage fusion decoder, where multi-scale features with multi-view prior information are utilized to aggregate adjacent features, adaptively select high-quality information, reduce background interference and gradually locate inharmonious regions. Extensive experimental results on four public datasets (HAdobe5k, HCOCO, HFlickr, and Hday2night) demonstrate that the proposed MSANet can outperform all the SOTA methods in terms of average F1 and average IoU score, while maintaining a lower computational cost.

Affiliations: DUT-RU International School of Information Science and Engineering, Dalian University of Technology, Dalian, China; Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong; School of Computer Science, Northwestern Polytechnical University, Xi’an, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; College of Information Science and Technology, Dalian Maritime University, Dalian, China; School of Software Technology, Dalian University of Technology, Dalian, China

Abstract:
In the realm of 3D scene modeling and rendering, the emergence of Neural Radiance Fields (NeRF) represents a significant leap forward. However, NeRF’s rendering performance suffers significantly when rendering images under low-light conditions. Existing approaches enhance the low-light input images and then combine them with NeRF models, but still fail to address the issues of multiview consistency and image quality. To address these challenges, our research introduces a textual constraint-prompted enhancement method that facilitates low-light image brightening and new view synthesis in an unsupervised manner. Specifically, we devise a semantic calibration strategy that employs positive and negative prompts to motivate and penalize the network towards attributes associated with high-quality images and exploits the capability of visual language models in semantic parsing to align the generated images with textual descriptors to improve image generation quality. In addition, to address the multiview consistency problem, we propose a two-layer optimization strategy, where the semantic cue optimization in the upper layer and the new view generation in the lower layer interact with each other to achieve a balance between luminance consistency and structural integrity by combining these improved images with text-driven semantic features. Comprehensive tests on two datasets with different resolutions, LOM and LLFF, show that our approach outperforms existing methods, achieving state-of-the-art brightness and clarity on low-light images while preserving the natural appearance and details.

Affiliations: Department of Precision Instrument, State Key Laboratory of Precision Measurement Technology and Instruments, Tsinghua University, Beijing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Artificial Intelligence and Robotics, Hunan University, Changsha, Hunan, China; State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan, China

Abstract:
Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To help URA-Net learn to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, demonstrate the effectiveness and superiority of the proposed method.
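
The final step of the abstract above, scoring anomalies from the residual between input and restored features, can be illustrated with a minimal sketch. It assumes the residual is a per-location Euclidean distance between feature vectors and that the image-level score is the maximum over locations; the abstract does not state either choice, and the function names are hypothetical.

```python
import math


def residual_map(input_feats, restored_feats):
    """Per-location L2 distance between input and restored feature vectors.

    Both arguments are H x W grids (nested lists) whose cells hold
    feature vectors of equal length. Assumed scoring rule, not URA-Net's.
    """
    return [
        [math.dist(f_in, f_out) for f_in, f_out in zip(row_in, row_out)]
        for row_in, row_out in zip(input_feats, restored_feats)
    ]


def image_score(anomaly_map):
    """Image-level anomaly score: the largest per-location residual."""
    return max(max(row) for row in anomaly_map)
```

Locations where restoration changed the features most receive the highest scores, which is what allows the same map to serve for both detection (image level) and localization (pixel level).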

Abstract:
In recent years, rate control (RC) for neural video coding (NVC) has become an active research area. However, existing RC methods in NVC neglect the actual rate-distortion (R-D) characteristics and lack dedicated optimization strategies for intra and inter modes, leading to significant bit rate errors. To address these issues, we propose a high accuracy RC method for NVC based on R-D modeling, which integrates intra frame RC, inter frame RC and bit allocation. Specifically, the rate-quantization parameter (R-Q) model and R-D model are established for both intra frame and inter frame in NVC. To derive the model parameters, intra frame parameters are estimated using high dimensional features, while inter frame parameters are derived using gradient descent based model update methods. Based on the proposed R-Q model, intra frame and inter frame RC methods are proposed to determine the quantization parameters (QP). Meanwhile, a bit allocation method is developed based on the derived R-D models to allocate bits for the intra frame and inter frame. Extensive experiments demonstrate that, benefiting from the accurate R-Q models derived by the proposed approach, highly accurate RC is achieved with only 0.56% average bit rate error. Compared with other methods, the proposed method reduces the average bit rate error by more than 4.18%, and achieves over 8.94% Bjøntegaard Delta Rate savings.
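
For readers unfamiliar with the metric, the bit rate error quoted above is commonly computed as the relative deviation of the achieved bit rate from the target. The sketch below assumes that standard definition, which the abstract does not spell out; the function names and the per-sequence averaging are illustrative assumptions.

```python
def bitrate_error_percent(target_kbps: float, actual_kbps: float) -> float:
    """Relative bit rate error in percent: |actual - target| / target * 100."""
    if target_kbps <= 0:
        raise ValueError("target bit rate must be positive")
    return abs(actual_kbps - target_kbps) / target_kbps * 100.0


def average_bitrate_error(pairs):
    """Mean relative error over (target, actual) pairs, e.g. one per sequence."""
    errors = [bitrate_error_percent(t, a) for t, a in pairs]
    return sum(errors) / len(errors)
```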

Abstract:
In RGB and thermal (RGB-T) modalities fusion tracking, the multi-feature responses of each modality contain rich consistency in object localization, which is crucial to enhance tracking robustness. However, existing decision-level fusion paradigms mostly focus on fusing the output of the last layer, ignoring the correlation between multi-feature responses. Moreover, they also lack consideration of tracking failure, which hinders the application of RGB-T tracking in complex environments. To this end, this paper proposes a multi-feature response adaptive fusion model and a dominant-auxiliary dynamic selection recovery mechanism. Specifically, the former achieves joint optimal fusion by mining the correlation between multi-feature responses. The latter flexibly switches between short-term and long-term tracking modes according to the reliability of tracking results, and utilizes the most reliable modality to further improve tracking stability. Experiments on five prevalent RGB-T tracking benchmarks demonstrate the competitive performance of our method compared with the state-of-the-art methods.

Abstract:
Temporal Action Detection (TAD) aims to identify action boundaries and their corresponding categories in untrimmed videos, playing a crucial role in long-video understanding. Prior works often struggle to balance the trade-off between capturing long-range dependencies and ensuring computational efficiency. Recently, the state space model Mamba has exhibited impressive capabilities and efficiency in long-term sequence modeling. However, current methods based on Mamba generally lack a unified framework to simultaneously address the redundancy of long-duration actions and the boundary sensitivity of short-duration actions—limitations that largely stem from Mamba’s reliance on limited state representations and its unidirectional modeling. To tackle the aforementioned challenges, we propose DilatedTAD, a novel TAD framework with an expanded receptive field. DilatedTAD leverages the Inter-Parallel DIM component (InterDIM) to integrate multi-scale temporal information, enabling a better trade-off between short-duration and long-duration action detection. InterDIM is built upon our proposed Dilated Mamba (DIM), where multiple DIM branches with different dilation rates are designed to focus on actions of varying durations. Specifically, DIM introduces a novel use of dilation to skip redundant temporal information, thereby enhancing the model’s focus on crucial boundary features. Additionally, a bidirectional modeling design is adopted in DIM to compensate for the lack of future temporal context in the original Mamba architecture. Extensive experiments show that DilatedTAD outperforms state-of-the-art methods on multiple datasets, achieving mAPs of 74.9% (THUMOS14), 42.90% (ActivityNet 1.3), 45.0% (HACS), and 26.3% and 24.3% (EPIC-Kitchens 100). Our code will be publicly available.
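
To make the role of dilation concrete, the sketch below shows generic dilated temporal sampling: a larger dilation rate skips intermediate frames and widens the temporal span covered by the same number of taps. This is the textbook dilated-convolution picture, not the DIM module itself, whose internals the abstract does not describe; both function names are hypothetical.

```python
def dilated_taps(center: int, kernel_size: int, dilation: int) -> list:
    """Frame indices read by a centred temporal kernel with the given dilation."""
    half = kernel_size // 2
    return [center + (i - half) * dilation for i in range(kernel_size)]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Temporal span (in frames) covered by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1
```

With three taps, a dilation of 1 reads adjacent frames, while a dilation of 4 reads every fourth frame and spans nine frames, which is the sense in which branches with different rates can attend to actions of different durations.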

Abstract:
Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. We propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4 to 6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views. More results can be found on our project page https://gbrnvs.github.io

Abstract:
Point cloud registration is a crucial task in the field of 3D processing research, which aims to align two or more point cloud scans into the same coordinate system. A significant factor limiting the performance of point cloud registration is the low proportion of inlier correspondences between two unaligned point clouds. This is particularly pronounced when the overlap between two scenes is low. Based on this observation, we propose a novel point cloud registration framework that enhances the proportion of correct correspondences via two aspects: extracting richer global geometric information for accurate identification of overlapping regions, and rejecting outliers based on spatial feature consistency. During the feature extraction phase, we first encode local geometry utilizing the Point Pair Features and then propose the Dual Graph Convolution module to reshape the receptive field, thereby expanding perception beyond small local areas. In the transformation estimation phase, we design a filtering module based on a multi-layer decoder. We extract point cloud features at different resolutions and select high-confidence point cloud pairs for registration based on the consistency of correspondences. We test the performance of our method on four datasets (3DMatch, ScanNet, KITTI, and MVP-RG). Compared with the state-of-the-art approach NMCT, our method achieves improvements of 6% / 38% on KITTI/MVP-RG. Additionally, our filtering approach enhances the operational speed of RANSAC by more than 300%. Code is available at https://github.com/xiwanghuolight/RefreshReg
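
The "proportion of inlier correspondences" that the abstract identifies as the limiting factor is usually measured against a reference rigid transform. The following is a minimal sketch under that assumption: a correspondence counts as an inlier when the transformed source point lies within a distance threshold of its matched target point. The helper names and the threshold are illustrative, not taken from the paper.

```python
import math


def apply_rigid(R, t, p):
    """Apply a rigid transform (3x3 rotation R, translation t) to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def inlier_ratio(correspondences, R, t, threshold):
    """Fraction of (source, target) pairs aligned to within `threshold`.

    R and t are a reference transform (e.g. ground truth); this is an
    assumed evaluation convention, not the paper's filtering module.
    """
    if not correspondences:
        return 0.0
    inliers = sum(
        1
        for src, tgt in correspondences
        if math.dist(apply_rigid(R, t, src), tgt) <= threshold
    )
    return inliers / len(correspondences)
```

A low ratio means most putative matches are wrong, which is why estimators such as RANSAC need many more iterations on low-overlap scenes and why pre-filtering correspondences can speed them up.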

Abstract:
Indiscernible marine object counting refers to the counting of marine objects that are visually blended with their surrounding environment. This task encounters critical challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. To address the scarcity of video-based indiscernible object counting datasets, we have established a new dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in marine object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experiments demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net

Abstract:
Video coding for machines (VCM) is an emerging approach in video compression designed to optimize content for machine analysis tasks. Although VCM was initially developed for machine vision, scalable coding frameworks have been developed to support both machine-driven analysis and human viewing as required. In this work, we focus on scenarios that require high-quality encoding of regions of interest (ROIs) for machine vision and low-bitrate encoding of the background (BG) for human vision. At the decoder, severely degraded BG quality in reconstructed frames makes them unsuitable for viewing; therefore, restoring the degraded BGs by leveraging high-quality ROIs is essential. To this end, we propose the Gradient-Guided Diffusion Restoration (GGDR) algorithm, which integrates a pretrained generative diffusion model with content-aware supervision and adaptive refinement mechanisms to restore severely degraded regions robustly while maintaining visual consistency across the entire frame. The GGDR algorithm consists of two key components: (i) a content-aware supervision mechanism that preserves salient features and structural information in the input image, ensuring superior performance even with challenging high-variance inputs and (ii) a refinement block that guides the generation process of the pretrained diffusion model based on a degradation model and structural guidance. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art algorithms both qualitatively and quantitatively.

Affiliations: School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China; Department of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China; Computer Information Systems Department, State University of New York at Buffalo State, Buffalo, USA; Department of Civil and Environmental Engineering, State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, SAR, China; Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB, Canada

Abstract:
Visible light and thermal infrared tracking combines the characteristics of visible light and thermal infrared modalities to achieve robust target tracking in all-weather and all-day scenarios. However, most existing visible light and thermal infrared tracking methods rely on either full fine-tuning or attention mechanisms, which introduce a large number of parameters and are predominantly influenced by the visible modality. This results in challenges such as high computational complexity, slower processing speeds, and limited exploitation of multimodal information. To address these issues, this paper proposes a lightweight multimodal tracking model based on feature fusion and enhancement. The model consists of a feature fusion adapter and a joint enhancement adapter, designed to integrate and refine information across modalities. It employs a dual-stream transformer encoder with shared parameters across modality branches, utilizing a frozen pre-trained foundation model to independently extract features from visible light and thermal infrared inputs. The lightweight fusion adapter combines modality-specific information, while the joint enhancement adapter refines unimodal features, introducing only 0.23M trainable parameters. Experimental results on the LasHeR benchmark demonstrate that the proposed method outperforms prompt learning and other adapter-based methods, achieving a 4.4% improvement in PR and a 3.3% increase in SR while maintaining computational efficiency. With a real-time inference speed of 28.60 FPS, the proposed method balances accuracy and efficiency effectively. The source code will be available at https://github.com/hu-xue/MFJA

Abstract:
Text-to-image person re-identification (TIReID) aims to retrieve the target pedestrians according to specific textual descriptions. Benefiting from abundant annotated training data, current supervised TIReID methods have achieved impressive performance. However, annotating cross-modality data is extremely time-consuming, which limits their application in real-world scenarios. Several methods attempt to generate text descriptions or pseudo-labels but neglect the dependability of image-text matching relationships or identity information. To this end, we propose a Dependability Feature Learning based on Sample Generation (DFLSG) for unsupervised TIReID. First, we introduce a dependable text generation method that leverages multimodal large language models to generate diverse texts and further filtrate dependable texts for establishing image-text matching relationships. Second, we design an Error Sample Filtering Module (ESFM) to eliminate abnormal samples and obtain reliable identity labels. Furthermore, we develop a Multilevel Triplet Joint Learning (MTJL) process, which continuously optimizes the cross-modality dependable feature from center and instance views. Extensive experiments are implemented to assess the proposed DFLSG on four mainstream TIReID databases. Experimental results demonstrate that DFLSG achieves state-of-the-art performance compared with other unsupervised methods. Code will be available at: https://github.com/CLS-2001/DFLSG

Abstract:
Existing approaches to multi-person pose tracking often suffer from low-confidence detections due to inter-instance and intra-instance occlusions, as well as non-canonical poses. In this work, we propose a novel solution by addressing two critical aspects: incomplete joint temporal dependencies and spatio-temporal voxelization. First, we introduce a method for extracting hierarchical relationships between joints based on human dynamics, enabling the model to reason about occlusions within the spatial topology of the human body. This hierarchical approach tackles incomplete joint visibility by leveraging the interdependencies between joints in both space and time. Second, we present a spatio-temporal occupancy network for multi-person pose tracking. By stacking 2D pose data over time to create a spatio-temporal voxel grid, the model captures temporal relationships between instances and joints, enhancing spatio-temporal correlations and learning keypoint distributions under occlusions or non-canonical poses. Extensive experiments on the PoseTrack2017, PoseTrack2018, and PoseTrack21 datasets demonstrate that our method improves multi-person pose tracking performance, achieving state-of-the-art mAP.
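
The idea of stacking 2D poses over time into a spatio-temporal voxel grid can be sketched as follows. This is a minimal illustration assuming binary occupancy and one grid cell per pixel; the resolution, encoding, and function name are assumptions, since the abstract does not give these details.

```python
import math


def voxelize_poses(frames, height, width):
    """Build a binary occupancy grid indexed as grid[t][row][col].

    `frames` is a list over time; each entry is a list of (x, y) keypoints
    in pixel coordinates. Keypoints outside the image are ignored.
    """
    grid = [[[0] * width for _ in range(height)] for _ in frames]
    for t, keypoints in enumerate(frames):
        for x, y in keypoints:
            col, row = math.floor(x), math.floor(y)
            if 0 <= row < height and 0 <= col < width:
                grid[t][row][col] = 1
    return grid
```

Because time becomes the first axis of the grid, a network operating on it can relate a joint at one frame to nearby cells in neighbouring frames, even when the joint is missing in between.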

Abstract:
Domain generalization-based hyperspectral image classification methods have achieved promising results in recent years. However, these studies seldom consider the small-sample problem in the source domain. In practical applications, manually annotating hyperspectral images is difficult, so labeled samples in the source domain may be scarce. Existing models have limited feature extraction capability and poor generalization performance in scenarios with limited labeled samples. To address the limitations of existing methods on small-sample source-domain data, a novel approach, Progressive Multiscale Generator for Domain Generalization (PMGDG), is proposed in this paper. The PMGDG employs a progressive multiscale generator comprising a series of sub-generators with paired sub-discriminators. The channel dimension of generated samples grows gradually from the first layer to the last layer. Then, the classifier network is trained on both the original samples and the generated samples with different distributions to enhance its generalization performance. Additionally, we introduce a hierarchical optimization approach to stabilize the training process. Extensive experiments are conducted on three public hyperspectral image cross-domain datasets: Houston, Pavia, and HyRANK. The experimental results demonstrate that, compared to existing domain generalization methods for hyperspectral image classification, the proposed approach significantly improves classification performance in small-sample settings. The code is available from the website: https://github.com/adwfdawd/PMGDG

Abstract:
Zero-shot captioning aims to describe visual content without additional paired image-text data by leveraging the potential of Visual Language Models (VLMs). Although text-only training allows the model to leverage large-scale textual knowledge, current approaches suffer from two major challenges: (1) the modality gap between text-only training and image-based inference, and (2) catastrophic forgetting when adapting to new text domains. In this paper, we present a novel Continual Zero-shot Captioning framework (CZC), which contains two key components: Retrieval-augmented Pseudo-image Guided Alignment (RPGA) and Text domain-aware Memory Recall (TMR). RPGA synthesizes pseudo visuals to bridge the modality gap and performs retrieval-augmented generation. The synthetic visuals serve as cross-modal anchors in the CZC where real unseen visuals are unavailable during training, while retrieval-augmented generation enriches them with additional semantic cues to produce more informative conditional prompts. TMR mitigates catastrophic forgetting through text domain-aware parameter-efficient fine-tuning with adaptive weight replay. It selectively recalls previously learned text-domain knowledge relevant to the input images, achieving stability on previous tasks and plasticity for new tasks. Extensive experiments on the ZCCL demonstrate that CZC effectively bridges the modality gap between training and inference and enables zero-shot captioning under cross-task continual learning scenarios. Particularly, it achieves up to +7.6% and +19.8% relative CIDEr improvements over state-of-the-art baselines on UCM-Captions and Sydney-Captions, respectively, while maintaining strong performance on previously learned tasks.

Abstract:
In Cross-Domain Few-Shot Learning (CD-FSL), models are required to identify novel classes while addressing domain discrepancies caused by visual style variations. Simple style transformations often fail to extend beyond the source domain’s distribution, and unrepresentative support samples in the target task may lead to ambiguous or biased decision boundaries. To address these challenges, a Style-Guided Source Data Augmentation and Target Feature Optimization (SSDATFO) approach is proposed. Specifically, Style-Guided Source Data Augmentation is introduced, employing Style Transformation and Source Data Augmentation techniques to create more challenging source data, thereby expanding the source domain’s style distribution. Target Feature Optimization is subsequently introduced, comprising two distinct modules. The Domain Attention Shift Transformation enhances low-magnitude feature channels, thereby reactivating target domain feature channels previously overlooked by the source domain-trained feature extractor. Additionally, the Task Category Differentiation Enhancement Transformation calibrates the features of support samples and eliminates the commonality component along both the task-specific and inter-class commonality directions for all features within the novel task, thereby acquiring more discriminative features. Extensive experiments on eight distinct target datasets demonstrate the efficacy of the proposed method, while comprehensive ablation studies and detailed visualization experiments elucidate its nuanced and compelling aspects.

Abstract:
Open-World Object Detection (OWOD) aims to detect unseen objects as “unknown” while incrementally learning them without catastrophic forgetting. This problem presents two major challenges: 1) the lack of annotations for unknown objects during training, and 2) the risk of catastrophic forgetting during model updates. To address these issues, we propose the COmpositional and Bidirectional low-Rank Adaptive open-world detection transformer (COBRA), a novel framework built upon a pre-trained Deformable DETR model. Specifically, COBRA first employs an attentional filtering mechanism that prunes previously known (P-Known) and currently known (C-Known) objects, yielding a purified set of candidate unknowns. To systematically pseudo-label these unknowns, we introduce a Primitive Composition Recognition (PCR) module, which evaluates set-level similarity between candidate objects and learned primitives, enabling accurate labeling of pseudo-unknowns. To mitigate catastrophic forgetting during incremental updates, COBRA leverages Bidirectional Low-Rank Adaptation (Bi-LoRA), a parameter-efficient mechanism that supports forward knowledge transfer and stable backward integration. Together, these components form a synergistic pipeline for continual object discovery and knowledge consolidation. Extensive experiments on MS COCO and PASCAL VOC demonstrate that our rehearsal-free COBRA framework outperforms SAM-powered methods in unknown recall while achieving lower forgetting compared to rehearsal-based competitors.
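
To illustrate why low-rank adaptation is parameter-efficient, the following minimal sketch applies a standard LoRA-style update, W' = W + (alpha / r) * B * A, to a small frozen weight matrix. It does not reproduce the bidirectional mechanism of Bi-LoRA, which the abstract does not specify; the function names, `alpha`, and the matrix sizes are assumptions made only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_update(w, b, a, alpha=1.0):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, k); B has shape (d, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Parameter counts as (full fine-tuning, low-rank adapter)."""
    return d * k, r * (d + k)


# Toy example (hypothetical sizes): a frozen 2x3 weight and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, -0.5]]
w_adapted = lora_update(w, b, a, alpha=1.0)
full, low_rank = trainable_params(256, 256, 4)
```

For a 256 x 256 layer and rank 4, the adapter trains 2,048 parameters instead of 65,536, which is the kind of saving that makes per-task adapters practical in incremental settings.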

Affiliations: School of Information Science and Technology, the Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, Beijing Laboratory of Smart Environmental Protection, Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing Artificial Intelligence Institute, Beijing University of Technology, Beijing, China; School of Computer Science and Technology, Ocean University of China, Qingdao, China; Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia; Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China; College of Computing and Data Science, Nanyang Technological University, Singapore, Singapore; Department of Computing and Decision Science, Lingnan University, Tuen Mun, SAR, Hong Kong

Abstract:
Due to inadequate monitoring, key pollutants (e.g., PM2.5 and VOCs) can leak into the atmosphere and endanger both the short-term and long-term safety of people who work and live in the affected environment. It is therefore imperative to detect leakage of industrial waste gas effectively and efficiently, so that the risk of pollution and explosions can be lowered in time. To this end, we propose a new lightweight deep and wide network (LdwNet) for detecting the leakage of industrial waste gas from an image, which brings two main merits: 1) it compensates for the deficiencies of sensor-based detection methods, which can accurately detect the leakage of waste gas and even measure its concentration but require leakage sources to be located beforehand; 2) it overcomes the shortcomings of image-based detection methods, which rely on DNN-based recognition technologies and usually suffer from low efficacy, low efficiency, and high energy consumption during model training and inference. Specifically, the proposed LdwNet is developed by simulating human perception, motivated by the way human observers detect and judge the leakage of industrial waste gas in surveillance images. First, inspired by the high sensitivity of human eyes to horizontal and vertical stimuli, we construct a novel lightweight parallel-series-stripe (PS2) module that extracts features effectively with very few parameters. Second, to fully exploit deep and shallow features for fusing global and local information, we extend the PS2 module as a backbone along both the deep and wide directions to build a multi-channel network. Third, to achieve effective, efficient, and low-carbon detection at run time, we constrain the extended PS2 modules with parameter sharing, which greatly reduces the number of model parameters and makes the proposed model ultra-lightweight. Experiments on datasets of carbon particulate matter and ethylene leakage show that our LdwNet, with ten thousand parameters, outperforms state-of-the-art models with millions of parameters in detection accuracy and implementation cost, making it more suitable for real industrial applications.

Abstract:
Multi-modal crisis event detection is critical for timely situational awareness across diverse real-world emergencies. In practice, however, multi-modal data—typically composed of images and textual reports—are often incomplete, with many instances providing only a single modality because of data loss, platform constraints, or real-time limitations. This modality-incomplete setting poses three key challenges: 1) how to reliably reconstruct missing modalities to restore cross-modal context, 2) how to extract deep semantic clues across original and completed data, and 3) how to bridge the distributional gap between reconstructed and fully-observed samples during model training. To tackle these issues, we propose M3Former, a unified framework tailored for modality-incomplete multi-modal crisis event detection. It consists of four dedicated modules: Memory-Guided Modality Completion builds a memory bank of paired image–text data to retrieve semantically related samples and keywords, guiding powerful pretrained generators—diffusion models for image synthesis and multi-modal large language models for text generation; Crisis-Aware Heterogeneous Dual-Stream Encoders jointly capture modality-specific cues and establish initial cross-modal semantic alignment between original and completed data; Hierarchical Attention Refinement Network progressively refines representations through hierarchical attention and guided cross-modal interaction to suppress noise and semantic drift; Modality-aware Routing Experts designs a gated mixture-of-experts architecture that dynamically selects both modality-specific and shared experts to mitigate distributional shifts and enhance the quality of fused representations. Extensive experiments on two real-world crisis datasets demonstrate that M3Former significantly outperforms existing baselines under a variety of modality-incomplete scenarios. The code is available: https://github.com/lcygky/M3former
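
The gated mixture-of-experts idea mentioned above can be sketched generically as follows: a gate scores every expert, the top-k experts are kept, and their outputs are mixed with renormalized gate weights. This is a plain top-k router written under assumptions; it does not distinguish modality-specific from shared experts as M3Former does, and `route`, `top_k`, and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, expert_outputs, top_k=2):
    """Mix the top-k experts, renormalizing their gate probabilities."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for i in chosen:
        weight = probs[i] / norm
        for j in range(dim):
            fused[j] += weight * expert_outputs[i][j]
    return chosen, fused


# Toy example: three experts with 2-D outputs; experts 0 and 2 tie for the top.
chosen, fused = route([2.0, 0.0, 2.0], [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]], top_k=2)
```

Because only the selected experts contribute, the low-scoring expert's output (here `[5.0, 5.0]`) has no influence on the fused representation.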

Abstract:
This paper presents a unified framework for continual learning in Audio-Visual Question Answering (AVQACL), designed to enhance fine-grained scene understanding and spatio-temporal reasoning in dynamic multimodal task environments. To simulate realistic incremental learning scenarios, we construct two large-scale benchmark datasets, Split-AVQA and Split-MUSIC-AVQA, by reorganizing existing AVQA corpora into sequential tasks. Empirical results show that conventional models suffer from severe performance degradation and catastrophic forgetting when learning across audio, visual, and textual modalities in a continual setup. To address these challenges, we propose a novel approach that integrates four key modules: 1) Question-Guided Cross-modal Information Fusion (QCIF), which dynamically extracts task-relevant multimodal features via question-aware attention; 2) Task-specific Knowledge Distillation with Spatial-Temporal Feature Constraints (TKD-STFC), which preserves semantic output behavior and internal reasoning trajectories across tasks; 3) Question Semantic Consistency Constraint (QSCC), which regularizes evolving question representations to maintain linguistic stability; and 4) Dual-Strategy Exemplar Selection (DSES), a memory-efficient replay strategy that jointly maximizes sample informativeness and diversity. All components are theoretically grounded, and formal analysis is provided to ensure modality alignment, spatial-temporal coherence, and exemplar selection reliability. Extensive experiments on both datasets demonstrate that our method consistently outperforms prior state-of-the-art AVQACL baselines in terms of accuracy, retention, and robustness.
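
As a generic illustration of output-level knowledge distillation in a continual setting, the sketch below computes a temperature-softened KL divergence between the logits of a previous-task model (teacher) and the current model (student). It is a textbook distillation loss, not the TKD-STFC module itself, whose spatial-temporal feature constraints are not detailed in the abstract; the temperature value is an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drifted = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's outputs and grows as the student drifts, which is what penalizes forgetting of earlier tasks.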

Abstract:
Image forgery localization is of great importance for passive image forensic tasks, yet existing methods often struggle to capture either semantic inconsistencies or local forgery artifacts from macroscopic and microscopic perspectives, thus somewhat restraining their generalization ability. To tackle this, we propose F2Mamba, a general framework that combines Forgery-guided global perception with Frequency-driven fine-grained modeling. Specifically, we design a Forgery-Guided Vision Mamba (FG-Mamba) encoder, which enhances Mamba’s perception of forgery patterns by adopting a Forgery-Aware Masking (FAM) module. FAM integrates edge gradients and a learnable attention pathway to guide the model toward forgery artifacts rather than semantic contents, enabling generalized macroscopic feature extraction. To capture microscopic forgery traces, we introduce a Multi-scale Adaptive Frequency Perception (MAFP) module that adaptively decomposes features into high- and low-frequency components to highlight fine-grained forgery details, which are subsequently fused into a unified frequency representation for accurate localization. Extensive experiments show that F2Mamba outperforms state-of-the-art methods in cross-dataset evaluations.
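
The idea of splitting features into low- and high-frequency components can be shown on a 1-D signal with a fixed filter. The sketch below uses a centered moving average as the low-pass branch and the residual as the high-pass branch; MAFP is described as adaptive and multi-scale, so this fixed-window version is only a simplified stand-in, and `split_frequencies` and `window` are hypothetical names.

```python
def split_frequencies(signal, window=3):
    """Split a 1-D signal into low- and high-frequency parts.

    The low-frequency part is a centered moving average (edges use only the
    samples that exist); the high-frequency part is the residual.
    """
    half = window // 2
    low = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[lo:hi]) / (hi - lo))
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# A flat signal with one sharp spike, loosely analogous to a local artifact.
signal = [1.0, 1.0, 4.0, 1.0, 1.0]
low, high = split_frequencies(signal, window=3)
```

The spike dominates the high-frequency residual while the smooth background stays in the low-frequency part, and summing the two parts recovers the input exactly.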

Abstract:
Recent lightweight salient object detection (SOD) models for strip steel surface defects typically adopt a classification model as the encoder to extract semantic features, followed by a task-specific decoder for defect detection. Despite achieving promising results, these models often overlook the potential of leveraging class-level knowledge to enhance detection performance. In this paper, we propose a Class Knowledge-Guided Lightweight Network (CKLNet) for salient object detection of strip steel surface defects. In particular, CKLNet employs a three-stage training strategy. In the first stage, a general SOD model is trained to perform initial detection of surface defects. Then, the second stage constructs a classification branch on top of the general SOD model to extract defect class knowledge. Building on this class knowledge, multiple class-specific decoders are trained in the third stage to generate refined and class-aware defect predictions. Extensive experiments on two datasets demonstrate that CKLNet achieves superior detection performance and generalization capability compared with cutting-edge lightweight models, with a real-time inference speed of 783 FPS on an RTX 2080Ti GPU. Moreover, unlike existing models that only enable defect localization, our CKLNet further provides accurate defect class identification. The source code is publicly available at https://github.com/Kunye-Shen/CKLNet

Abstract:
Online visual object tracking fundamentally constitutes a continual learning challenge, demanding persistent adaptation to target variations within dynamic video streams, while preserving critical features to prevent forgetting. Existing template-based trackers are particularly vulnerable to severe target deformation and drastic background changes in real-world scenarios. Recent explorations enhance adaptability through local update strategies—such as dynamic templates, appearance tokens, and parameter fine-tuning—to address rapid appearance variations. However, these methods inherently propagate target appearance changes without explicit modelling and prioritise short-term adaptation over long-term global representation. Consequently, they fail to balance initial target features with current observations, leading to tracking failure in long-term scenarios. To overcome these limitations, we propose DG-Track, an adaptive continual learning framework leveraging Grassmannian manifold geometry. Specifically, we represent target appearance within a Grassmannian affine subspace and perform continual adaptation via incremental learning. Compared to Euclidean geometry, the Grassmannian manifold captures nonlinear appearance representations, yielding more compact and geometrically consistent temporal dynamics modelling. Furthermore, an adaptive forgetting module dynamically regulates the interplay between the current observed subspace and the initial template, ensuring stable long-term tracking. DG-Track is a plug-and-play solution for online tracking, adding no learnable parameters. Comprehensive experiments with diverse baseline trackers on LaSOT, GOT-10k, TrackingNet, and UAV123 validate the efficacy of our continual learning framework and Grassmannian manifold geometry in enhancing visual object tracking performance. Code is available at https://github.com/xiaoqing0825/DGTrack

Abstract:
Although accurate head pose estimation is critical for natural human–computer interaction, it remains challenging due to occlusion, extreme poses, varying illumination conditions, and data ambiguity. To address these challenges, a novel morphology-aware Transformer framework (MHPE) is proposed, which can learn morphological relationships during facial rotation. The methodology is based on two key findings: cross-region geometric dependencies and angle-specific morphodynamic representations. The proposed framework incorporates two key components: adversarial feature generation, which generates robust rotation representations through adaptive multi-scale feature interaction; and morphology relationship inference, which establishes long-range dependencies between facial features through a cross-modal attention mechanism that incorporates morphological priors. Extensive evaluations on three demanding benchmarks (BIWI, AFLW2000, and 300W-LP) demonstrate state-of-the-art performance, particularly in challenging scenarios. The Python implementation will be available on request to facilitate reproducibility.

Abstract:
Postural instability, one of the primary symptoms of Parkinson’s disease, reflects the severity of muscle rigidity and bradykinesia and is typically assessed via the Movement Disorder Society-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS). However, the scarcity of qualified physicians and the subjective variability in clinical scoring underscore the necessity for developing automated assessment systems that provide reliable and objective diagnosis of postural instability. Therefore, we propose a causal graph-based stable scoring model with a clinical embedding architecture for multi-class postural stability assessment. This architecture systematically employs counterfactual reasoning, causal invariance constraints, and clinical embedding such that clinically causal features are extracted from video-based skeleton data, and thereby a more stable automated assessment is achieved. Specifically, we initially developed a causal counterfactual suppression strategy to deemphasize non-causal features and suppress their confounding influence. Subsequently, we designed a causal invariance reinforcement strategy that combines causal and non-causal features to reduce pseudo-correlation between postural stability and non-causal features. Finally, we proposed a clinical embedding guidance strategy that directs attention toward clinically important regions to extract salient features. The proposed method for postural stability four-class assessment achieved clinically interpretable results with an accuracy of 75.09% and an acceptable accuracy of 97.50% on the largest clinical dataset of more than 1,000 videos. The effectiveness and robustness of our method were further confirmed on an independent test set, and the wide applicability of this method was further demonstrated on two additional fine-grained recognition tasks of skating and medical gait. 
The proposed method offers an effective, robust, and objective solution for postural stability assessment, with a significant potential for broader applications in other fine-grained action recognition tasks.

Abstract:
The security of AI-generated content (AIGC) detection is crucial for ensuring multimedia content credibility. To enhance detector security, research on adversarial attacks has become essential. However, most existing adversarial attacks focus only on the detection of GAN-generated facial images, struggle to be effective on multi-class natural images and diffusion-based detectors, and exhibit poor invisibility. To fill this gap, we first conduct an in-depth analysis of the vulnerability of AIGC detectors and find that detectors vary in their vulnerability to different post-processing operations. Then, given this finding and considering that the target detector is unknown in real-world scenarios, we propose a Realistic-like Robust Black-box Adversarial attack (R2BA) with post-processing fusion optimization. Unlike typical perturbations, R2BA uses real-world post-processing operations, i.e., Gaussian blur, JPEG compression, Gaussian noise, and light spots, to generate adversarial examples. Specifically, we use a stochastic particle swarm algorithm with inertia decay to optimize the post-processing fusion intensity and explore the detector’s decision boundary. Guided by the detector’s fake probability, R2BA strengthens the post-processing operations to which the detector is vulnerable and weakens those to which it is robust, striking a balance between adversariality and invisibility. Extensive experiments on popular and commercial AIGC detectors and datasets demonstrate that R2BA exhibits impressive anti-detection performance, excellent invisibility, and strong robustness in GAN-based and diffusion-based cases. Compared to state-of-the-art white-box and black-box attacks, R2BA shows significant improvements of 15%–72% and 21%–47% in anti-detection performance under the original and robust scenarios, respectively, offering valuable insights for the security of AIGC detection in real-world applications.
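
The optimization step can be sketched with a standard particle swarm optimizer whose inertia weight decays linearly over iterations. This is a generic PSO under assumptions, not the authors' exact algorithm: the four dimensions stand in for the intensities of the four post-processing operations, and `toy_objective` is a hypothetical quadratic that replaces the detector's fake probability, which would be queried as a black box in the real attack.

```python
import random


def pso_minimize(objective, dim, iters=100, particles=20,
                 w_start=0.9, w_end=0.4, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over [0, 1]^dim with linearly decaying inertia."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]
    for t in range(iters):
        # Inertia decays linearly, shifting from exploration to exploitation.
        w = w_start - (w_start - w_end) * t / max(1, iters - 1)
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp each intensity to the valid range [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


# Hypothetical stand-in for the detector's fake probability: a quadratic bowl
# whose minimum plays the role of the most effective intensity mix.
TARGET = [0.3, 0.7, 0.1, 0.5]


def toy_objective(x):
    return sum((a - b) ** 2 for a, b in zip(x, TARGET))


best_intensities, best_score = pso_minimize(toy_objective, dim=4, iters=200)
```

Only objective values are used, never gradients, which is why a swarm-based search suits the black-box setting described in the abstract.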

Abstract:
Physiological signs are key indicators of cardiovascular health and can be estimated using remote photoplethysmography. Their estimation in dark environments is particularly important; there, infrared-based methods have predominantly been applied because they are illumination resistant. However, the extracted signals have poor pulsatile strength and a low signal-to-noise ratio, eventually resulting in spurious estimates. Conversely, RGB-based methods exhibit stronger pulsatile strength but are hindered by poor illumination. To overcome these limitations, we propose 2E1D-Net, trained using a self-created database acquired in a dark environment with marginal illuminance ≤ 1 lux. It comprises dual encoders that take paired input images captured at different exposure levels and project them to a latent space. The decoder then elevates the noise (darkness) component from the dark image, followed by multiscale feature fusion, to produce enhanced images. 2E1D-Net was trained using a linear combination of multiscale structural similarity index (MS-SSIM), L1, and L2 losses. Subsequently, RGB heart rate and oxygen saturation methods cascaded with the trained 2E1D-Net were tested on self-created and public databases. Experimental results show the superiority of 2E1D-Net over the state of the art, extending the ability of RGB methods to perform physiological measurements in the dark and thereby positioning RGB as a reliable and clinically relevant alternative to infrared methods without performance compromise.
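
The training objective described above can be written out as follows. This is a sketch under assumptions: the weights $\lambda_1, \lambda_2, \lambda_3$ are hypothetical symbols (the abstract gives no values), $\hat{y}$ and $y$ denote the enhanced and reference images with $N$ pixels, and the MS-SSIM term is written in the common "one minus similarity" form, which the abstract does not confirm.

```latex
\mathcal{L}
  = \lambda_{1}\,\mathcal{L}_{\text{MS-SSIM}}
  + \lambda_{2}\,\mathcal{L}_{1}
  + \lambda_{3}\,\mathcal{L}_{2},
\qquad
\mathcal{L}_{\text{MS-SSIM}} = 1 - \operatorname{MS-SSIM}(\hat{y}, y),
\quad
\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\lvert \hat{y}_{i} - y_{i}\rvert,
\quad
\mathcal{L}_{2} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i} - y_{i}\right)^{2}
```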

Abstract:
Angiographic enhancement of non-contrast CT (NCCT) using AI techniques is essential for diagnosing patients who cannot receive contrast agents. However, AI angiography remains a challenging task because of feature fragility, structural complexity, and spatial continuity. In this paper, we propose an angiographic framework based on a conditional multi-view diffusion model, called AEGIS, with three innovations: multi-view hybrid learning (MHL), conditional angiographic diffusion estimation (CADE), and multi-view map fusion (MMF). 1) MHL targets the Contrast Map (CM), the difference between NCCT and CT angiography, from multiple views to perceive 3D features in 2D space, enhancing the stability of feature representation. 2) CADE is a conditional diffusion model using NCCT as spatial guidance, providing crucial information for CM generation. 3) MMF adopts a lightweight AutoEncoder for filtering and fusing multi-view CMs, maintaining coherence between adjacent slices while correcting slight bias in low-dimensional representations, thus optimizing data quality and accuracy. Experiments demonstrate the superior performance of AEGIS, which achieves state-of-the-art image quality (PSNR +6.69, SSIM +3.17, MSE -46.38), segmentation evaluation (CADIR ×10.49, HSDIR ×5.57), and feature distance (FID -64.27). Visualizations and positive evaluation scores from clinicians further reveal that AEGIS has significant potential in clinical applications.

Abstract:
The quality evaluation of audio-visual (A/V) content has become increasingly critical in modern multimedia communication systems. Traditional single-modality quality evaluation methods and existing dedicated A/V quality models often fail to accurately assess the quality of A/V signals. To address this challenge, we propose a novel multi-modal cross-attention guided network specifically designed for A/V quality evaluation. By leveraging visual saliency and Mel-spectrum features, our network aims to achieve accurate and comprehensive quality evaluation. Specifically, distorted video frames are first converted into saliency maps, from which perceptually salient patches are selectively extracted and fed into a Convolutional Neural Network (CNN) for intra-frame visual feature extraction. Concurrently, the distorted audio signal is transformed into a Mel-spectrum, and time-frequency patches are extracted via sliding window techniques for CNN-based audio feature extraction. To effectively integrate these features and capture the long-term dependencies across consecutive A/V segments, we design a multi-modal cross-attention module that explicitly models complex inter-modal interactions. The resulting representations are then passed through a series of fully-connected (FC) layers for dimensionality reduction, ultimately deriving the quality score. Extensive experiments on three publicly available A/V quality datasets indicate that our metric outperforms the traditional quality metrics and newly-developed A/V quality metrics. The source code will be released at https://github.com/Jour3141/avqa
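
Two preprocessing steps named above, the Mel scale and sliding-window patch extraction, can be sketched in a few lines. The conversion uses the widely used 2595 * log10(1 + f / 700) variant of the Mel formula; whether the paper uses this variant, and the patch width and hop shown here, are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def sliding_patches(frames, width, hop):
    """Cut a sequence of spectrogram frames into overlapping patches."""
    return [frames[start:start + width]
            for start in range(0, len(frames) - width + 1, hop)]


# Ten dummy frame indices stand in for the columns of a Mel-spectrogram.
patches = sliding_patches(list(range(10)), width=4, hop=2)
```

With a width of 4 and a hop of 2, consecutive patches overlap by half, so each time-frequency region is seen by two patches before features are passed to the CNN.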

Abstract:
Volumetric video represents 4D (i.e., dynamic 3D) scenes captured from multiple viewpoints, offering rich spatial and temporal information for immersive applications. Despite advances in diffusion-based editing for images, videos, and 3D objects, achieving precise, spatiotemporally coherent edits in volumetric videos remains challenging due to complex geometry and dynamic motion. In this work, we propose 2^2 DEditor, a novel zero-shot volumetric video editing framework that enforces temporal-spatial coherence using a single 2D diffusion model. A key component, Spatiotemporal Consistent Null-Text Optimization (SCNO), jointly models continuity across viewpoints and timestamps, mitigating inconsistencies in multi-frame editing. To further enhance fidelity, the Source-Attention-Guided Editing (SAGE) module leverages self-attention maps to preserve spatial structures, reuses cross-attention maps for semantic consistency, and aggregates multi-layer source attention to prevent unintended edits when target objects exit the scene. Building upon these components, we develop a round-by-round editing pipeline, enabling diverse local and global modifications—including semantic, stylistic, and attribute-based adjustments—while maintaining coherent dynamic scene structure. Extensive experiments on multi-view volumetric video datasets demonstrate that our approach significantly improves editing precision, semantic fidelity, and spatiotemporal consistency, outperforming state-of-the-art zero-shot methods and establishing a new benchmark for high-fidelity volumetric video editing.

Abstract:
Zero-shot skeleton action recognition endeavors to classify novel action categories by transferring previously learned seen skeleton-semantic priors to unseen categories. However, current methods struggle to distinguish highly similar action categories, primarily due to the coarse-grained cross-modal alignment and non-discriminative representation space. To address these issues, we propose STAR++, a novel framework that aligns skeleton and semantics in a fine-grained and conditional manner. The key idea is to first establish region-level correspondences between body parts and semantic cues, and then utilize these local alignments to inform a global alignment process. This design is inspired by human visual cognition, which first attends to crucial local details before perceiving the broader scene. Concretely, we refine both skeleton and semantic representations with a dual-prompt attention mechanism driven by the structural decomposition of the human body and side information generated by a large language model (LLM). This encourages skeleton representations to be more compact within each class and semantic embeddings to be more separable across classes, which helps resolve ambiguity between highly similar actions and provides better interpretability of how unseen actions are perceived. Furthermore, we construct a region-aware holistic fusion module that aggregates these fine-grained features into a unified representation, yielding more discriminative holistic representations. Finally, the global alignment is conditioned on region-aware semantics feedback derived from fine-grained alignment, forming a conditional process that achieves more effective cross-modal alignment. Extensive experiments on four mainstream benchmarks demonstrate that our method achieves state-of-the-art performance in the zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings.
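
At inference time, zero-shot recognition of this kind typically reduces to comparing a skeleton feature with the semantic embedding of each unseen class and picking the closest one. The sketch below shows only that final matching step with cosine similarity; it omits the dual-prompt attention and conditional alignment of STAR++, and the class names and vectors are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(skeleton_feature, class_embeddings):
    """Return the class whose semantic embedding is most similar."""
    return max(class_embeddings,
               key=lambda name: cosine(skeleton_feature, class_embeddings[name]))


# Hypothetical embeddings for two unseen classes in a shared 3-D space.
unseen = {"wave": [1.0, 0.0, 0.2], "clap": [0.0, 1.0, 0.2]}
prediction = zero_shot_predict([0.9, 0.1, 0.1], unseen)
```

When two classes have nearly identical embeddings, this matching step becomes ambiguous, which is the failure mode that finer-grained alignment aims to reduce.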

Abstract:
Event cameras capture asynchronous pixel-level intensity changes, leading to wide application in monocular depth estimation in high-speed and low-light environments. Existing event-based depth estimation suffers from two issues: (a) event sparsity and pollution, due to the large amount of spike noise and light sensitivity; (b) spatiotemporal relation deviation, due to the uneven distribution of spikes along the temporal axis. Inspired by attention calibration in natural language processing, we propose a Filtering and Alternating Calibration network (FAC), using a U-Net architecture with Swin Transformer-based encoders and decoders. The key components of FAC are filtering-based temporal context fusion (FTF) modules and alternating-calibration-based spatiotemporal context fusion (ACSF) modules, serving as the skip connections between pairwise encoded and decoded feature maps. To address issue (a), each FTF module learns the cross attention between the encoded feature maps of the current event frame and those of the previous event frame, where the latter is filtered with a low-pass filter to reduce redundancy and irrelevant noise. To address issue (b), each ACSF module applies alternating attention calibration between the temporal context fusion map (the output of the FTF module) and the decoded feature map, which facilitates spatial context interaction and calibrates long-range spatiotemporal relations. Thanks to the alternating attention calibration, the encoded and decoded feature maps calibrate each other with motion-corrected temporal contexts and deeper spatial contexts, respectively. Experiments on the MVSEC and DENSE datasets show that FAC outperforms several state-of-the-art depth estimation approaches on several metrics. The code is available at https://github.com/wangsfan/FAC

Abstract:
The preponderance of traffic accidents takes place within urban intersection areas. It is crucial to extrapolate accident risk maps specifically for these locations to proactively mitigate and prevent future occurrences of traffic accidents. Nevertheless, inferring fine-grained intersection risk remains a challenging task, primarily due to the intricate structure of the road network, variability in scene information, and the stringent requirements for high-quality data. In this work, we propose an end-to-end Adaptive Risk-Feature Aware Fusion Network (ARFAF-Net) based on multimodal data to achieve fine-grained inference of traffic risk maps in intersection areas. Specifically, we introduce a Heterogenous Feature Adaptive Fusion Module to extract complementary features of risk from satellite imagery and streetscape data. The Dynamic Correlation Analysis Module is used to capture large-scale changes in risk in the region to improve multi-scale information perception. In addition, the Macro-Aware Guided Fusion Module is used to introduce macro-satellite image data features to enhance the accuracy of perceiving risks at the periphery of intersections. Ultimately, pixel-level extrapolation maps of intersection crash risk are generated, thereby offering more cost-effective and rational guidance for crash prevention. Both quantitative evaluation and qualitative analysis on real-world datasets demonstrate that the proposed ARFAF-Net achieves superior performance. Finally, the codes and models used in this study for intersection crash risk map inference are available at https://github.com/gwt-ZJU/ARFAF-Net

Abstract:
With the advent of the intelligent era, increasing attention has been given to facial aesthetics. While the academic community has achieved notable progress in facial aesthetic research, current efforts predominantly concentrate on two isolated subtasks: aesthetic evaluation and enhancement. Crucially, the intrinsic correlation between these tasks and their integration within a unified framework remain underexplored. To bridge this gap, this paper proposes a facial aesthetic enhancement and prediction network based on differential average aesthetic perceptions (Diff-AEPNet) that synergistically combines facial aesthetic enhancement with prediction. The proposed framework implements a four-stage architecture: 1) a transformer module learns latent code beautification trajectories to guide pre-enhancement feature modification; 2) a dual-stream encoder extracts and contrasts pre- and post-beautification features to refine evaluation accuracy; 3) a lightweight network generates an attention-guided image mask for image fusion; and 4) a deghosting block eliminates fusion artifacts through residual learning. The experimental results demonstrate that the model achieves a favorable beautification effect in the enhancement task and generalizes better across datasets in the evaluation task than existing aesthetic evaluation models.

Abstract:
Self-supervised category-level 6D pose estimation has emerged as a task of paramount significance within the field of computer vision. Despite recent advancements, current self-supervised methods grapple with two critical challenges. Primarily, the ability of existing networks to accurately reconstruct object models is constrained by pronounced part-level shape variations across specific categories. Additionally, the persistent many-to-one ambiguity within pixel-to-point cloud correspondences poses a significant barrier to achieving robust performance. To address these challenges, we propose a novel approach that includes a Language-Assisted Memory-Encoding Shape Reconstruction (LMR) module and a Coarse-to-Fine Correspondence Optimization (CFCO) module. In the LMR module, language descriptions are leveraged to bridge the gap between virtual and real images, thereby improving the alignment between learned representations and real-world object appearances. Additionally, a memory encoding mechanism is introduced to enhance reconstruction accuracy by capturing fine-grained shape variations. The CFCO module utilizes Hungarian matching to generate one-to-one pseudo labels at both region and pixel levels, providing explicit supervision for the corresponding similarity matrices. Furthermore, this process helps alleviate the many-to-one ambiguity to some extent, leading to more accurate correspondence learning. We evaluate our method on the REAL275 and WILD6D datasets. Extensive experiments demonstrate that our self-supervised approach outperforms existing methods and achieves new state-of-the-art results within the self-supervised framework.
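The one-to-one pseudo-labeling step described above can be illustrated with a minimal sketch. It assumes a small dense similarity matrix and uses SciPy's `linear_sum_assignment` as the Hungarian-style solver; the function name `one_to_one_pseudo_labels` and the toy values are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def one_to_one_pseudo_labels(similarity):
    """Build a binary label matrix with at most one match per row and column.

    `similarity[i, j]` is an assumed similarity score between item i
    (e.g. a pixel or region) and item j (e.g. a point or region).
    """
    # Solve the assignment problem, maximizing the total similarity.
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    labels = np.zeros(similarity.shape, dtype=np.int64)
    labels[rows, cols] = 1
    return labels


# Toy example: a row-wise argmax would map rows 0 and 1 to the same column
# (many-to-one), whereas the assignment enforces a one-to-one matching.
similarity = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
])
labels = one_to_one_pseudo_labels(similarity)
```

In this sketch the resulting binary matrix could serve as an explicit supervision target for the similarity matrix, which is the role the abstract assigns to the pseudo labels.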

Abstract:
Sewing patterns form the structural foundation of the fashion industry, translating 2D conceptual designs into 3D manufacturable garments. With the rise of deep learning, 3D fashion design has seen transformative advancements, automating traditionally labor-intensive processes and enabling intelligent, data-driven workflows. Recent studies have demonstrated promising progress in areas such as sewing pattern generation, reconstruction, and 3D garment modeling. However, to date, no systematic review has specifically examined the integration of deep learning with sewing pattern–driven 3D fashion design. This survey addresses that gap by providing the first comprehensive overview of the field. We propose a novel four-stage pipeline, including representation, generation, reconstruction, and editing, to categorize and analyze current research. Within this pipeline, we review core methodologies, including geometric encoding for pattern representation, data-driven pattern generation, reconstruction from multimodal inputs (e.g., images, sketches, or text), and intuitive 3D garment editing techniques. We also consolidate existing benchmarks, covering both datasets and evaluation metrics, and contextualize the pattern-driven paradigm through comparison with alternative approaches in 2D and pattern-free 3D design. Finally, we identify key challenges, such as limited data availability and the difficulty of incorporating domain-specific design constraints, and outline future research directions to address these issues. By synthesizing current developments and structuring the research landscape, this survey serves as a foundational resource to support and accelerate innovation in manufacturable sewing pattern-driven 3D fashion design.

Abstract:
Sparse view synthesis from a single stereo pair is a crucial but challenging task for light field display. Although 3D Gaussian Splatting demonstrates exceptional performance in novel view synthesis, its results degrade significantly when training data is insufficient. To solve this problem, we introduce Stereo-Gaussian, which combines stereo matching and pseudo cameras to regularize the optimization of 3D Gaussian Splatting. We find that directly using the point cloud from stereo matching in 3D Gaussian Splatting suffers from a mismatch between cameras and points. Consequently, we introduce a pose refinement module to optimize the poses of the training cameras before the training setup. To further overcome the overfitting problem introduced by sparse input, we utilize pseudo cameras and depth-image-based rendering to directly regularize the rendered image rather than the rendered depth used by other methods. Extensive experiments on several datasets validate its efficacy over other state-of-the-art sparse-view 3D Gaussian Splatting methods, achieving real-time rendering results with fidelity while maintaining strong compatibility with other methods. Our code will be made available for research.

Abstract:
Human-centric Video Anomaly Detection (VAD) aims to identify human behaviors that deviate from normal. At its core, human-centric VAD faces substantial challenges, such as the complexity of diverse human behaviors, the rarity of anomalies, and ethical constraints. These challenges limit access to high-quality datasets and highlight the need for a dataset and framework supporting continual learning. Moving towards adaptive human-centric VAD, we introduce the HuVAD (Human-centric privacy-enhanced Video Anomaly Detection) dataset and a novel Unsupervised Continual Anomaly Learning (UCAL) framework. UCAL enables incremental learning, allowing models to adapt over time, bridging traditional training and real-world deployment. HuVAD prioritizes privacy by providing de-identified annotations and includes seven indoor/outdoor scenes, offering over 5× more pose-annotated frames than previous datasets. Our standard and continual benchmarks utilize a comprehensive set of metrics and demonstrate that UCAL-enhanced models achieve superior performance in 83.33% of cases, setting a new state-of-the-art (SOTA). The dataset can be accessed at https://github.com/TeCSAR-UNCC/HuVAD

Affiliations: Laboratory of Beijing Engineering Research Center of Mixed Reality and Advanced Display, School of Optics and Photonics, Beijing Institute of Technology, Beijing, China; School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China; School of Medical Technology, Beijing Institute of Technology, Beijing, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Yellwin Company Ltd., Beijing, China; Department of Cardiac Surgery, Beijing Anzhen Hospital, Capital Medical University, Beijing, China

Abstract:
Accurate detection of multi-shape coronary artery stenoses from X-ray angiography (XRA) sequences plays a crucial role in diagnosing and planning interventions for coronary artery disease. However, vessel overlap, background noise, and nonlinear cardiac motion introduce significant challenges. These factors often result in missed detections, intra-frame class conflict, and temporal category drift, particularly for subtle and morphologically complex stenoses such as focal and bifurcation stenoses. To address these challenges, we propose a Hierarchical Heterogeneous Aggregation Network that effectively integrates both spatial and temporal cues across XRA sequences. The proposed framework incorporates a Channel Importance-guided Fusion module, which aims to enhance the representation of small-stenosis features by dynamically selecting high-importance channels across scales. Furthermore, we introduce a Hierarchical Heterogeneous Aggregator designed to reduce spatial redundancy and explicitly generate discriminative features across frames based on heterogeneous relationships, thereby improving temporal consistency and classification robustness. Experiments conducted on two clinical datasets indicate that our method outperforms existing detectors and stenosis methods in terms of detection accuracy and generalization.

Abstract:
In medical image segmentation, the reliance on extensive, high-quality labeled datasets poses a significant challenge, especially considering the associated costs and the requirement for specialized expertise. In response, the field has progressively embraced semi-supervised learning (SSL) methods that leverage both labeled and unlabeled data. Nonetheless, these methods frequently encounter issues related to inconsistent label quality and constrained generalizability of models. To surmount these obstacles, we present InterTeach, an SSL framework that integrates cross-supervision with the mean teacher model. This framework facilitates effective knowledge transfer and boosts model performance through two teacher-student training configurations, in which knowledge is exchanged between models via their respective teacher counterparts, facilitating mutual learning and enhancement. This strategy diverges from traditional SSL approaches, which mainly depend on mutual learning between two models updated through gradient descent. Furthermore, the incorporation of Feature Divergence Loss (FDL) in InterTeach encourages the transfer of diverse and complementary knowledge between models, thereby enriching the overall learning dynamics. The evaluation results reveal that our method can approach or even match the performance of fully supervised learning methods on certain evaluation metrics. This finding further confirms the effectiveness and wide applicability of the InterTeach method in handling multi-modal and multi-dimensional medical image segmentation tasks.
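For readers unfamiliar with the mean teacher model mentioned above, the following is a minimal sketch of its core mechanism: the teacher's weights are an exponential moving average (EMA) of the student's weights rather than being updated by gradient descent. Parameters are represented here as a plain dictionary of floats for illustration only; the function name `ema_update` and the decay value are assumptions and do not reflect InterTeach's actual code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    `teacher` and `student` are dicts mapping parameter names to floats
    (a stand-in for real network weights).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy example with two scalar "weights".
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
# teacher["w"] is now 0.9 * 1.0 + 0.1 * 2.0, and teacher["b"] is 0.1 * 1.0
```

A larger `decay` makes the teacher change more slowly, which is the usual reason its predictions are treated as more stable targets for the student.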

Abstract:
Although large pretrained stable diffusion (SD) models can generate high-quality images from prompts, they cannot generate images that are consistent with the fine-grained characteristics of a specific identity V^ (e.g., an anime character). Subject-driven generation focuses on exploring and leveraging the prior knowledge within a model to achieve the goals of ID and context preservation. There have been efforts, such as DreamBooth, to conduct subject-driven generation; however, they suffer from ID and context mistakes. An ID mistake means a feature loss of V^ , and a context mistake means that the generated image does not align with the given prompt. To rectify these problems, in this paper, we propose masked fine-tuning for efficient feature learning of V^ , then propose IP-Controller for decomposing and optimizing cross-attention maps of V^ and prompt words other than V^ . Specifically, we generate the cross-attention map using a vanilla input prompt and decompose it into an ID cross-attention map (matching V^ ) and a context cross-attention map (matching prompt words other than V^ ). Next, we generate fitter ID and context cross-attention maps on the basis of the input ID and context prompts, respectively. We optimize the ID and context cross-attention maps with the fitter ID and context cross-attention maps, respectively, so that the diffusion process pays fitter attention for specific contents. Experiments show that IP-Controller correctly integrates the core features of V^ and the semantic context of the prompt words other than V^ and generates high-quality images for the given prompt.

Abstract:
Complex scene segmentation aims to segment objects with intricate details or those concealed within the background. Despite significant advancements, a persistent challenge remains: accurately identifying object edges in backgrounds with high inherent similarity and complex structures. To address this, we identify the prevalent spectral bias in image segmentation, where networks preferentially learn low-frequency information, as a key impediment to recognizing and learning object edges, which are rich in high-frequency details. To mitigate this bias, we propose MCNet, a segmentation framework designed to promote balanced frequency learning. MCNet comprises two primary components: multi-frequency perception (MP), which independently captures high-frequency details and low-frequency structural components of objects, and complementary fusion (CF), which intelligently fuses these distinct frequency features through learnable, adaptive mechanisms. Crucially, MCNet employs a novel frequency-aware consistency adversarial loss to explicitly guide the learning across different frequency bands. MCNet effectively integrates MP and CF, enhancing the detection of high-frequency details and low-frequency structures, thereby alleviating challenges posed by spectral bias. We evaluate the proposed method on complex scene segmentation tasks, including camouflaged object detection and dichotomous image segmentation. Through extensive comparisons with 31 existing methods across 8 benchmark datasets, we demonstrate the superiority of the proposed method.
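To make the notion of low-frequency structure versus high-frequency detail concrete, here is a minimal sketch that splits a grayscale image into the two bands with an ideal low-pass mask in the Fourier domain. This is a generic signal-processing illustration, not MCNet's multi-frequency perception module; the function name `split_frequencies` and the `cutoff` parameter are hypothetical.

```python
import numpy as np


def split_frequencies(image, cutoff=0.1):
    """Split a 2D image into low- and high-frequency parts.

    `cutoff` is a radius in normalized frequency units (cycles per pixel);
    everything at or below it is treated as low frequency.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Radially symmetric ideal low-pass mask.
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    low = np.fft.ifft2(np.fft.fft2(image) * mask).real
    # The high-frequency part is the residual, so low + high == image.
    high = image - low
    return low, high


rng = np.random.default_rng(0)
image = rng.random((32, 32))
low, high = split_frequencies(image, cutoff=0.1)
```

Object edges contribute mostly to `high`, which is why a network biased toward low-frequency content tends to blur them.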

Affiliations: School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China; School of Biological Processes, Chongqing University, Chongqing, China; The Second Medical Center, and the National Clinical Research Center for Geriatric Diseases, Chinese PLA General Hospital, Health Management Institute, Beijing, China; Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Shenzhen, China; GIOME, California Medical Innovations Institute, San Diego, CA, USA

Abstract:
The scarcity of high-quality annotated data in medical imaging significantly constrains the performance of deep learning-based segmentation models. While few-shot medical image segmentation (FSMIS) has emerged as a promising solution, existing methods exhibit critical limitations when handling scenarios with inter-class similarity between foreground-background regions and intra-class heterogeneity within foreground objects. Current prototype-based approaches focus primarily on the holistic extraction of the prototype from support images, failing to distinguish subtle anatomical variations and complex feature representations effectively. To address these limitations, we propose AVT-ProNet, which features three innovative components: 1) an Adaptive Visual-Text Prototype Generation (AVPG) module leveraging CLIP’s cross-modal guidance capabilities through adaptive prompting strategies; 2) a graph-based multiregion prototyping relationship optimization (GMPRO) module establishing structural relationships between decomposed subregion prototypes via graph neural networks; and 3) a foreground-background prototyping contrast learning (FBPCL) strategy implementing dual-space optimization through inter-class separation and intra-class compactness. The synergistic integration of multi-modal guidance, structural relationship modeling, and contrastive prototype refinement enables our framework to overcome existing limitations in FSMIS. Comprehensive evaluations across multiple clinical scenarios (CHAOS, SABS, and CMR datasets under diverse training configurations) demonstrate superior performance over state-of-the-art approaches, including PANet, CAT-Net, DMAP, and recent PAMI baselines. Source code is available at https://github.com/394481125/AVT-ProNet

Abstract:
Unsupervised person re-identification aims to retrieve a given pedestrian image from unlabeled data. Clustering and assigning pseudo-labels has become the mainstream approach, but several problems still reduce recognition accuracy. On the one hand, during clustering, poor classification of hard samples between neighboring classes leads to inadequate clustering accuracy, which affects the quality of pseudo-labels. On the other hand, the representational capacity of features extracted by the backbone network is also crucial for the model’s performance. To this end, this paper proposes an unsupervised person re-identification method based on nearest neighbor sample constraint and ordinary differential equation guided feature reconstruction (NNSC-FR) to improve the clustering accuracy and pseudo-label quality while enhancing the representation of features. Specifically, we present a novel nearest neighbor sample constraint (NNSC), applied after neighbor sample mining for each instance sample, to achieve fine-grained classification of hard samples between classes. To further improve clustering accuracy, an inter-class balance loss (CB loss) is introduced to better identify the hard samples between the nearest neighbor classes. In addition, guided by the third-order Adams solution of the Ordinary Differential Equation, we design a Feature Reconstruction (ODE-FR) module with residual structure to improve the model representation ability. Extensive experimental results on Market-1501, DukeMTMC-reID, and MSMT17 demonstrate that our proposed method is superior to the state-of-the-art methods.

Abstract:
Open-set Supervised Anomaly Detection (OSAD) seeks to detect novel anomalies that are unseen during training. However, existing OSAD works fail to learn a comprehensive margin that separates normal and anomalous samples in the feature space owing to the lack of restriction on the abnormal distribution. To this end, we propose a new OSAD scheme based on Margin Metric Learning, which fully exploits the differences between normal and abnormal events in the latent feature dimension. The developed framework embodies three major components. First, video clips including normal events and restricted anomalies are input into an Attention-embedded Spatial Convolutional Network to extract the spatial feature sequences. Then, the spatial feature sequences are input into a Temporal Convolutional Siamese Network to further obtain the temporal features. Second, a novel Quadruplet Contrastive Loss is designed based on the spatial-temporal features of normal and abnormal video clips to conduct the margin learning, which enlarges the inter-class distances between anomalous and normal instances and reduces the intra-class distances within anomalous instances and within normal instances. This helps to obtain compact normal and abnormal distributions. Finally, during the testing stage, a simplified metric distance-based anomaly detection strategy is proposed to calculate the anomaly score of each testing clip. Extensive experiments and ablation studies on the Avenue and ShanghaiTech datasets demonstrate the effectiveness and efficiency of our proposed method for discovering unseen anomalous events from limited types of known anomalies.

Abstract:
Surround-view perception has garnered significant attention for its ability to enhance the perception capabilities of autonomous driving vehicles through the exchange of information with surrounding cameras. However, existing surround-view perception systems are limited by an inefficient unidirectional pattern of interaction with humans and by distortions in overlapping regions that propagate exponentially into non-overlapping areas. To address these challenges, this paper introduces ChatStitch, a surround-view human-machine co-perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To dismantle the unidirectional interaction bottleneck, ChatStitch implements a cognitively grounded closed-loop interaction multi-agent framework based on Large Language Models. To suppress distortion propagation across overlapping boundaries, ChatStitch proposes SV-UDIS, a surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D and MCOV-SLAM open datasets and on our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively. The code is available at https://github.com/lhlawrence/ChatStitch

Abstract:
Infrared-visible image fusion (IVF) aims to integrate complementary information from infrared and visible sensors into a single, more informative representation. However, achieving both visual clarity and semantic consistency in the fused results remains a critical challenge, particularly for real-world applications like scene understanding. To address this, we propose DICFusion, a deep integrated and semantic-coordinated network, tailored for perceptually and semantically enriched infrared-visible fusion tasks. Firstly, the DICFusion employs a novel modality-aware fusion strategy to integrate infrared and visible modalities into a cohesive feature embedding. Secondly, the framework incorporates hybrid mamba-convolution blocks, which leverage the combined strengths of mamba and convolution neural networks to accurately capture both global context and localized details while maintaining computational efficiency. To mitigate the feature heterogeneity between fusion and downstream tasks, DICFusion adopts a comprehensive framework of deep integration and collaborative optimization. This design utilizes a unified multi-scale encoder to harmonize feature representations, followed by parallel fusion and segmentation branches to enhance both visual quality and task performance. Moreover, a semantic guidance module leveraging cross-attention mechanism is incorporated to refine the semantic consistency of the fused outcomes. Comprehensive experimental evaluations validate the performance and efficiency of DICFusion, demonstrating its superiority over contemporary state-of-the-art methods, both in terms of fusion visual quality and downstream task precision. The code is available at https://github.com/fd-qhwang/DICFusion

Abstract:
Point cloud is often regarded as a discrete sampling of Riemannian manifold and plays a pivotal role in the 3D image interpretation. Particularly, rotation perturbation, an unexpected small change in rotation caused by various factors (like equipment offset, system instability, measurement errors and so on), can easily lead to the inferior results in point cloud learning tasks. However, classical point cloud learning methods are sensitive to rotation perturbation, and the existing networks with rotation robustness also have much room for improvements in terms of performance and noise tolerance. Given these, this paper remodels the point cloud from the perspective of manifold as well as designs a manifold distillation method to achieve the robustness of rotation perturbation without any coordinate transformation. In brief, during the training phase, we introduce a teacher network to learn the rotation robustness information and transfer this information to the student network through online distillation. In the inference phase, the student network directly utilizes the raw 3D coordinate information to achieve the robustness of rotation perturbation. Experiments carried out on four different datasets verify the effectiveness of our method. On average, on the ModelNet40 and ScanObjectNN classification datasets with random rotation perturbations, our method improves classification accuracy by 4.41% and 3.65%, respectively, compared to popular rotation-robust networks. Similarly, on the ShapeNet and S3DIS segmentation datasets, our method achieves improvements in mIoU of 6.96% and 5.12%, respectively. Furthermore, the experimental results also demonstrate that our algorithm exhibits higher computational efficiency and stronger resistance to noise and outliers.

Abstract:
Adapters and prompt learning have become two de facto strategies to fine-tune pre-trained vision-language models, mitigating the high computational cost of fine-tuning an entire model for downstream tasks. They can align the prediction from the fine-tuned model with that from the pre-trained model. However, the existing methods of these strategies primarily focus on aligning within a single modality, and the exploration of bidirectional interactions between modalities remains limited. To address this issue, we propose a closed-form dual alignment mechanism (DAM) that not only ensures the consistency in predictions within a single modality but also achieves the alignment of features across different modalities. In DAM, all alignments are achieved by closed-form solutions to ridge regression, without inducing a massive number of learnable parameters. Experimental results demonstrate that DAM outperforms the state-of-the-art methods on 11 benchmarks over various evaluation metrics. Our codes are available at https://github.com/Peiy-Lu/DAM
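The phrase "closed-form solutions to ridge regression" refers to the standard estimator W = (XᵀX + λI)⁻¹XᵀY, which needs no iterative training. The sketch below shows that generic formula on synthetic data; it is not DAM's alignment procedure, and the names `ridge_closed_form`, `X`, `Y`, and `lam` are illustrative assumptions.

```python
import numpy as np


def ridge_closed_form(X, Y, lam=1e-3):
    """Solve min_W ||X W - Y||^2 + lam * ||W||^2 in closed form.

    X has shape (n_samples, d_in) and Y has shape (n_samples, d_out).
    """
    d = X.shape[1]
    # Solve the regularized normal equations instead of forming an inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


# Synthetic example: recover a known linear map from noise-free data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
W_true = rng.normal(size=(4, 3))
Y = X @ W_true
W = ridge_closed_form(X, Y, lam=1e-6)
```

Because the solution is a single linear solve, the only quantity to choose is the regularization strength `lam`, which is consistent with the abstract's point about avoiding a large number of learnable parameters.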

Abstract:
Various forms of degradation, including noise, blur, and adverse weather conditions (e.g., rain, snow, and fog), significantly compromise video quality and system reliability across critical domains ranging from surveillance and medical imaging to entertainment. Previous research mainly focuses on network models tailored to specific degradation types, while recent unified frameworks and foundation models still face critical challenges in temporal consistency, automated degradation recognition, and detail preservation. Despite recent advances in foundation models, current approaches rely heavily on predefined degradation labels and remain focused on image-level operations, limiting their generalization to real-world scenarios and struggling with preserving fine-grained details. To address these challenges, we propose Grid Splicing Diffusion Model (GSDiff), a general framework for video reconstruction that leverages a novel grid splicing execution alongside instruction-tuned Large Language Model (LLM). GSDiff introduces three key innovative modules: 1) a LLM-driven degradation recognition module that enables automatic and fine-grained restoration guidance through zero-shot degradation analysis, 2) a Grid Splicing Module that organizes multiple frames into a unified grid structure to facilitate spatiotemporal feature processing, and 3) a Detail Preservation Module integrated with a Tail Refine Network to enhance fine-grained details during diffusion and post-processing. Extensive experiments demonstrate that GSDiff delivers state-of-the-art performance across a wide range of reconstruction tasks, including deraining, desnowing, denoising, and deblurring, propelling advancements in medical diagnostics and smart city applications.

Abstract:
Light Field (LF) is extensively utilized for depth estimation tasks due to its rich structural information. However, real-world LF images often encounter reflective and transparent surfaces and the related regions contain depth information from the reflection and background layers, which can be modeled as dual-layer scenes. For the existing depth estimation frameworks, the constructed cost volume shows an aliasing bimodal distribution in dual-layer surfaces and further causes serious wrong depth results. In this paper, we propose a novel decoupling-and-aggregating strategy and develop a dual-layer depth estimation network for LF images with complex reflections. Specifically, we develop an adaptive cost volume decoupling module to separate both the background and reflection features from the aliasing cost volume. Light field angular-spatial information is sufficiently extracted to infer the effort of features in different dimensions to the background or reflection layer. Additionally, we employ an iterative self-guided aggregating module with multi-stage supervision to aggregate two branches of cost volumes. The module applies the self-guided masks to regularize the distribution of cost volumes. Given the challenge of acquiring the ground truth disparity maps for the LF images under reflection scenes, we also construct a synthetic dataset with dual-layer properties. Our model is the first to introduce dual-layer scenes into the LF depth estimation task using an end-to-end deep neural network. It successfully separates the background and reflection layers and achieves accurate depth estimation results in both layers. Quantitative and qualitative experiment results on publicly available datasets demonstrate that our method performs better than other state-of-the-art methods.

Abstract:
This paper investigates one of the most challenging problems in single image dehazing: how to restore haze-free scenes solely from the input observed image without relying on paired or unpaired images and how to extract useful prior information from the observed image to guide the dehazing process. To address these challenges, this paper introduces a novel zero-reference real-world image dehazing method via deep self-decoupling and reverse knowledge transfer (ZRID-Net). Specifically, we first employ a model-driven approach to preliminarily decouple the observed image into coarse-grained components: the haze-free image, transmission map, and atmospheric light. Subsequently, we refine the haze-free image and transmission map separately via a data-driven approach. In addition, we propose a novel reverse knowledge transfer method to exploit latent prior information within hazy images thoroughly for dehazing guidance. This method combines knowledge transfer and contrastive learning to reverse guide the refinement network away from haze characteristics. Finally, a perceptual fusion strategy is employed to obtain haze-free images with high visibility and realism. Extensive experiments demonstrate that the proposed ZRID-Net effectively restores image clarity, enhances structural details, and improves color fidelity across various challenging haze conditions without relying on paired or unpaired supervision. On multiple benchmark datasets, ZRID-Net outperforms existing SOTA approaches in terms of both quantitative metrics and visual quality. The results also confirm its strong generalizability and practical applicability to real-world scenarios. The relevant implementation code can be found at https://github.com/cswangshilong/ZRID-Net
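The three components named above (haze-free image, transmission map, atmospheric light) are the terms of the widely used atmospheric scattering model, I = J·t + A·(1 − t). The sketch below composes and inverts that model on synthetic arrays; it assumes this standard formulation and does not reproduce ZRID-Net's decoupling or refinement networks. The function names and the `t_min` clamp are hypothetical.

```python
import numpy as np


def compose_hazy(J, t, A):
    """Synthesize a hazy image I from scene radiance J, transmission t, and airlight A."""
    return J * t + A * (1.0 - t)


def recover_scene(I, t, A, t_min=0.1):
    """Invert the scattering model; t is clamped to avoid division by near-zero values."""
    t = np.maximum(t, t_min)
    return (I - A) / t + A


rng = np.random.default_rng(0)
J = rng.random((4, 4, 3))                  # haze-free image with values in [0, 1)
t = rng.uniform(0.2, 0.9, size=(4, 4, 1))  # per-pixel transmission map
A = 0.8                                    # global atmospheric light (scalar here)
I = compose_hazy(J, t, A)
J_rec = recover_scene(I, t, A)
```

With exact `t` and `A` the inversion recovers `J`; in practice both are unknown, which is why a single-image method has to estimate and refine them.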

Abstract:
Target detection in aerial images captured by unmanned aerial vehicles has long been hampered by occlusion. Active Object Detection (AOD) aims to fundamentally address this issue from the active vision perspective, typically realized through the Deep Reinforcement Learning (DRL) paradigm. However, the active observation policy often suffers from low generalization ability, thus limiting its practical application. In this paper, we propose Divide-and-Conquer Sharpness-Aware Gradient Matching (DC-SAGM), a novel sharpness-based Domain Generalization (DG) method, to effectively enhance the generalization capacity of the agent’s policy. Specifically, we train the agent to learn the active observation policy using the conventional DRL approach. Sharpness-Aware Gradient Matching (SAGM) is employed during training, improving the model’s generalization performance by minimizing the sharpness metric of the loss landscape. Nevertheless, the imperfect state representation and classifier preference in the AOD problem lead to fierce gradient conflicts, deteriorating the effectiveness of SAGM. We address this incompatibility by using a divide-and-conquer strategy and exclude gradient conflicts via the majority-rule gradient surgery operation. Extensive experimental results on the UEVAVD dataset validate DC-SAGM’s superiority in helping the agent’s policy achieve better generalization compared to extensive policy learning approaches.

Abstract:
Particle image-based fluid measurement techniques are widely used to study complex flows in nature and industrial processes. Although particle tracking velocimetry (PTV) has shown potential in various experimental applications for quantitatively capturing unsteady flow characteristics, estimating fluid motion with long displacement and high particle density remains challenging. We propose an artificial-intelligence-enhanced PTV framework to track particle trajectories from consecutive images. The proposed framework, called GOTrack+ (a learning framework with graph optimal transport for particle tracking velocimetry), contains three components: a convolutional neural network-based particle detector for particle recognition and sub-pixel coordinate localization; a graph neural network-based initial displacement predictor for fluid motion estimation; and a graph-based optimal transport particle tracker for continuous particle trajectory linking. Each component of GOTrack+ can be extracted and used independently, not only to enhance classical PTV algorithms but also as a simple, fast, accurate, and robust alternative to traditional PTV programs. Comprehensive evaluations, including numerical simulations and real-world experiments, have shown that GOTrack+ achieves state-of-the-art performance compared to recent PTV approaches. All the codes are available at https://github.com/wuwuwuas/GOTrack.git

Abstract:
Instance shadow detection, a recently proposed task, is more challenging than traditional shadow detection, as it aims to detect shadow instances paired with the associated objects that cast shadows. Previous works focus on outdoor natural light scenes, featuring a unique shadow per object. These methods utilized an offset-based bidirectional relational learning mechanism requiring dual matching, which struggled in scenarios with artificial light where an object may cast multiple shadows. In this paper, we propose a pairwise grouping strategy that enables the model to detect objects and their corresponding shadows in conjunction, eliminating the need for dual matching. The model incorporates the pairwise query, serving as the matching entity for objects and their corresponding shadows, and the pairwise masked attention, which makes the network focus more on the relational features between object and shadow. Additionally, we introduce a morphological representation of object-shadow pairs, and employ contrastive morphological alignment to further refine the pairwise relational features. In this way, the Pairwise Grouping with Contrastive Morphological Alignment (PGCMA) enables the model to detect instance shadows through single-time matching. Moreover, we propose a Universal Scene Shadow-Object dataset (USSO), which is more comprehensive and includes a broader range of scenes than previous datasets. Extensive experiments conducted on both existing and our newly proposed benchmarks demonstrate the superiority of our method over current state-of-the-art approaches. The source code and our proposed dataset are publicly available at https://github.com/ssecv/USSO/

Abstract:
Online Multi-Object Tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, in which data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve on the data association performance of existing methods. The proposed LEGO tracker integrates graph optimization, which efficiently formulates the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, a Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states. Our proposed method, utilising LiDAR alone, has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 3rd among all trackers (both online and offline) and 2nd among all online trackers in the KITTI MOT benchmark for cars (https://www.cvlibs.net/datasets/kitti/eval_tracking.php) at the time of submitting results to the KITTI object tracking evaluation ranking board. Moreover, our method also achieves competitive performance on the Waymo open dataset benchmark.
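
To make the role of the Kalman filter in the state update concrete, here is a minimal sketch of the textbook measurement update for a single scalar state. It is not the LEGO tracker's filter, whose state vector and noise settings are not given in the abstract; the function name `kalman_update` and the numbers in the usage line are illustrative assumptions.

```python
def kalman_update(x_pred, p_pred, z, r):
    """Scalar Kalman measurement update (textbook form).

    x_pred: predicted state, p_pred: predicted variance,
    z: new measurement, r: measurement noise variance.
    Returns the corrected state and variance.
    """
    gain = p_pred / (p_pred + r)            # how much to trust the measurement
    x_new = x_pred + gain * (z - x_pred)    # pull the prediction toward z
    p_new = (1.0 - gain) * p_pred           # uncertainty shrinks after the update
    return x_new, p_new


# Equal prediction and measurement variance: the estimate lands halfway.
state, variance = kalman_update(x_pred=0.0, p_pred=1.0, z=2.0, r=1.0)
```

The update blends the motion-model prediction with the matched detection, which is how temporal coherence enters the tracked state.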

Abstract:
Light field (LF) videos contain rich spatial, angular, and temporal information, resulting in immense data volumes and posing significant challenges for low-bitrate compression. Existing LF video compression methods focus on modifying the structure of traditional video codecs to encode all LF views, but they are insufficient to achieve low-bitrate compression of LF video. To address these limitations, we propose a low-bitrate LF video compression framework that exploits spatial-angular-temporal correlations through sparse coding and joint reconstruction. On the encoding side, we introduce a content-adaptive prediction structure for selecting sparse key-view sequences. This structure is adapted to LF video content, leveraging the most similar view as a reference to enhance prediction accuracy and significantly reduce bitrate. On the decoding side, we observe that pixels missing in the current view are often captured in adjacent angular and/or temporal views. As a result, we develop a spatial-angular-temporal joint reconstruction network that integrates cues across the different domains. This approach supplements missing texture details near occlusion areas and reconstructs high-quality non-key views. Experimental results demonstrate the efficiency of our framework, achieving an average gain of about 60% in terms of bitrate savings and 2 dB in terms of reconstruction quality compared to the state-of-the-art methods.

Affiliations: School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China; School of Cyber Science and Engineering, Wuhan University, Wuhan, China; Department of Applied Artificial Intelligence, School of Convergence, College of Computing and Informatics, Visual Analytics for Knowledge Laboratory (VISKNOW Lab), Sungkyunkwan University, Seoul, Republic of Korea; Department of Computer Science and Engineering, Visual Surveillance Laboratory, National Institute of Technology Rourkela, Rourkela, Odisha, India; TECNALIA, Basque Research and Technology Alliance (BRTA), Derio, Spain; MOE Key Laboratory of AI, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science and Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Cloth-changing Person Re-Identification (CC-ReID) is a challenging data modeling task that involves identifying specific pedestrians wearing different outfits. Existing methods primarily focus on altering clothing color and directly reconstructing appearance to extract features independent of the clothes. However, real pedestrians differ in height, body shape, etc. Such methods are prone to losing the intrinsic information of the original sample (i.e., the person identity) owing to the absence of contextual phenomena (e.g., texture structure and local correlation), which decreases the recognition performance. To address this problem, we propose a framework called HPRNet, or “Human Parsing Reconstruction with Non-Local Multi-Scale Perception Network,” which includes a non-local weighted multi-scale perception (NWMP) module and a parsing reconstruction exploration (PRE) module. In particular, the proposed NWMP module effectively captures the global receptive field of a sample and obtains a contextual correlation between non-neighboring pixels within the sample image. The PRE module is used to achieve a more accurate reconstruction of human body components with a clothing parsing model, to better distinguish features related to clothes from those unrelated to clothes. Extensive experiments were conducted on CC-ReID public datasets (LTCC, PRCC, and CCVID) to demonstrate the effectiveness and competitiveness of the proposed method against state-of-the-art (SOTA) baselines for this complex modeling task.

Abstract:
The goal of Few-Shot Segmentation (FSS) is to segment images of novel categories using few labeled examples. However, FSS tasks face challenges such as over-segmentation and poor generalization. This paper addresses these challenges by employing a triple attention mechanism (TAM) and a hierarchical decoding transformer (HDT). Specifically, TAM is proposed to enhance the model’s ability to focus on spatial regions within query features that are semantically relevant to the target category. The HDT module then aggregates the enhanced query features with the support features in a decoupled manner, generating dense features with pixel-level semantic relevance, which improves the segmentation ability on novel classes. Additionally, considering that class-level labels within an image can provide weak supervision for the segmentation task, this paper introduces a contrastive language-image pretraining (CLIP) based model to enhance the segmentation performance. The Grad-CAM mechanism is utilized to convert the class logit scores from CLIP into localization heatmaps, effectively leveraging the text label information to provide prior localization cues for the model. Extensive experiments conducted on the PASCAL-$5^i$ and COCO-$20^i$ datasets demonstrate state-of-the-art performance. The experimental results validate the effectiveness of the proposed method, significantly improving the generalization and segmentation performance of few-shot semantic segmentation models on novel categories.

Abstract:
Autonomous aerial vehicles (AAVs) have garnered significant attention due to their operational flexibility, enabling expanded application scenarios across diverse fields. The Text-Based Pedestrian Retrieval (TBPR) task aims to identify corresponding images from textual descriptions, yet existing research has primarily focused on ground-level views. To broaden the applicability of TBPR systems, we introduce aerial-view analysis and propose a novel Text-Based Aerial Pedestrian Retrieval (TBAPR) task. This task introduces unique challenges, particularly the dual gaps in cross-view (aerial vs. ground) and cross-modal (text vs. image) matching, which are more complex than traditional TBPR or aerial-ground pedestrian understanding tasks. To address these challenges, we propose an Adaptive Elastic Alignment Network with FIne-Grained Representation Mining (AEA-FIRM). Our framework tackles the cross-view gap through an AEA loss that adaptively prioritizes critical semantic features while dynamically aligning textual and aerial semantics under challenging conditions. Concurrently, the FIRM module refines visual-linguistic representations by mining fine-grained pedestrian attributes and explicitly textualizing them for cross-modal matching verification. Extensive experiments demonstrate that AEA-FIRM achieves state-of-the-art performance, outperforming existing TBPR methods by 4.87% in Rank-1 accuracy. Our code and dataset are available at https://github.com/xbdxwyh/AEA-FIRM-main.git

Abstract:
Action detection in untrimmed, densely annotated video datasets is a challenging task due to the presence of composite actions and co-occurring actions in videos. To facilitate action detection in such intricate scenarios, leveraging ample prior information from the data and comprehending the context of actions in the video are the two most important cues. Specifically, the co-occurrence probability of actions can effectively capture the temporal relationships and associations among actions, aiding the model in recognizing multiple actions occurring simultaneously. Additionally, aggregating action information from different levels of the data into a comprehensive graph and describing human actions from various semantic layers can significantly reduce ambiguities in action detection. Based on this, a novel knowledge graph, Hierarchical Augmented Knowledge Graph for human behaviour (HAhb-KG), is proposed, which brings together action-related prior knowledge on different levels into a unified hierarchical graph. The graph describes human behaviour from various semantic aspects by defining diversified graph nodes, and augments the nodes and relationships with corresponding images and probability of co-occurrence respectively, to introduce textual modality information and weigh the associations between actions. In order to mine the knowledge related to the input video in the knowledge graph, an HAhb-KG-oriented knowledge understanding framework is proposed to embed multi-modal knowledge as a valuable supplement to visual information. Incorporating this framework, a cross-modal learning action detection model is designed to achieve high accuracy in action detection tasks, which validates the effectiveness of HAhb-KG. Our method achieves gains of 1.45 mAP and 2.28 mAP in action detection experiments on the Charades and TSU datasets, respectively, showing that the proposed method outperforms existing knowledge-based action detection methods.

Abstract:
LiDAR point cloud semantic segmentation is a fundamental task in 3D perception, essential for applications like autonomous driving and mobile robotics. However, the inherent sparsity and irregular distribution of point cloud data pose significant challenges to achieving high segmentation accuracy. To address these issues, we propose a Multi-scale Dilated Spatial and Local Channel Attention (MDSLCA) network, which integrates sparse convolutional operations with advanced attention mechanisms for accurate and efficient 3D semantic segmentation. The MDSLCA network features three core components: the Heterogeneous Convolution Skip Connection (HCSC) architecture, which bridges the semantic gap between the encoder and decoder; the Multi-Scale Local Attention (MSLA) module, which enhances focus on salient spatial regions and important feature channels while suppressing irrelevant information; and the Multi-Scale Dilated Spatial Attention (MSDSA) module, which improves spatial feature learning through multi-scale dilated convolutions that effectively capture long-range spatial dependencies in point cloud data. Extensive experiments conducted on the SemanticKITTI dataset demonstrate that our MDSLCA achieves a mean Intersection-over-Union (mIoU) of 73.7%, surpassing current global modeling approaches such as Point Transformer and achieving a new state-of-the-art (SOTA) performance. These results validate the effectiveness of our method in capturing fine-grained local features and modeling high-level semantic context, demonstrating its strong capability to handle the inherent challenges of point cloud data. Our project is publicly available at https://github.com/jinzhengguang/MDSLCA

Abstract:
Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
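
The core idea of keeping tokens that score well on both saliency and instruction relevance can be sketched as a flat score-based top-k selection. This is a simplification, not PTP itself: the abstract describes a hierarchical region-then-token procedure whose scoring rule is not given. The weighted sum, the `alpha` parameter, and the function name `prune_tokens` are all assumptions made for illustration.

```python
def prune_tokens(saliency, relevance, keep, alpha=0.5):
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    saliency:  per-token bottom-up saliency scores (assumed to lie in [0, 1])
    relevance: per-token top-down instruction relevance (assumed to lie in [0, 1])
    alpha:     hypothetical mixing weight between the two signals
    """
    scores = [alpha * s + (1.0 - alpha) * r for s, r in zip(saliency, relevance)]
    # Rank by score (highest first); ties fall back to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:keep])  # restore positional order for the model


kept = prune_tokens(
    saliency=[0.9, 0.1, 0.5, 0.3],
    relevance=[0.1, 0.8, 0.5, 0.2],
    keep=2,
)
```

Because the selection needs only scores and a sort, a rule of this kind adds no trainable parameters, which is consistent with the training-free claim.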

Affiliations: Guangdong Provincial Key Laboratory of Intelligent Information Processing, and Shenzhen Key Laboratory of Media Security, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; School of Cyberspace, Hangzhou Dianzi University, Hangzhou, China; School of Cyber Science and Engineering, University of International Relations, Beijing, China; School of Software, Nanchang University, Nanchang, China; Institute of Information Science, Beijing Jiaotong University, Beijing, China

Abstract:
Reversible data hiding (RDH) for JPEG images remains relatively underexplored, with key challenges lying in coefficient selection and modification strategies. Existing methods select coefficients for embedding through block-wise or frequency-band-based operations, resulting in coarse-grained decisions that constrain embedding performance. In this paper, a novel RDH scheme for JPEG images based on gap-driven histogram generation with coefficient-wise selection is proposed. First, a multi-metric weighted complexity and coefficient-wise selection approach is proposed, integrating four local feature criteria to assess each coefficient individually, enabling more precise per-coefficient selection. Then, a gap-driven adaptive multi-histogram generation strategy is introduced, leveraging gap pairs to minimize shifting distortion by segmenting histograms via bisection and avoiding modifications to high-magnitude coefficients. Experimental results confirm that the proposed method achieves improved visual quality and more efficient file size control compared to existing state-of-the-art approaches.

Abstract:
As an effective feature extractor, Vision Transformer (ViT) has been widely applied to both image classification and object tracking tasks. In this paper, we revisit and enhance the classic Data-efficient image Transformer (DeiT) for these two tasks. The DeiT is optimized step-by-step across different modules, including its patch stem, position embedding, and the development of efficient linear attention mechanisms. To address the performance degradation of linear attention, we propose Shift Expansion Linear Attention (SELA) which generates new heads with rich feature diversity through a simple but efficient cyclic shift operation. Additionally, SELA similarity minimization is added to cross-entropy loss to further enhance feature diversity. Based on these improvements, we develop SELA-ViT for image classification and further build SELA-Track for object tracking. With comparable model size and speed, SELA-ViT-T achieves a +4.8% improvement in Top-1 accuracy over DeiT-T on ImageNet-1K and establishes a new state-of-the-art performance among linear attention methods. Furthermore, we validate SELA-ViT on five small datasets. On four benchmark object tracking datasets, SELA-Track exhibits improved tracking performance. The code and models are available at: https://github.com/saizhou777/SELA-ViT
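
As an illustration of how a cyclic shift can produce extra heads at almost no cost, the sketch below rolls the channel vector of each head by a fixed offset and appends the shifted copies. The abstract does not state which axis SELA shifts or by how much, so the channel axis, the offset of 1, and the helper names are assumptions rather than the authors' design.

```python
def cyclic_shift(channels, offset):
    """Rotate a list to the right by `offset` positions, wrapping around."""
    offset %= len(channels)
    if offset == 0:
        return list(channels)
    return channels[-offset:] + channels[:-offset]


def expand_heads(heads, offset=1):
    """Return the original heads followed by one shifted copy of each.

    heads: list of per-head channel vectors (lists of numbers).
    The shifted copies reuse the same values in a different channel
    alignment, so no new parameters are introduced.
    """
    return [list(h) for h in heads] + [cyclic_shift(h, offset) for h in heads]


expanded = expand_heads([[1, 2, 3, 4], [5, 6, 7, 8]])
```

Here two heads become four, which is the sense in which a shift operation can add feature diversity without extra weights.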

Abstract:
Nowadays 4D millimeter-wave radars can generate LiDAR-equivalent 3D point clouds with superior cost efficiency and enhanced stability in harsh environments, and have found wide applications in 3D object detection. However, existing 4D radar-based object detection methods may overlook the detrimental effects of information loss during feature processing and fail to effectively leverage the velocity information of 4D radars. These limitations hinder the full exploitation of 4D radar’s potential. To resolve these issues, we propose a dual-representation network with motion-aware augmentation, named DR-Net. Specifically, to compensate for the information loss, we propose a dual-representation encoder (DRE) and a sampling fusion backbone (SFB). The encoder extracts radar features from both the pillar and the point perspectives, exploiting the complementarity between pillar-level structural context and point-level fine-grained details. The backbone fuses features of the pillar representation and the point representation, effectively mitigating the feature-level information loss. Instead of simply taking the velocity information as an additional input, we design a motion-aware augmentation (MAA) module to augment 4D radar point cloud data from the perspectives of both raw point cloud features and training instances. Finally, we extend the original 3D detection head by incorporating an additional velocity supervision branch to enhance the capability of perceiving both static and dynamic objects. We conduct comprehensive experiments on the View-of-Delft (VoD) and TJ4DRadSet datasets. Experimental results reveal that, compared with some state-of-the-art 4D radar-based approaches, our DR-Net achieves significant performance advancement.

Abstract:
Sparse-view computed tomography (SVCT) offers the advantages of accelerated scanning and reduced X-ray radiation dose in different clinical applications. However, it faces a challenge due to incomplete data acquisition, resulting in streak artifacts in the analytically reconstructed CT images. Utilizing self-supervised learning, implicit neural representation (INR) has recently shown great promise in addressing inverse problems such as SVCT reconstruction. Nonetheless, given that the input of the original INR only contains coordinate information, it is limited to representing one SVCT instance at a time, and its performance significantly declines when performing cross-instance reconstruction. In this study, we propose a novel self-supervised framework named VFMINR, which leverages generalizable representations extracted from the visual foundation models (VFMs) to tackle the cross-instance reconstruction issue of INR. Specifically, VFMINR first utilizes VFMs to effectively capture the spatial and frequency domain representations of sinograms, and then a fusion module is applied to fuse the two domain features into complementary representations. This combination maximizes the utilization of local detail information from the spatial domain and the global structural information from the frequency domain. Subsequently, an adaptive cell decoding strategy is designed to map representations into variable-resolution hybrid feature grids, which are integrated into the learning of the INR to enhance its generalizability for different SVCT instances. The VFMs and VFMINR are trained by using only SV sinogram data, and extensive results confirm that the proposed method can effectively handle the generalization problem of INR, while achieving superior performance in image fidelity and artifact suppression. The code is available at: https://github.com/nightastars/VFMINR-main

Abstract:
As telemedicine continues to expand, the frequent transmission and sharing of medical images over networks has become central to remote diagnostics. However, these images often contain sensitive lesion information. Once they are illicitly obtained during transmission or storage, they can be misused to train malicious segmentation models, resulting in serious patient privacy violations. While encryption and steganography provide basic protection, encrypted content may draw adversarial attention, and some stego-images may be exposed due to abnormal formats or semantics. More critically, most existing methods lack traceability, making it difficult to identify the source once a privacy breach occurs. To address these challenges, we present a traceable robust adversarial watermarking model that acts as an invisible shield to protect medical image privacy in telemedicine scenarios. This model seamlessly integrates invisibility, privacy protection, and traceability into a unified watermark embedding framework, enabling proactive defense against segmentation-based attacks while maintaining diagnostic quality. Specifically, receiver identity information is embedded into medical images through adversarial perturbations. These perturbations suppress lesion extraction by attackers while allowing reliable identity decoding in case of data leakage. Experimental results show that the model achieves strong privacy protection on various polyp and ISIC datasets. Moreover, the model maintains reliable ID extraction under different noise perturbations, validating its robustness and traceability. Visual quality assessments further confirm the invisibility of the embedded watermark, ensuring that diagnostic usability remains unaffected. This provides a promising direction for safeguarding medical image privacy in telemedicine environments.

Abstract:
A large receptive field is crucial for the video frame interpolation (VFI) task. Existing video frame interpolation methods struggle with large motions due to their limited receptive fields. However, simply expanding the receptive field brings two challenges: a substantial computational burden and potential loss of texture details. In this paper, we first propose a novel spatial-temporal global window self-attention mechanism with an enlarged receptive field to enhance motion capture. Furthermore, to reduce the computational complexity introduced by the global window, we design a simple and effective separable fence window decomposition. Meanwhile, to better synthesize high-quality intermediate frames, we propose two complementary frame synthesis strategies. First, from the perspective of receptive field design, we introduce a progressive receptive field focusing module, enabling a smooth transition from global motion modeling to local detail preservation. Second, based on the VFI-specific property and the high structural similarity shared by the adjacent frames, we propose a structure-aware synthesis strategy, which incorporates structural priors to guide the generation of fine details. Subjective and objective experimental results demonstrate that our method effectively captures large motions while synthesizing texture details, outperforming state-of-the-art techniques on various datasets.

Abstract:
Existing reversible data hiding methods in encrypted images (RDH-EI) are primarily designed for point-to-point scenarios and not suitable for multiparty communication. To address this issue, a novel RDH-EI scheme based on multi-key (k,n)-threshold decryption (RDH-EITD) is proposed, in which a dual-phase embedding method is designed to support authentication for both the content owner and central server. In RDH-EITD, the private key of Paillier is split into n shares, which are then allocated to n distributed receivers. The image is first encrypted and then embedded with additional data to generate the marked ciphertext, which is uploaded to the data server. The server can perform the second-phase embedding. The dual-marked ciphertext is then distributed to n receivers. Each receiver can extract the second-phase embedded data but cannot decrypt the image individually. Only when k out of n receivers submit their partially decrypted results can the image be decrypted. Then the first-phase embedded data can be extracted and the original image can be recovered losslessly. Security reduction is employed to formally prove the semantic security of RDH-EITD. Experimental results demonstrate that RDH-EITD preserves (k,n)-threshold decryption, enabling resistance against k-1 collusion attacks and tolerance of n-k failures. The dual-phase embedding achieves embedding rates of $\lambda - 9$ bpp (bits per pixel) and 14 bpp respectively with security parameter $\lambda$. For an image of length L in one-to-n communication scenarios, the time complexity of RDH-EITD reaches $O(L(\lambda^3 + k^2))$, outperforming existing Paillier-based point-to-point solutions.
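
The (k,n)-threshold property (any k shares suffice, fewer do not) can be illustrated with Shamir secret sharing over a prime field. This is a stand-in for intuition only: the abstract does not say how the Paillier private key is split, and the sketch performs no Paillier encryption or partial decryption. The prime, the function names, and the example secret are assumptions.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, large enough for this toy example


def split_secret(secret, k, n):
    """Split `secret` into n shares so that any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = split_secret(123456789, k=3, n=5)
recovered = recover_secret(shares[:3])
```

With k = 3 and n = 5, any three shares reconstruct the secret, two reveal nothing useful, and up to n - k = 2 receivers may be unavailable, mirroring the collusion resistance and failure tolerance stated above.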

Abstract:
Small object detection is a significant and challenging research area in aerospace. Small objects often face issues like background interference and inter-class similarity due to their small size and low boundary contrast in complex environments. Physiological studies indicate that the visual pathway’s neuronal mechanisms can effectively extract features such as contours, shapes, and colors, filter out background noise and thus recognize complex forms. Therefore, this paper proposes a remote sensing small object detection model inspired by neuronal mechanisms in visual pathways, called RSVDet. RSVDet simulates the information transmission of the ventral visual pathway (Retina-LGN-V1-V2-V4-IT) and meticulously models the involved visual areas. First, inspired by the “Retina-LGN-V1” pathway, we designed a feature enhancement module to capture low-level information. Second, based on the global-local receptive field mechanism of V2 neurons, we developed a feature extraction module for shape information. Additionally, inspired by the self-regulation mechanisms of V4 neurons, we created a self-feedback attention module to filter background noise. Finally, drawing from the orientation selectivity of IT neurons, we designed a hierarchical modulation detection head module to extract complex shape features. RSVDet achieves an AP50 of 50.1% on the VisDrone dataset with 1.72M parameters, achieving the best performance among lightweight models. The code for RSVDet can be found at: https://github.com/wxz0426/RSVDet/tree/main

Abstract:
As a critical preprocessing step, hyperspectral image (HSI) denoising aims to improve the HSI quality for subsequent applications. While unsupervised HSI denoising methods based on Deep Image Prior (DIP) have garnered attention due to their pre-training-free advantage, existing DIP-based approaches typically utilize the $L_2$-norm as data fidelity, making them ineffective in handling complex mixed noise. Moreover, such unsupervised methods only focus on spatial domain priors, lacking a comprehensive characterization of the spatial-spectral correlations inherent in HSIs. To tackle these limitations, we propose a robust deep recovery (RDR) model for HSI denoising with a spatial-spectral total generalized variation (SSTGV) prior. Specifically, the truncated-Cauchy loss function is adopted to suppress the interference of outliers and enhance the robustness against sparse noise. Moreover, the SSTGV prior is integrated into the unsupervised RDR model, resulting in a complementary effect of the deep prior and the handcrafted prior. To solve the resulting optimization problem, an efficient ADMM algorithm is developed with a convergence guarantee. Experimental results demonstrate the significant advantages of our approach in both noise suppression and detail preservation, highlighting its robustness and adaptability for varied HSI denoising applications.
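
To show why a truncated-Cauchy loss is less sensitive to outliers than a squared error, the sketch below uses one common form, log(1 + min(|r|, tau)^2 / gamma^2), where r is the residual. The abstract does not give the exact definition used in the RDR model, so this form, the parameter names `gamma` and `tau`, and their default values are assumptions.

```python
import math


def truncated_cauchy(residual, gamma=1.0, tau=3.0):
    """One common truncated-Cauchy form (assumed, not taken from the paper).

    The residual magnitude is clipped at `tau` before the Cauchy penalty
    is applied, so very large residuals all incur the same bounded cost.
    """
    clipped = min(abs(residual), tau)
    return math.log(1.0 + (clipped / gamma) ** 2)


def squared_error(residual):
    """Plain L2 data-fidelity term, shown for comparison."""
    return residual ** 2


small = truncated_cauchy(0.5)
outlier = truncated_cauchy(100.0)   # same cost as a residual of exactly tau
outlier_l2 = squared_error(100.0)   # grows without bound
```

A sparse-noise pixel with a residual of 100 costs 10,000 under the squared error but only log(10) under this loss, so it cannot dominate the fit.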

Abstract:
To deal with the curse of dimensionality in hyperspectral images, numerous feature dimensionality reduction (FDR) methods have been proposed to map high-dimensional data into a low-dimensional subspace. However, most existing FDR methods lack robustness against noise corruption. To this end, representation-based subspace learning has been developed to find a robust projection matrix for FDR. Nevertheless, most such methods consider only a single direction of the matrix, ignoring the information from other directions. Moreover, the majority of existing methods fail to account for both global structure and feature correlations effectively. To address the above problems, we propose a novel robust projection learning method called latent low-rank embedding (LatLRE), which integrates the latent low-rank representation (LatLRR) with projection learning. In particular, the proposed model can maintain the strong robustness of LatLRR and simultaneously learn a projection for FDR. Moreover, the nuclear norm and logarithmic norm are employed to approximate the two underlying rank functions and provide a more accurate measure of correlation. In addition, LatLRE is optimized using the alternating direction method of multipliers (ADMM) algorithm with a theoretical convergence guarantee. To verify the FDR performance of LatLRE, extensive experiments are conducted on three benchmark hyperspectral datasets. The experimental results demonstrate that LatLRE outperforms other FDR methods considered in this paper.

Affiliations: School of Computer Science, Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing, China; School of Computer and Software Engineering, Huaiyin Institute of Technology, Huaian, China; State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China; School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; School of Cyber Security, Qilu University of Technology, Jinan, Shandong, China; College of Cryptology and Cyber Science, Nankai University, Tianjin, China

Abstract:
The manipulation history of Joint Photographic Experts Group (JPEG) compression plays an important role in JPEG image forensics and information hiding. For non-aligned recompressed images, different cropping methods produce non-aligned outputs with varying feature distributions. One such important factor is the shifts of the discrete cosine transform (DCT) grid (i.e., the misalignment parameters) between two compression processes. Although many methods have been proposed to estimate the misalignment parameters, the limited amount of useful information available in small-sized images leads to low accuracy of these methods. To enhance the accuracy of misalignment parameter estimation for small-sized non-aligned images, we propose a novel two-branch network structure that accounts for the unique horizontal and vertical characteristics of non-aligned images. This structure employs convolution to simulate second-order difference (SOD) and incorporates it throughout the training process to optimize the difference parameters dynamically. Based on the insight that cropping operations leave traces in all color channels, we derive the Cg channel through a color space transformation. This approach expands the input dimensionality to four channels (Y, Cb, Cr, and Cg), thereby compensating for the information scarcity in small-sized images. The experimental results show that our method outperforms existing methods on different image sizes, regardless of the known or unknown quality factor (QF) of the first compression. Finally, we propose a re-cropping framework based on the estimated misalignment parameters. The influence of the first cropping is counteracted by a re-cropping operation, which improves the accuracy of existing methods in estimating the first quantization step for non-aligned recompressed images.
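
The statement that convolution can simulate a second-order difference (SOD) follows from the fact that the SOD of a sequence equals its correlation with the kernel [1, -2, 1]. The sketch below shows this in one dimension with a fixed kernel. In the paper the kernel weights are optimized during training and separate horizontal and vertical branches are used, neither of which is modelled here; the function name is hypothetical.

```python
def second_order_difference(signal, kernel=(1.0, -2.0, 1.0)):
    """Valid-mode 1D correlation of `signal` with `kernel`.

    With the default kernel, each output is
    signal[i] - 2 * signal[i + 1] + signal[i + 2],
    the discrete second-order difference. A trainable layer would start
    from these weights and update them (not done in this sketch).
    """
    width = len(kernel)
    out = []
    for i in range(len(signal) - width + 1):
        out.append(sum(w * signal[i + j] for j, w in enumerate(kernel)))
    return out


# Squares have a constant second difference of 2.
sod = second_order_difference([0, 1, 4, 9, 16])
```

Expressing the filter as a kernel is what lets the network treat the difference parameters as ordinary weights.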

Abstract:
As a cloud-hosted virtual desktop service, cloud desktop supports various fields such as telecommuting and collaborative development, while enabling real-time user interaction through video streams. The stability of this process is determined by bandwidth, which significantly influences the user experience. Therefore, precise bandwidth prediction of video streams is essential in cloud desktops. This work proposes DeskPred for video stream transmission in cloud desktops, focusing on dynamic bandwidth prediction. In the startup stage, the limited amount of data makes precise bandwidth prediction challenging. We propose an Affinity-based Federated Learning algorithm, which leverages the historical records of high-affinity users for assisted training while protecting user privacy. During the long-term adjustment stage, we propose a Fluctuation-based Adaptive Incremental Prediction algorithm for independent training to address the issue of pattern forgetting. The algorithm considers both periodic features and instantaneous features, incorporating new patterns while revisiting previous knowledge through the memory module and Adversarial Elastic Weight Consolidation. We have verified DeskPred through an actual cloud desktop project supported by Lenovo Research. Through experiments conducted on a total of over 18 million data items (approximately 10 GB), DeskPred achieves the highest total score of 71.11%, making it highly suitable for cloud desktop environments.

Abstract:
Weakly-supervised point cloud semantic segmentation (WS-PCS) has attracted increasing attention due to the challenge of sparse annotations. A central problem is how to effectively extract informative features from the annotated points, enabling reliable supervision. Although many existing works extend 2D graph convolution to 3D point cloud data, 2D convolution inherently assumes feature localization, which is an assumption that does not hold in point clouds, and lacks consistent semantic offsets. To address this, we propose a novel Bilateral Graph Convolutional (BGC) method, which refines graph edges into two categories: regular edges and offset edges, providing improved guidance for WS-PCS. Firstly, we create the Local Bilateral Relations (LBR) module to learn the relational features of edges in local point cloud graphs, encompassing both regular and offset edges. To the best of our knowledge, we are the first to utilize offset edges to capture irregular semantic offsets in point cloud data. Secondly, we propose the Adaptive Pooling (AP) module, which adaptively pools edge information learned from LBR, enhancing the feature characterization ability by incorporating salient and pervasive features. Finally, we design BGC as BGC-Net and evaluate its performance against recent networks on four datasets, achieving state-of-the-art results.

Abstract:
Skeleton-based action recognition with GCNs or Transformers is becoming a representative approach to video motion recognition. However, GCN-based works suffer from an overall structure with low learning capacity, over-smoothing of dynamic graphs, and limited temporal learning with fixed receptive fields. Transformer-based works are encumbered by high computational cost, a lack of artificial priors, and over-aggregation of shallow temporal features. Thus, we propose TDSN-GCN with a Transformerify architecture, Decaying Static Graph Embedding, and a NAS-guided temporal receptive field strategy. First, we construct the GCN architecture following the Transformer style with the info-decreased staged strategy, effectively raising learning capacity. Then, we theorize that the spatial graph matrix over-smooths by row as the depth increases. To address this issue, a decaying static topology embedding with multi-topological hypergraphs is proposed with effective artificial priors. Finally, we design a NAS with the linear interpolation expansion receptive field search to explore the temporal receptive field preferences in depth. With the guidance of NAS, the temporal receptive field stage expansion strategy is proposed. Extensive experiments show that TDSN-GCN achieves the highest single-stream accuracy and state-of-the-art accuracy in 2-stream and 4-stream fusion compared to previous work with more streams. The code is available at https://github.com/vvhj/TDSN-GCN

Abstract:
Multimodal remote sensing (RS) images exhibit distinct structure and distribution characteristics, making it challenging to design an effective multimodal RS image classification algorithm. Moreover, although existing deep learning-based methods have become the dominant choice in multimodal RS image classification, they usually lack effective exploration and explicit integration of generative information from different modalities. To address these challenges, a generative information-guided heterogeneous cross-fusion network with contrastive learning (GIHCN) is proposed for multimodal RS image classification. Firstly, to simulate the land-cover distributions from different modal data, a multimodal generative information learning architecture (MGILA) is constructed to capture the unsupervised heterogeneous distribution features. Secondly, to achieve bidirectional modeling between heterogeneous data and the reconstructed land-cover distributions, a heterogeneous data & generative information cross-attention module (HGCM) is designed to explore the complementarity between multimodal data and the reconstructed land-cover distributions. HGCM can provide the heterogeneous generative information for current modal data or provide the heterogeneous data support for current modal generative information, thereby obtaining cross-fusion sources with different attributes. Furthermore, we achieve effective feature extraction for different cross-fusion sources with a designed multimodal contrastive learning framework (MCLF). Notably, to capture local information and long-range dependencies, a hybrid classification network with convolutional neural network and Mamba (CMNet) is proposed as the feature extraction backbone of each cross-fusion source to further improve the classification performance. Finally, we construct a joint multimodality loss function for MCLF, which can reduce the distribution difference between modalities while focusing on the information flow within and across modalities. Experimental results on four multimodal RS datasets confirm the effectiveness of GIHCN compared with other state-of-the-art methods. The source code will be released at https://github.com/ZJier/GIHCN

Abstract:
Medical Visual Question Answering (Medical VQA) is an essential task that facilitates the automated interpretation of complex clinical imagery with corresponding textual questions, thereby supporting both clinicians and patients in making informed medical decisions. With the rapid progress of Vision-Language Pretraining (VLP) in general domains, the development of medical VLP models has emerged as a rapidly growing interdisciplinary area at the intersection of artificial intelligence (AI) and healthcare. However, few works evaluate the adversarial robustness of medical VLP models, a task that faces two primary challenges: 1) the complexity of medical texts, stemming from specialized terminology, makes it difficult for models to comprehend the text when crafting adversarial attacks; 2) the diversity of medical images, arising from the variety of anatomical regions depicted, requires models to determine the critical anatomical regions to attack. In this paper, we propose a novel multimodal adversarial attack generator for evaluating the robustness of medical VLP models. Specifically, to handle the complexity of medical texts, we integrate medical knowledge when crafting text adversarial samples, which improves both terminology understanding and adversarial strength; to handle the diversity of medical images, we divide the anatomical regions in medical images into global and local regions, whose perturbations are balanced by learned weights. Our experimental study not only provides a quantitative understanding of medical VLP models, but also underscores the critical need for thorough safety evaluations before deploying them in real-world medical applications.

Abstract:
Major depressive disorder (MDD) is projected to become one of the leading mental disorders by 2030. Audiovisual cues have garnered significant attention in depression recognition research owing to their non-invasive acquisition and rich emotional expressiveness. However, conventional centralized training paradigms raise substantial privacy concerns for individuals with depression and are further hindered by data heterogeneity and label inconsistency across datasets. To overcome these challenges, a hybrid architecture, termed Federated Domain Adversarial with Attention Mechanism (FedDAAM), is proposed for privacy-preserving multimodal depression assessment. FedDAAM introduces a mechanism that separates discriminative features into depression-public and depression-private features. Specifically, to extract visual depression-private features from the AVEC2013 and AVEC2014 datasets, a local attention-aware (LAA) architecture is developed. For the depression-public features, action units (AUs), landmarks, head poses, and eye-gaze features are adopted. In addition, to account for the transferability and performance of individual clients, a dynamic parameter aggregation mechanism, termed FedDyA, is proposed. Extensive validations are performed on the AVEC2013, AVEC2014 and AVEC2017 databases, resulting in root mean square error (RMSE) and mean absolute error (MAE) of 8.61/6.78, 8.59/6.77, and 4.71/3.68, respectively. More importantly, to the best of our knowledge, this is the first study to apply federated learning (FL) to multimodal depression assessment. The proposed framework offers a novel solution for privacy-aware, distributed clinical diagnosis of depression. Code will be available at: https://github.com/helang818/FedDAAM/

Abstract:
Modern millimeter-wave (mmWave) radar heat map-based object detection techniques have shown significant potential. However, radar heat maps still pose challenges for fine-grained object differentiation, and fusing camera maps with radar heat maps remains challenging. This paper proposes a multi-sensor radar object segmentation (ROS) method that fuses radar range-angle (RA) heat maps with camera RGB maps, leveraging the advantages of camera sensors for extracting semantic information and mmWave radar for extracting object position information. A method for mapping RGB to the RA coordinate system (RMTR) is proposed, eliminating the need for additional training branches and sensor calibration. A Transformer is utilized to predict the range and angle matrices of each object separately from a global perspective. These matrices then undergo an element-wise multiplication (Hadamard product) operation to generate the final pseudo-RA map. For the fusion method, this paper introduces the cross-attention fusion module (CAF), which uses the cross-attention mechanism to achieve more efficient fusion by treating the pseudo-RA feature map as the Query and the RA feature map as the Key and Value. Extensive experiments show that the proposed method achieves state-of-the-art (SOTA) performance on the CRUW and CARRADA datasets, with fewer parameters and reduced computational complexity. Our code for RC-ROSNet is available at https://github.com/Zhuanglong2/RC-ROSNet
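
The two operations named above, the Hadamard product that forms the pseudo-RA map and the cross-attention with the pseudo-RA features as Query and the RA features as Key and Value, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the RC-ROSNet implementation: the way the range and angle matrices are built from 1-D profiles, the absence of learned projections, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_ra_map(range_mat, angle_mat):
    """Element-wise (Hadamard) product of equally shaped range and angle matrices."""
    return range_mat * angle_mat

def cross_attention(query_feats, kv_feats):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ kv_feats, weights

n_range, n_angle, dim = 4, 5, 8

# Assumed toy predictions for one object: a range profile and an angle profile,
# each broadcast to a (range, angle) matrix before the Hadamard product.
range_profile = np.array([0.0, 1.0, 0.5, 0.0])
angle_profile = np.array([0.0, 0.0, 1.0, 0.5, 0.0])
range_mat = np.repeat(range_profile[:, None], n_angle, axis=1)
angle_mat = np.repeat(angle_profile[None, :], n_range, axis=0)
pseudo = pseudo_ra_map(range_mat, angle_mat)  # peak at range bin 1, angle bin 2

# Random stand-ins for flattened feature tokens (one token per RA cell).
rng = np.random.default_rng(0)
pseudo_tokens = rng.normal(size=(n_range * n_angle, dim))  # Query source
ra_tokens = rng.normal(size=(n_range * n_angle, dim))      # Key and Value source
fused, weights = cross_attention(pseudo_tokens, ra_tokens)
```

Each row of `weights` sums to one, so every pseudo-RA token receives a convex combination of radar RA tokens.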

Abstract:
Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

Abstract:
The Picture-Wise Just Noticeable Difference (PW-JND) represents the visibility threshold of human vision when viewing distorted images. The PW-JND plays an important role in perceptual image processing and compression. However, predicting the PW-JND is challenging due to its dependence on image content, viewing conditions, and the viewer. In this paper, we propose a visual perception-assisted deep PW-JND (VP-JND) prediction model for image compression that combines data-driven methods with the perceptual mechanisms of human vision. First, we identify a correlation between PW-JND and conventional pixel-wise JND. Based on this observation, we design the VP-JND model, consisting of a pixel-wise JND model, a deep binary classifier (VP-JNDnet) and a binary block search algorithm for refining predictions. VP-JNDnet exploits the pixel-wise JND map of the original image to predict whether a compressed image is perceptually lossless. In addition, the model incorporates visual importance of content and regions by using a mixed attention module and calculating perceptual loss during training. Experimental results show that VP-JND achieved an average precision of 94.82% and a mean absolute difference of 3.92 in predicting the JPEG quality factor corresponding to the PW-JND on the MCL-JCI dataset, outperforming state-of-the-art JND models. When applied to perceptual lossless image coding, the predicted PW-JND enabled average bit rate savings of 89.35% for JPEG compression on MCL-JCI and 85.46%/41.13% for JPEG/BPG compression on KonJND-1k. These savings were relative to images compressed at the lowest distortion level. The source codes and trained models are publicly available at https://github.com/SYSU-Video/VP-JND
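
To illustrate how a binary "perceptually lossless or not" classifier can be turned into a quality factor prediction, the sketch below runs an ordinary binary search over JPEG quality factors. It is not the paper's binary block search algorithm, whose details the abstract does not give. It assumes the classifier output is monotonic in the quality factor, and `toy_classifier` is a hypothetical stand-in for a trained model such as VP-JNDnet.

```python
from typing import Callable

def lowest_lossless_qf(is_lossless: Callable[[int], bool], lo: int = 1, hi: int = 100) -> int:
    """Return the smallest QF in [lo, hi] for which is_lossless(QF) is True.

    Assumes monotonicity: if quality factor q is judged lossless, so is every
    quality factor above q. Returns hi + 1 if no QF in the range qualifies.
    """
    result = hi + 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_lossless(mid):
            result = mid   # mid works; look for a smaller QF
            hi = mid - 1
        else:
            lo = mid + 1   # mid is visibly distorted; search higher QFs
    return result

calls = []

def toy_classifier(qf: int) -> bool:
    """Hypothetical classifier: distortion becomes invisible from QF 37 upward."""
    calls.append(qf)
    return qf >= 37

found = lowest_lossless_qf(toy_classifier)
```

With 100 candidate quality factors, the search needs at most seven classifier evaluations instead of one per candidate.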

Abstract:
The goal of image fusion is to generate a new image that incorporates all the high-quality features from source images, such as the salient features of infrared images and the texture details of visible images. Existing image fusion methods primarily focus on deep feature extraction, often neglecting the importance of shallow features. Furthermore, fusion strategies that rely heavily on human intervention can lead to the loss of feature information and limit deep learning performance. To address these issues, we propose an adaptive multi-scale fusion network based on attention mechanisms, named AMSFusion. The autoencoder component of our method is based on the U-Net architecture, with different modules designed to process shallow and deep multi-scale features, thereby improving feature processing quality while maintaining computational efficiency. The fusion network component employs a channel-spatial attention block (CSA) to assign appropriate weights to heterogeneous features and applies a shifted window attention and convolution mix transformer (SWACmixT) to fuse them, enhancing the network’s ability to adaptively fuse features at different scales. In our implementation, the autoencoder and fusion networks are trained separately, with distinct loss functions for each, further enhancing the method’s capabilities in feature encoding, decoding, and fusion. Qualitative and quantitative experiments on multiple datasets demonstrate the superiority of our method compared to state-of-the-art algorithms. Additionally, experiments on object detection tasks validate that our method effectively promotes high-level computer vision tasks.

Abstract:
Underwater polarization imaging is a promising method for sensing the ocean. However, existing methods primarily focus on restoring intensity images while neglecting the retrieval of polarization information itself. As an additional information dimension, polarization can significantly enhance imaging performance in various underwater scenarios, such as target detection and autonomous navigation. This paper integrates polarization information into a modified dual-stream diffusion model for the joint restoration of intensity and polarization information in turbid underwater conditions. To bridge the dual diffusion processes, a descattering network with polarization restoration sub-networks is designed, which intuitively takes polarization information as input and fully exploits informative polarization features. Polarization correlation is adopted as a useful prior to recalibrate the polarization features of the sub-networks, which is achieved by polarization hybrid attention blocks. Additionally, a specially designed loss function that considers polarization criteria enables the diffusive image restoration process to focus on descattering-related polarization features and improve the accuracy of the restored polarization information. Comprehensive experiments demonstrate that the proposed method outperforms state-of-the-art underwater imaging methods in both polarization image quality and polarization information restoration accuracy. Furthermore, sea trials validate its robust imaging performance under real marine conditions. Combined with several practical application cases, this work highlights the advantages of the restored polarization information for addressing real-world underwater tasks.

Abstract:
Recently, instruction-driven image editing methods have demonstrated promising capabilities, requiring only a brief text to guide image modifications. However, most of them yield suboptimal results for object editing in complex scenes, due to two major defects: 1) over-editing, where unintended regions of the image are inadvertently altered; 2) inability to precisely adhere to instructions, particularly in scenes with numerous elements. To resolve these issues, we propose a Single object Editing scheme, termed SoEdit, which distills complex editing tasks into single-object editing within cropped regions through a pipeline that integrates task parsing, object localization, editing, and context blending. This approach minimizes interference from irrelevant areas, ensures proper object size and placement, and ultimately enhances model performance. Furthermore, we introduce a lightweight Spatially-Adaptive Mixture of Experts (SAMoE) to better model spatial heterogeneity, enabling token-wise adaptive processing and further enhancing the overall editing capability with minimal additional parameters. Moreover, we introduce a large-scale object-centric dataset to further optimize the model. Extensive experiments demonstrate that SoEdit outperforms existing methods, especially in precise responses to fine-grained editing requirements, such as multi-action and quantity-sensitive object editing.

Abstract:
Pre-trained vision-language models have shown great potential in few-shot learning. However, existing methods typically employ either KL divergence or feature similarity-based knowledge distillation, and rarely integrate both. Our analysis reveals that a naive simultaneous deployment of these two strategies yields suboptimal results. To address this, we propose a unified dual knowledge distillation framework. This framework is grounded in a theoretical derivation of class-adaptive temperature parameters, effectively resolving the incompatibility between KL divergence and feature similarity approaches. Furthermore, we introduce a top-K feature perturbation technique that targets specific features for more consistent enhancement than traditional noise regularization. Experimental results across 11 diverse benchmarks show that our approach yields consistent performance gains over various baselines. Notably, it improves the harmonic mean (H) by 0.41% to 0.72% and enhances generalization to unseen classes with an accuracy boost of up to 1.41%. Our source code is available at: https://github.com/sydney72380/DKL
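
For reference, the two distillation signals contrasted above can be written as follows. This is a generic sketch: it uses a single scalar temperature, whereas the abstract describes class-adaptive temperatures whose derivation is not given here, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

def feature_similarity_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between paired student and teacher features."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float((1.0 - (s * t).sum(axis=-1)).mean())

# Toy batch of two samples with three classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[2.5, 1.5, 0.5], [0.5, 2.0, 0.3]])

kd = kl_distillation_loss(student, teacher)
fs = feature_similarity_loss(student, teacher)
```

The first term matches output distributions and the second matches directions in feature space, so naively summing them optimizes two different geometries at once, which is the kind of incompatibility the abstract sets out to resolve.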

Abstract:
Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.

Abstract:
Real-time endoscope localization is important for the navigation and automation of endoscopic diagnosis and minimally invasive surgery. However, traditional localization based on optical or magnetic tracking is easily affected by occlusion or electromagnetic interference, and its implementation is complicated. Meanwhile, transformation and correlation information in image pairs is still ignored in existing visual localization methods for endoscopy. In this work, a novel relative pose regression framework is proposed for relative pose estimation and absolute pose tracking of the endoscope based on endoscopic videos. Firstly, scene features and transformation features are extracted from endoscopic observations and the corresponding optical flow, respectively, by the proposed feature encoder based on gated convolution, which can prevent gradient vanishing when training the encoder from scratch on endoscopic data. Furthermore, a novel correlation module based on cross-attention is proposed to extract correlation features from two input images, which captures more key features, from local to global, in endoscopic frames with a limited field of view. Moreover, a novel pose decoder with upsampling and downsampling on the channel dimension is utilized to extract a richer representation from the concatenated feature map for relative transformation vector prediction. The proposed method outperforms state-of-the-art methods on datasets from nasal endoscopy and colonoscopy, with an average localization error of less than 5%. Further experiments also demonstrate the efficiency of the proposed method. The demo videos of visual localization can be found at https://endoloc.netlify.app/

Abstract:
Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of ‘complex scene’ itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist’s painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.

Abstract:
Deepfake technology has great potential in the field of media and entertainment, but it also brings serious risks, including privacy disclosure and identity fraud. To counter these threats, proactive forensic methods have become a research hotspot by embedding invisible watermark signals to build active protection schemes. However, existing methods are vulnerable to watermark destruction under malicious distortions, which leads to insufficient robustness. Moreover, embedding strong signals may degrade image quality, making it challenging to balance robustness and imperceptibility. Although watermarked images look natural, their underlying structures are often different from the original images, which is ignored by traditional watermarking methods. To address these issues, this paper proposes a proactive watermarking framework called WaveGuard, which explores frequency domain embedding and graph-based structural consistency optimization. In this framework, the watermark is embedded into the high-frequency sub-bands by dual-tree complex wavelet transform (DT-CWT) to enhance the robustness against distortions and deepfake forgeries. By leveraging joint sub-band correlations and selected sub-band combinations, the framework enables robust source tracing and semi-robust deepfake detection. To enhance imperceptibility, we propose a Structural Consistency Graph Neural Network (SC-GNN) that constructs graph representations of the original and watermarked images to ensure structural consistency and reduce perceptual artifacts. Experimental results show that the proposed method performs exceptionally well in face swap and face replay tasks. The code has been published at https://github.com/vpsg-research/WaveGuard

Abstract:
In a (k, n)-threshold secret image sharing (SIS) scheme, a secret image is encoded into n shadow images and distributed to the corresponding participants, enabling lossless reconstruction with any k correct shadow images. This inherent fault tolerance allows up to n - k shadow images to be lost or corrupted. However, in real-world scenarios, all shadow images are susceptible to malicious tampering, cropping, or noise during transmission and storage, making it difficult to guarantee the availability of k intact shadows. Robust secret image sharing (RSIS) schemes have been proposed to address this issue, yet existing methods often suffer significant degradation in reconstruction quality as the attack strength increases, revealing limitations in their robustness. To address these issues, we propose an RSIS scheme against malicious shadow images by Reusing Polynomial Coefficients with hash function (RSIS-RPC), which provides both malicious shadow detection and error correction capabilities. The correction capability improves with the degree of coefficient reuse, where greater reuse provides stronger resilience to pixel corruption. However, as more coefficients are reused, the size of the generated shadow images increases correspondingly, resulting in higher storage requirements. This trade-off between robustness and efficiency makes the proposed scheme adaptable to diverse application scenarios requiring secure and resilient image sharing. Experimental results and analyses demonstrate that the proposed scheme achieves superior robustness compared to existing schemes.
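
The threshold property described above can be illustrated with classic Shamir-style polynomial sharing of a single pixel value. This sketch is not RSIS-RPC: it has no coefficient reuse, hashing, or error correction, and the choice of the prime 257 and the function names are assumptions made for illustration. Because share values can reach 256, practical image schemes typically work in a field such as GF(251) or GF(2^8) instead.

```python
import random

PRIME = 257  # smallest prime above 255, so every 8-bit pixel is a field element

def make_shares(secret: int, k: int, n: int, rng: random.Random):
    """Split one value into n shares such that any k of them recover it."""
    # Degree-(k-1) polynomial with the secret as its constant term.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover(shares):
    """Lagrange interpolation at x = 0 over GF(PRIME)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

rng = random.Random(0)
shares = make_shares(200, k=3, n=5, rng=rng)
recovered = recover(shares[:3])
```

Any three of the five shares reconstruct the pixel, so up to n - k = 2 shares may be lost, which matches the fault tolerance stated in the abstract. A tampered share, however, silently corrupts the result, which is the gap robust schemes aim to close.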

Abstract:
Automatic medical image segmentation is a fundamental component of computer-aided diagnosis. U-shaped networks (U-Nets) remain the most widely adopted architecture due to their suitability for the unique challenges of medical imaging. However, recent studies have shown that U-Nets fuse low-level visual features from the encoder with high-level semantic features from the decoder using direct skip connections (DSC), which are likely to degrade segmentation performance due to the semantic gap, thereby limiting their ability to meet high-precision clinical requirements. This paper revisits the semantic gap from a performance-oriented perspective and conceptualizes it as a learnable task. A key characteristic of this semantic gap is revealed through comprehensive quantification and visualization, demonstrating its significant negative impact on segmentation performance. Further analysis indicates that a contributing factor is the substantial channel noise present in low-level pixel features, which is transmitted to the decoder via DSC, thereby disrupting the modeling of high-level semantic representations. In response, this paper proposes a self-disambiguating skip connection (SDSC), which incorporates a self-guided filter leveraging spatial features for channel filtering, a multi-layer fusion Transformer to capture long-range contextual dependencies, and Jensen-Shannon divergence as a constraint to enhance learning. The proposed method, referred to as SDSC-UNet, is evaluated through extensive experiments on four challenging benchmarks. The results demonstrate that replacing DSC with our SDSC yields an improvement of 5.91% in mean Intersection over Union (mIoU), 4.58% in Dice coefficient, and 1.32 in Hausdorff Distance, achieving state-of-the-art performance and highlighting the effectiveness of SDSC.
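
The Jensen-Shannon divergence used as a constraint above has a simple closed form for discrete distributions. The sketch below computes it with natural logarithms; how SDSC-UNet applies it to feature maps is not specified in the abstract, so the inputs here are plain probability vectors and the function name is hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()  # normalize so both inputs are valid distributions
    q = q / q.sum()
    m = 0.5 * (p + q)

    def _kl(a, b):
        mask = a > 0  # terms with a == 0 contribute zero; b > 0 wherever a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return float(0.5 * _kl(p, m) + 0.5 * _kl(q, m))

same = js_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # equals ln 2, the upper bound
```

Unlike the KL divergence, this measure is symmetric and bounded by ln 2, which makes it a convenient training constraint when one distribution may assign zero probability where the other does not.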

Abstract:
Micro-holes with small diameters and large aspect ratios require endoscopic probes for deep morphological inspection. Accurate identification of the micro-hole center is essential for precise probe alignment. However, current industrial production lines rely on manual alignment, a time-consuming and labor-intensive process with limited accuracy. Moreover, existing Hough transform-based circle detection methods exhibit poor robustness, often failing in real-world environments, which leads to alignment errors. To address these challenges, this paper proposes a high-precision micro-hole circle detection method that integrates a lightweight focusable network (LF-Net) with a resolution-aware Hough transform (A-RAHT). The method first generates precise micro-hole masks and then efficiently detects circles. Specifically, LF-Net employs a global grouped linear embedding structure and a dynamic channel sparsity mechanism to reduce computational complexity while effectively modeling long-range dependencies, capturing more representative high-level features of micro-hole circles. Then, an adaptive Sobel operator combined with a multiscale receptive field focusable module is introduced to accurately extract circle masks. Finally, A-RAHT partitions the image into multiple regions and applies multi-threaded acceleration alongside a coarse-to-fine resolution-aware Hough transform, enabling accurate and efficient circle detection. Experimental results demonstrate that the proposed method achieves state-of-the-art performance in detecting micro-hole circles in real industrial environments, significantly enhancing automation and precision. Furthermore, additional experiments across various scenarios validate its robustness and generalization capabilities, making it well-suited for high-precision and real-time circle detection in diverse practical applications.

Abstract:
The conditional coding paradigm is widely used in learned video compression, which shows superior performance in capturing redundancies within a large context space. However, existing Conditional coding-based Learned Video Compression (C-LVC) methods ignore that the predicted motion vectors usually contain large uncertainty due to complex motions, occlusions, etc., which consequently decrease the accuracy of the generated temporal contexts. In addition, existing C-LVC methods have a weak ability to mine diverse dependencies within the context space, which are closely related to the coding efficiency. To address these issues, an efficient temporal redundancy mining method is proposed to improve the coding efficiency of C-LVC in this paper. To generate accurate temporal contexts, a Long Short-Term Motion Aggregation (LSTMA) model is proposed, in which an LSTMA-based motion estimation module is developed to capture both current and aggregated long short-term motion information to reduce the uncertainty of predicted motion vectors. Based on the dual motion information, an LSTMA-based temporal context mining module is developed to exploit the aggregated long short-term motion information and increase the accuracy of the generated temporal contexts. In order to fully eliminate spatial-temporal redundancies in a video, a Global-Local Information Decorrelation Module (GLIDM)-based context codec is proposed, in which the GLIDM is designed based on the visual state space block (namely vmamba), the residual block, and the squeeze-and-excitation block to effectively capture long-range, short-range spatial-temporal dependencies and channel-wise dependencies. Experimental results demonstrate that our proposed method can effectively improve the coding performance of C-LVC, and outperforms other state-of-the-art LVC methods.

Abstract:
Point clouds are a fundamental format for immersive experiences, posing significant challenges for storage and transmission. Unlike 2D images, 3D point clouds are sparse and irregular, which complicates their attribute compression. While sparse convolution-based methods have achieved significant success in point cloud attribute compression by leveraging the sparsity of point clouds, they are constrained by a limited receptive field and insufficient adaptability to diverse inputs. To overcome these limitations, this paper proposes CSFormer, a novel point transformer-based architecture that exploits correlations and aggregates features across multiple scales. It retains the advantage of sparse convolution operating on occupied voxels and leverages the varied sparsity distributions and geometry distortions inherent in consecutive scales to construct attention maps, effectively extending the receptive field and adapting to different inputs. We further introduce GCPEM, a Geometry-aware Context Prediction-based Entropy Model that reduces bitrates by jointly utilizing spatial and channel dependencies. Unlike previous methods that capture only one type of information, GCPEM organizes latent features into groups interlaced across both the spatial and channel dimensions and employs a context-prediction mechanism guided by known geometry for efficient coding. Experimental results show that the proposed method outperforms the state-of-the-art learning-based method and the MPEG standard G-PCC codec by over 7% and 28% in BD-BR (Y-PSNR), respectively. Its time complexity is comparable to that of the state-of-the-art learning-based method and G-PCC. The source code and trained models will be released at https://github.com/X-H-offical/CST-PCAC-plus.git

Abstract:
Active contour models (ACMs) have shown effectiveness on remote sensing (RS) image segmentation tasks. However, this type of method still faces two important problems when segmenting RS images. First, the lack of high-level semantic information makes it difficult for ACMs to distinguish targets and backgrounds with similar textures. Second, the manual contour initialization required by ACMs is inconvenient and inefficient. To address these problems, we propose a two-stage segmentation method that combines an improved U-Net using mixed pooling attention with an ACM (UMPA-ACM). In the first stage, a lightweight network based on the U-Net structure is developed to extract semantic features while reducing computational cost. A mixed pooling attention (MPA) module is designed to enhance the ability of the proposed network to extract high-level semantic information from RS images. In the second stage, an adaptive feature enhancement (AFE) module computes grayscale information from the original images and the feature maps produced by the lightweight network and then fuses them to improve the intensity of target edges; a morphological-threshold process (MTP) module automatically generates appropriate initial contours for targets from the semantic feature maps, replacing manual contour initialization. Then, a new ACM, based on pre-fitting the foreground and background in local regions, uses the statistical characteristics of local intensities and bias field correction to suppress the interference of non-target regions, thereby improving segmentation accuracy. Experimental results show that the mean Dice Similarity Coefficient (mDSC) and the mean Intersection over Union (mIoU) of our method are higher than those of the second-best method by 1.21% and 1.60%, respectively, on average across six RS datasets, which verifies the advantage of our proposed method.
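
The two reported metrics can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists and uses the standard definitions of the Dice Similarity Coefficient and Intersection over Union; the function name `dice_and_iou` is hypothetical and is not taken from the paper, and the mean variants (mDSC, mIoU) would average these values over images.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks (nested lists)."""
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += p and t        # pixel is foreground in both masks
            pred_sum += p           # foreground pixels in the prediction
            target_sum += t         # foreground pixels in the ground truth
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks are empty: treat as a perfect match (a convention, not a rule).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    target = [[1, 0, 0],
              [0, 1, 1]]
    print(dice_and_iou(pred, target))
```

For any pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.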

Abstract:
Robust blind watermarking enables the accurate recovery of embedded watermarks without requiring access to the original images or messages, and is widely applied in copyright protection and traceability tracking. However, existing flow-based robust blind watermarking methods typically rely on auxiliary variables during extraction that are inconsistent with the properties of the information lost during embedding, and are often tailored to a single known noise type, resulting in limited robustness against unknown noise. To address these limitations, we propose a flow-based watermarking method that enhances robustness to unknown noise through feature preservation and gradient perturbation. Specifically, we propose a feature preservation network that learns auxiliary variables from noisy images, aligning them with the properties of the lost information to improve the robustness of watermark extraction. In addition, we design an unknown-noise image generation method based on gradient perturbation, which iteratively adds perturbations to encoded images to simulate a variety of potential noise patterns that lead to watermark extraction errors, thereby improving robustness against unknown noise. Experimental results validate the effectiveness of our method, demonstrating an improvement in watermark extraction accuracy and an increase of about 10 dB in peak signal-to-noise ratio (from 33.5 dB to 43.8 dB). The source code is publicly available at https://github.com/reliarui/FPGP.
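
To put the reported PSNR figures in context, the sketch below computes PSNR from its standard definition and converts a gain in dB into the corresponding reduction in mean squared error. It assumes 8-bit images stored as nested lists; the names `psnr` and `mse_ratio_for_gain` are hypothetical helpers, not part of the released code.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """PSNR in dB between two equally sized images given as nested lists."""
    squared_error = 0.0
    count = 0
    for ref_row, dis_row in zip(reference, distorted):
        for r, d in zip(ref_row, dis_row):
            squared_error += (r - d) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # Gap between the two PSNR values quoted in the abstract.
    print(mse_ratio_for_gain(43.8 - 33.5))
```

Under this definition, the 10.3 dB gap between 33.5 dB and 43.8 dB corresponds to a mean squared error that is more than ten times smaller.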

Abstract:
Image-in-image hiding, which embeds a full-size secret image into a cover image with minimal perceptual distortion and accurate recovery, has attracted increasing attention because of its wide applications in copyright protection, covert communication, and digital forensics. However, when facing full-size secret images, the secret data often exceed the capacity limitation of individual cover images, resulting in existing solutions struggling to achieve an effective balance among robustness, high load capacity, and resistance to steganalysis and detection capabilities, which is particularly prominent when encountering social noise interference in real-world scenarios. To address these challenges, we propose MambaRIS, a robust and efficient image steganography framework that combines the dual-tree complex wavelet transform (DTCWT) with a state space model. The DTCWT module enables directionally selective decomposition of the input, enriching frequency-domain representations and providing more resilient embedding regions for robust cross-frequency hiding and recovery. Furthermore, we introduce a Mamba-based autoencoder architecture equipped with a novel spatial channel Mamba block (SCMB), which integrates spatial and channel attention mechanisms with linear-time global dependency modeling, significantly improving embedding adaptability under complex noise and distortion conditions. Extensive experiments demonstrate the superiority of the proposed scheme in terms of visual quality, robustness, and resistance to steganalysis. In JPEG compression with a quality factor of Q = 80, our method achieves an average improvement in hiding and recovery accuracy (measured by PSNR) of 0.87 dB compared with state-of-the-art architectures, while reducing the number of parameters by 33%.

Abstract:
Synthetic Aperture Radar (SAR) images are crucial for maritime vessel detection; however, challenges such as blurred ship edges, strong land scattering interference, and angular regression mismatches across varying target sizes hinder accurate rotational localization. In this paper, an oriented decoupling target detection method (R-MCLST) is proposed to address these issues. The method integrates three key modules: a multi-channel positioning module (MC-PM) that employs distributed average pooling and additional coordinate channels to enhance orientation awareness; a soft threshold-based multilayer perceptron (ST-MLP) that effectively mitigates background interference while robustly extracting complex features; and a Gaussian distribution-based prediction box (GD-BPB) that transforms rotated bounding box encoding into a two-dimensional Gaussian distribution using KL divergence for adaptive parameter adjustment. Experimental evaluations on the R-SSDD and MR-HRSID datasets demonstrate that R-MCLST achieves superior performance, with the R-SSDD dataset yielding AP50 of 87.48%, AP75 of 34.96%, AP of 41.15%, and AR of 46.52%, and the MR-HRSID dataset yielding AP50 of 61.59%, AP75 of 4.96%, AP of 19.13%, and AR of 22.24%. Comparative analyses confirm that the proposed method outperforms current state-of-the-art networks in accurately localizing rotating targets under challenging SAR imaging conditions.
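
The idea of encoding a rotated box as a 2D Gaussian and comparing boxes with KL divergence can be sketched as follows. This is an illustration under stated assumptions, not the paper's GD-BPB module: it assumes the conversion commonly used in Gaussian-based rotated-box losses (mean at the box centre, covariance R·diag(w²/4, h²/4)·Rᵀ) and the closed-form KL divergence between two Gaussians. The function names are hypothetical.

```python
import math


def box_to_gaussian(cx, cy, w, h, theta):
    """Map a rotated box (centre, width, height, angle in radians) to
    (mean, covariance). The covariance is R diag(w^2/4, h^2/4) R^T,
    stored as the tuple (sxx, sxy, syy)."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2.0) ** 2, (h / 2.0) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)


def kl_divergence(p, q):
    """Closed-form KL(p || q) between two 2D Gaussians."""
    (mpx, mpy), (pxx, pxy, pyy) = p
    (mqx, mqy), (qxx, qxy, qyy) = q
    det_p = pxx * pyy - pxy * pxy
    det_q = qxx * qyy - qxy * qxy
    # Entries of the inverse of the 2x2 covariance of q.
    ixx, ixy, iyy = qyy / det_q, -qxy / det_q, qxx / det_q
    trace = ixx * pxx + 2.0 * ixy * pxy + iyy * pyy
    dx, dy = mqx - mpx, mqy - mpy
    mahalanobis = ixx * dx * dx + 2.0 * ixy * dx * dy + iyy * dy * dy
    return 0.5 * (trace + mahalanobis - 2.0 + math.log(det_q / det_p))


if __name__ == "__main__":
    box_a = box_to_gaussian(0.0, 0.0, 4.0, 2.0, 0.0)
    box_b = box_to_gaussian(0.0, 0.0, 2.0, 4.0, math.pi / 2)  # same region as box_a
    box_c = box_to_gaussian(1.0, 0.0, 4.0, 2.0, 0.0)          # shifted copy of box_a
    print(kl_divergence(box_a, box_b), kl_divergence(box_a, box_c))
```

One reason this representation is attractive is that two parameterizations of the same rectangle (width and height swapped, angle shifted by 90 degrees) map to the same Gaussian, so the divergence between them is zero up to floating-point error.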

Abstract:
Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet V2, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM

Affiliations: School of Automotive Studies, Tongji University, Shanghai, China; Momoni AI, Gothenburg, Sweden; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Chair of Automotive Engineering, Technische Universität Berlin, Berlin, Germany; College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China; School of Information and Electrical Engineering, Hangzhou City University, Hangzhou, China

Abstract:
3D object detection and occupancy prediction are critical tasks in autonomous driving, attracting significant attention. Despite the potential of recent vision-based methods, they encounter challenges under adverse conditions. Thus, integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is highly significant, though research in this domain remains limited. In this paper, we propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction, enabling comprehensive environmental perception. Specifically, we introduce a novel Coarse Voxel Queries Generator that integrates geometric priors from 4D radar with semantic features from images to initialize voxel queries, establishing a robust foundation for subsequent Transformer-based refinement. To leverage temporal information, we design a Dual-Branch Temporal Encoder that processes multi-modal temporal features in parallel across BEV and voxel spaces, enabling comprehensive spatio-temporal representation learning. Furthermore, we propose a Cross-Modal BEV-Voxel Fusion module that adaptively fuses complementary features through attention mechanisms while employing auxiliary tasks to enhance feature quality. Extensive experiments on the OmniHD-Scenes, View-of-Delft (VoD), and TJ4DRadSet datasets demonstrate that Doracamom achieves state-of-the-art performance in both tasks, establishing new benchmarks for multi-modal 3D perception. Code and models are available at https://github.com/TJRadarLab/Doracamom

Abstract:
Limited by observation conditions, the availability of synthetic aperture radar (SAR) image samples is constrained, posing challenges for deep learning-based SAR target recognition. SAR target recognition algorithms based on few-shot learning (FSL) have made significant progress, and recent studies indicate that local descriptors outperform image-level feature representations. However, because SAR target backgrounds are similar, some irrelevant local descriptors severely degrade the accuracy of the metric. To address this issue, we propose a novel metric-based FSL framework for SAR target recognition, i.e., an adaptive weighted mutual nearest neighbor network with support-query collaborative feature reconstruction. First, we propose a support-query collaborative feature reconstruction module, in which calculating the similarity between the reconstructed support features and the support feature values allows the model to better capture intra-class common features and enhance intra-class consistency. Meanwhile, calculating the similarity between the reconstructed query features and the query feature values helps the model identify effective distinguishing features, highlight target saliency, and increase inter-class differentiation. Second, an adaptive weighted mutual nearest neighbor module is designed, in which the weights of query descriptors are adjusted to highlight distinctive features and reduce background interference. Finally, a metric fusion module is proposed, which not only computes image-to-class metrics based on local descriptors but also integrates similarity from the support set to the query set as well as from the query set to the support set, enabling a discriminative similarity measure. Experimental results on three public SAR datasets demonstrate that our proposed algorithm achieves better classification performance than other FSL algorithms.

Abstract:
With the growing demand for cloud services, traditional image privacy encryption methods applied in cloud scenarios reveal two major issues. First, security is often achieved at the expense of visibility, which is incompatible with cloud service scenarios such as information preview and search. Second, there is a lack of design for hierarchical visual privacy for users with different security levels. To address these problems, a multi-level key mechanism is designed and integrated with the YOLOv5 network, providing multi-level privacy protection for sensitive regions while also achieving a balance between visibility and security in these regions. Simulation results demonstrate that the proposed framework can decrypt images with multi-level visual effects. Performance analysis shows that the framework achieves an adjustable balance between visibility and security, which users can modify by adjusting parameters. Compared to other visibility-security trade-off schemes, this approach offers advantages including strong reversibility, high image size compatibility, adjustable visual effects, and computational efficiency.

Abstract:
Carpal Tunnel Syndrome (CTS), the predominant type of peripheral entrapment neuropathy, necessitates timely and precise diagnosis for optimal treatment. While ultrasound (US) has emerged as a non-invasive diagnostic modality, existing computer-aided methods largely rely on static frames or unimodal features, failing to capture the temporal dynamics and multidimensional cues integral to clinical assessment. To address these limitations, we present a panoramic diagnostic system for CTS that harnesses the full potential of US videos. It seamlessly consolidates median nerve segmentation, multi-dimensional biometric measurement, and multimodal CTS classification within a unified pipeline, transforming conventional diagnosis into a comprehensive digital solution. Specifically, we develop a Mamba-based video segmentation model with temporal compression and spectral gated enhancement to enable efficient, high-fidelity delineation and measurement. Building upon this, we propose a Spatiotemporal and Multimodal Processing (STAMP) framework that synergistically integrates video dynamics, anatomical measurements, and clinical covariates through bidirectional metadata-visual interactions and temporal contextualization. This approach mirrors the clinical reasoning process and provides clinically interpretable diagnostic results. Experimental results demonstrate that the proposed system outperforms existing automatic segmentation methods across multiple metrics (Dice=87.18%) and achieves performance comparable to manually initialized ones. Moreover, our system not only surpasses both frame-based and video-based approaches (F1-score=97.44%), but also outperforms junior radiologists and rivals senior experts.

Abstract:
We present a novel quality assessment method which can predict the perceptual quality of point clouds from new scenes without available annotations by leveraging the rich prior knowledge in images, called the Distribution-Weighted Image-Transferred Point Cloud Quality Assessment (DWIT-PCQA). Recognizing the human visual system (HVS) as the decision-maker in quality assessment regardless of media types, we can emulate the evaluation criteria for human perception via neural networks and further transfer the capability of quality prediction from images to point clouds by leveraging the prior knowledge in the images. Specifically, domain adaptation (DA) can be leveraged to bridge the images and point clouds by aligning feature distributions of the two media in the same feature space. However, the different manifestations of distortions in images and point clouds make feature alignment a difficult task. To reduce the alignment difficulty and consider the different distortion distributions during alignment, we have derived formulas to decompose the optimization objective of the conventional DA into two suboptimization functions with distortion as a transition. Specifically, through network implementation, we propose the distortion-guided biased feature alignment which integrates existing/estimated distortion distribution into the adversarial DA framework, emphasizing common distortion patterns during feature alignment. Besides, we propose the quality-aware feature disentanglement to mitigate the destruction of the mapping from features to quality during alignment with biased distortions. Experimental results demonstrate that our proposed method exhibits reliable performance compared to general blind PCQA methods without needing point cloud annotations.

Abstract:
Gaussian relighting is critical for appearance editing and physically based rendering, which rely on the decomposition and estimation of light and materials. However, existing methods treat both the environmental lighting and material features as unknowns and estimate them jointly, which results in inaccurate material decomposition and estimation due to the inherent variation of environmental illumination across datasets. In this study, we introduce realistic and stable 3D Gaussian relighting (RS3DGR), which is enabled by semantic-guided stable material estimation through object-oriented differentiable Gaussian path tracing. First, the target object is segmented from the environment via semantic 3D Gaussian segmentation, and the semantic label is used as a roughness regularization term in optimization. Then, the object is illuminated by environmental light sampled from the radiance field of the environmental Gaussians or from other types of radiance caches. The object-oriented path tracing rendering is differentiable and efficient. Under the alpha blending of primary rays penetrating through object Gaussians, secondary rays are uniformly emitted from the primary-hit Gaussians to directly sample the environmental radiance field and model the reflection on the object surface. As a result, the influence of unstable environmental lighting is removed, and the semantic prior promotes the convergence of material estimation. Experimental results demonstrate significant improvements in stability and accuracy for material reconstruction on both synthetic and self-constructed datasets.

Affiliations: School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau, SAR, China; School of Cyber Security, Shandong University of Political Science and Law, Jinan, China; Guangdong Polytechnic of Science and Technology, Zhuhai, China; Faculty of Applied Sciences, Macao Polytechnic University, Macau, SAR, China; School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China

Abstract:
The proliferation of modern image editing tools has raised concerns about image manipulation, particularly regarding the potential to mislead the public and compromise privacy and security. Consequently, detecting and localizing tampered regions has become a critical research challenge. Traditional methods struggle with subtle manipulations, such as splicing, copy-move, and removal, which are often more discernible in the frequency domain than in the spatial domain. Additionally, the size imbalance between the tampered and background regions further complicates the detection process. To address these challenges, we propose DFFormer, an end-to-end network that leverages frequency feature differences and a dynamic token strategy for precise manipulation localization. DFFormer combines a Convolutional Neural Network (CNN) and a Transformer in a hybrid architecture with three key modules: the Adaptive Frequency Transformer (AFT), the Prototype Learning Module (PLM), and the Cascaded Progressive Token Fusion Head (CPTF-Head). AFT integrates high- and low-frequency components into self-attention via the Parallel Adaptive Frequency Attention (PAFA) block, enhancing tampering feature representation while preserving fine details. PLM employs KNN-based density peak clustering (DPC-KNN) and weighted token aggregation to optimize dynamic token reduction. The CPTF-Head adopts a hierarchical coarse-to-fine strategy to integrate multiscale features, thereby improving localization accuracy and edge refinement. Experiments demonstrate that DFFormer outperforms state-of-the-art models across four benchmark datasets and one real-world dataset, exhibiting superior generalization and robustness. The source code is publicly available at https://github.com/XiangGD/DFFormer.git

Affiliations: College of Computer Science and the National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China; National-Local Joint Engineering Research Center of Biodiagnosis and Biotherapy, the Second Affiliated Hospital, Xi’an Jiaotong University, Xi’an, China; TCM Regulating Metabolic Diseases Key Laboratory of Sichuan Province, Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu, China; Digital and Intelligent Health Research Center, Anqing Normal University, Anqing, China; Department of Computer Science, Faculty of Science, Memorial University of Newfoundland, St. John’s, NL, Canada; School of Cyber Science and Engineering, Sichuan University, Chengdu, China

Abstract:
Medical image analysis plays a key role in computer-aided diagnosis, where segmentation and classification are essential and interconnected tasks. While multi-task learning (MTL) has been widely explored to leverage inter-task synergies, effectively guiding knowledge transfer to prevent task conflict and negative transfer remains a key challenge, particularly in anatomically complex diagnostic scenarios. This paper presents LTRMTL-Net, a novel multi-task learning framework for medical image analysis that simultaneously addresses segmentation and classification tasks guided by lesion regions and the spatial relationships of tissues. The proposed architecture integrates an Enhanced Lesion Region Fusion (ELRF) module that leverages GradCAM-guided attention mechanisms to precisely locate and enhance lesion regions, providing critical prior knowledge for both tasks. A Tissue Space Structure Prediction (TSSP) component captures local-global spatial dependencies through contrastive learning, establishing effective anatomical context modeling. The core encoder employs Hybrid Wavelet-State Attention blocks that combine modulated wavelet transform convolutions with structured state space models to extract multi-scale features while maintaining computational efficiency. Dual-stream inputs with a symmetric architecture accommodate single-source scenarios across diverse medical imaging applications. Experimental results on mammography and breast ultrasound datasets demonstrate that the proposed method captures fine-grained lesion boundary details while providing accurate malignancy classification. Harnessing cooperative knowledge transfer between segmentation and classification, guided by anatomical priors, boosts diagnostic performance and provides comprehensive, interpretable clinical insights.

Abstract:
The ceramic package substrate plays a crucial role in the field of integrated circuit manufacturing. Nevertheless, the lack of a well-defined dataset and benchmark, as well as the scarcity of defective ceramic package substrate samples, hinders further research and optimization. To address these issues, we introduce CPS3D-Det, a high-precision 3D industrial defect detection dataset based on IC ceramic packaging substrates. All samples are collected from multi-batch production stages in actual industrial production scenarios; the dataset comprises 1640 high-spatial-resolution point cloud samples (0.0025 mm) and hundreds of millions of points in total. In addition, we propose a benchmark method, AIFAD, an end-to-end point cloud defect detection method based on auxiliary information flows. AIFAD transforms sparse convolutions into dense features to enhance the sparse backbone and does not rely on handcrafted proxies. Through comprehensive evaluations, we demonstrate the relevance and effectiveness of our dataset and benchmark. They will be released at https://github.com/dcsrgh/CPS3D-Det.

Abstract:
To address the problems of sensitive data leakage and unauthorised direct access during the storage and transmission of shared medical data, this paper proposes a new blockchain-based scheme for the secure sharing of medical data. First, to ensure the security and randomness of the keys in the scheme, a new 4D-YG hyperchaotic key generator is designed. Second, a self-expanding fractal Sierpinski triangle permutation operation is performed on medical images to achieve an image decentring effect. A custom Order-Preserving Encryption (OPE) algorithm is then used to diffuse the permuted images, ensuring consistency in pixel order before and after encryption. Subsequently, mix-bit triple normalized diffusion is performed on the images that have completed preliminary diffusion to enhance the resistance of the private data within the image to attacks. Finally, the decryption permission for encrypted shared medical data is restricted through smart contracts in the blockchain. While achieving secure transmission of medical data, the scheme enables hospitals to perform medical image segmentation and organ recognition diagnostics on shared data in an encrypted state, which is more in line with practical needs. According to performance test data, the proposed encryption algorithm can effectively resist differential attacks and frequency attacks, and the Number of Pixels Change Rate (NPCR) test results can reach the ideal value of 99.6094%.
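
The NPCR figure can be reproduced from the metric's standard definition. The sketch below assumes two 8-bit cipher images stored as nested lists; the function name `npcr` is a hypothetical helper. The ideal value follows from the fact that two independent, uniformly random 8-bit pixels differ with probability 255/256.

```python
def npcr(cipher_a, cipher_b):
    """Number of Pixels Change Rate, in percent, between two cipher images
    of identical size (nested lists of pixel values)."""
    total = changed = 0
    for row_a, row_b in zip(cipher_a, cipher_b):
        for a, b in zip(row_a, row_b):
            total += 1
            changed += a != b  # counts 1 when the two pixels differ
    return 100.0 * changed / total


# Expected NPCR for two independent, uniformly random 8-bit images.
IDEAL_NPCR_8BIT = 100.0 * 255 / 256


if __name__ == "__main__":
    print(npcr([[1, 2], [3, 4]], [[1, 9], [8, 7]]))
    print(round(IDEAL_NPCR_8BIT, 4))
```

In a differential-attack test, the two cipher images are typically obtained by encrypting two plain images that differ in a single pixel; an NPCR close to the ideal value indicates that the change has spread across the whole cipher image.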

Abstract:
In industrial inspection, achieving clear X-ray imaging requires a high radiation dose, which poses risks such as damaging inspected targets, impacting operator health, and reducing inspection efficiency due to longer imaging times. Thus, ensuring inspection accuracy while minimizing radiation dose has become an urgent practical requirement. An effective approach is to use image enhancement technology to improve the quality of low-dose images so that they achieve the same clarity as high-dose images. However, existing methods, such as those using Transformer backbones, struggle to balance capturing global information (critical for accurate industrial inspection) with reducing computational resource consumption. In addition, existing datasets lack paired low- and high-dose images that are both captured from real-world sources for training these models. For instance, low-dose images are often generated by artificially adding noise to real high-dose images. As a result, although existing methods achieve high accuracy on these datasets, applying them to real-world industrial inspections remains challenging. To address these issues in both methods and datasets, we propose a Low-dose X-ray Image Enhancement network (LXIE-net) based on the Mamba architecture for industrial inspection, and a large-scale dataset with real-captured paired high- and low-dose X-ray images, named HLXray. To further demonstrate the effectiveness of our method in real-world industrial inspection scenarios, we annotated 800 real images with three types of defects in the HLXray dataset. In addition, we simulated low-dose images on the public GDXray dataset by adding noise to the corresponding high-dose images. Based on these two datasets, we conducted industrial defect detection experiments using low-dose images, generated high-dose images (i.e., enhanced low-dose images produced by our method), and real high-dose images for comparison. The results show that using the generated high-dose images significantly improves detection accuracy compared to using the original low-dose images, and the performance is nearly consistent with that of the real high-dose images. This demonstrates the effectiveness of our method in real industrial inspection applications. The code and dataset are publicly available at https://github.com/YqunYang/LXIE-net

Abstract:
Deep learning (DL) has achieved unprecedented success in precisely diagnosing dermatological lesions. However, increasing concerns over diagnostic unfairness across different demographic subgroups in DL algorithms raise issues of ethical violations and healthcare inequity. Due to limited access to the internal workings of DL algorithms, post-processing methods are widely regarded as efficient techniques for fairness enhancements in DL-based predictions. However, these methods often come at the cost of accuracy, and research exploring their application to medical images remains limited. To address these issues, we propose FADiaFrame, an innovative post-processing framework designed to enhance both fairness and accuracy in DL-based diagnosis of dermatological lesions. Specifically, our uncertainty-aware gating calibration mechanism in FADiaFrame identifies and calibrates untrustworthy samples, thereby enhancing diagnostic fairness, accuracy, and trustworthiness. Furthermore, we integrate this mechanism with an optimal transport method to further enhance group fairness. Extensive experiments on real-world datasets demonstrate that FADiaFrame outperforms existing post-processing methods in terms of both fairness and accuracy. Notably, FADiaFrame preserves diagnostic accuracy while achieving significant gains in fairness compared to pre-trained baseline models. Among all baselines, MedCLIP, with FADiaFrame, shows the most substantial improvement for the age-sensitive attribute, with accuracy increasing by 3.34% and demographic parity rising by 10.60%. Our results suggest that FADiaFrame provides universal applicability across diverse DL models for medical image diagnosis, ensuring both fair and accurate diagnosis across a wide range of devices and deployment contexts.
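
As a rough illustration of the group-fairness notion mentioned above, the sketch below computes per-group positive prediction rates and their largest gap, one common way to quantify demographic parity for binary predictions. The abstract does not state the exact definition or score the authors report, so the formula, the binary setting, and the names `positive_rates` and `demographic_parity_gap` are assumptions.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Fraction of positive (== 1) predictions within each demographic group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for y_hat, group in zip(predictions, groups):
        counts[group][0] += int(y_hat == 1)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive prediction rate between any two groups.
    A gap of 0 means every group receives positive predictions at the same rate."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    preds = [1, 0, 1, 1, 0, 0]
    ages = ["young", "young", "young", "old", "old", "old"]
    print(positive_rates(preds, ages), demographic_parity_gap(preds, ages))
```

A post-processing method in this setting adjusts predictions after training so that this gap shrinks, ideally without lowering accuracy.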

Abstract:
Foreign object detection (FOD) in railway catenary systems is crucial for ensuring operational safety and preventing catastrophic failures. However, current detection frameworks encounter two significant challenges. First, the infrequency of fault events leads to severe data scarcity, hampering the training and validation of robust detection models. Second, although lightweight networks (e.g., MobileNet, YOLO) achieve compactness by compressing channel factors, they struggle to balance local feature extraction with global dependency modeling. To address these challenges, we propose a solution that includes the following: 1) RailFOD23, a publicly available dataset created using generative AI to mitigate data scarcity; and 2) EPRepSADet, a compact detection framework that utilizes a re-parameterizable bottleneck (Re-bottleneck) and lightweight self-attention (LSA) module for efficient FOD. The Re-bottleneck consolidates multi-branch structures into a single-path representation, whereas LSA facilitates element-wise attention modeling to effectively reduce computational complexity. In addition, the efficient detection head further minimizes model complexity through hierarchical semantic modeling. Extensive experiments demonstrate that EPRepSADet achieves a mean Average Precision (mAP) of 92.5% on the RailFOD23 test set, requiring only 1.7G FLOPs, thus outperforming several state-of-the-art baseline models.