TIP2026

Abstract:
The synthesis of computed tomography images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between various regions, traditional approaches often require each model to undergo independent development and use. In this paper, we propose a unified model driven by prompts that dynamically adapts to the different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets of anatomical parts demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model. The repository is accessible to the public at: https://github.com/yhyumi123/RSAM

Abstract:
Current compressed video super-resolution methods have achieved promising performance, but they often assume that an input video is compressed under low-delay configurations. However, under random access configurations, those methods might struggle to leverage the metadata effectively due to the large variations of metadata in different compression configurations. In this work, we propose a Compression-Oriented Video Super-Resolution (COVSR) method that can address video super-resolution for both low-delay and random-access configurations. Specifically, we first introduce an efficient compression-aware propagation (ECAP) module that dynamically adjusts propagation routes in accordance with the compression configurations. Since existing methods require reconstructing frames in a frame-by-frame manner, it is difficult to achieve efficient parallelization. However, we find that by slightly relaxing sequential dependencies, our ECAP can significantly improve inference speed. Furthermore, existing methods typically perform alignment between adjacent frames or adjacent features. However, since ECAP may propagate features along non-adjacent reference routes, it introduces new challenges for accurate cross-frame feature alignment. In response, we propose a metadata-driven alignment (MDA) module that refines cross-frame motion vectors into dense, feature-level flow offsets, enabling precise alignment across temporally distant features. Extensive experimental results demonstrate that our COVSR not only achieves efficient and superior super-resolution performance but also is generalizable to various compression configurations. Our code will be available at https://covsr.github.io

Abstract:
Previous research on sparse feature matching typically involves a staged optimization process of keypoint detection, description, and matching. While it allows the network to adapt to specific inputs, it may limit the network’s expressive capability and the overall architectural flexibility. In this study, we rethink the matching framework and propose to directly match any given keypoints, optimizing the matching network in an approximately end-to-end manner. To achieve this, firstly, we dynamically sample random positions within the images as assumed keypoints during training, allowing the network to explore a broader matching space. Secondly, we replace specific descriptors with high-efficiency sparse embeddings at multiple levels of the image, facilitating the direct learning of underlying textures. Thirdly, we propose a novel and promising architecture, called Proposal-Guided TRansformer (PGTR), which aggregates context information from neighboring match proposals instead of searching globally with local features. PGTR works especially well under our training approach, and attains a synergistic advantage in terms of performance and efficiency. The overall pipeline achieves outstanding performance on various keypoints without any retraining, and can be flexibly reused when new keypoints emerge, making it valuable for real-world applications. Code will be available.

Abstract:
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image but in the process learns to reconstruct the clean image. This phenomenon is attributed to the CNN’s internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent with Fourier features (Positional Encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the properties of Fourier features. We also prove that they are equivalent in the case of linear networks. We name our scheme “Positional Encoding Image Prior” (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image-priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP
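As a rough illustration of the positional-encoding input described above, the sketch below maps each pixel coordinate to sin/cos Fourier features that a per-pixel MLP could consume. The function names, the `num_freqs` parameter, and the frequency ladder 2^k·π are assumptions made for this example and are not taken from the paper.

```python
import math


def fourier_features(x, y, num_freqs=4):
    """Encode a pixel coordinate (x, y) in [0, 1]^2 as sin/cos features.

    Frequencies are 2^k * pi for k = 0..num_freqs-1 (an assumed, common choice).
    """
    feats = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for coord in (x, y):
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


def encode_grid(height, width, num_freqs=4):
    """Return one feature vector per pixel, row-major, with coordinates in [0, 1)."""
    return [
        fourier_features(col / width, row / height, num_freqs)
        for row in range(height)
        for col in range(width)
    ]


# A 2x3 image yields 6 feature vectors of length 4 * num_freqs.
grid = encode_grid(2, 3, num_freqs=4)
print(len(grid), len(grid[0]))
```

Each pixel gets a fixed, deterministic code, so the network input carries explicit spatial structure rather than random noise.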

Abstract:
In Full-Reference Image Quality Assessment (FR-IQA), subjective Mean Opinion Scores (MOS) reflect human retinal perception, whereas objective metrics operate on the displayed image. Bridging these domains requires parametric mappings that are sensitive to viewing distance. This paper introduces Blur-Equivalent Linearized Estimator (BELE), a lightweight and perceptually interpretable FR-IQA model that disentangles the impact of strong edge degradations from that of texture distortions. BELE computes two indices: a blur index derived from a linearized estimator of Positional Fisher Information loss on strong edges, explicitly accounting for viewing distance; and a texture index based on a Complex Peak Signal-to-Noise Ratio (CPSNR) that captures distortions affecting fine spatial details in textured regions. All distortion estimates are combined via low-order polynomial fitting, with a focalization term applied in this final stage to replace the VQEG rectification, thereby eliminating its limitations such as overfitting and lack of interpretability. BELE is entirely training-free, requiring only five interpretable parameters, and achieves very low computational complexity. Extensive experiments on six benchmark datasets demonstrate that BELE attains competitive or superior correlation with MOS compared to both classical and deep learning-based FR-IQA methods, while offering strong generalization, real-time feasibility, and minimal resource demands.

Abstract:
Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.

Abstract:
Diffusion models have demonstrated impressive abilities in generating photo-realistic and creative images. To offer more controllability for the generation process of diffusion models, previous studies normally adopt extra modules that integrate condition signals by manipulating the intermediate features of the noise predictors, and these modules often fail under conditions not seen during training. Although subsequent studies are motivated to handle multi-condition control, they are mostly resource-consuming to implement, so more generalizable and efficient solutions are needed for controllable visual generation. In this paper, we present a late-constraint controllable visual generation method, namely LaCon, which enables generalization across various modalities and granularities for each single-condition control. LaCon establishes an alignment between the external condition and specific diffusion timesteps, and guides diffusion models to produce conditional results based on this built alignment. Experimental results on prevailing benchmark datasets illustrate the promising performance and generalization capability of LaCon under various conditions and settings. Ablation studies analyze different components in LaCon, illustrating its great potential to offer flexible condition controls for different backbones.

Abstract:
Contrastive learning facilitates the acquisition of informative skeleton representations for unsupervised action recognition by leveraging effective positive and negative sample pairs. However, most existing methods construct these pairs through weak or strong data augmentations, which typically rely on random appearance alterations of skeletons. While such augmentations are somewhat effective, they introduce semantic variations only indirectly and face two inherent limitations. First, simply modifying the appearance of skeletons often fails to reflect meaningful semantic variations. Second, random perturbations can unintentionally blur the boundary between positive and negative pairs, weakening the contrastive objective. To address these challenges, we propose an attack-driven augmentation framework that explicitly introduces semantic-level perturbations. This approach facilitates the generation of hard positives while guiding the model to mine more informative hard negatives. Building on this idea, we present Attack-Augmented Mixing-Contrastive Skeletal Representation Learning (A2MC), a novel framework that focuses on contrasting hard positive and hard negative samples for more robust representation learning. Within A2MC, we design an Attack-Augmentation (Att-Aug) module that integrates both targeted (attack-based) and untargeted (augmentation-based) perturbations to generate informative hard positive samples. In parallel, we propose the Positive-Negative Mixer (PNM), which blends hard positive and negative features to synthesize challenging hard negatives. These are then used to update a mixed memory bank for more effective contrastive learning. Comprehensive evaluations across three public benchmarks demonstrate that our approach, termed A2MC, achieves performance on par with or exceeding existing state-of-the-art methods.

Abstract:
Cross-modal hashing (CMH) aims to bridge the semantic gap between heterogeneous modalities by learning compact binary representations for efficient retrieval. Most existing deep cross-modal hashing methods are developed under the assumption that multimodal data are complete and perfectly paired across modalities. However, this assumption rarely holds as real-world multimodal datasets often suffer from missing modalities due to inconsistencies, imbalances, or noise during data collection. To address such incomplete data, existing incomplete CMH methods typically attempt to reconstruct the missing information by exploiting internal signals from the available modalities. Nonetheless, these internally guided completion strategies tend to be highly sensitive to distributional shifts, leading to substantial performance degradation on unseen or out-of-distribution data. Inspired by the human learning mechanism of enhancing cognition through external knowledge, this paper proposes a novel External Guidance Incomplete Cross-modal Hashing (EGICH) framework to address this limitation. Specifically, we first design a Completion with External Guidance (CEG) module that leverages rich semantic information from external knowledge bases to expand the semantic boundary and accurately reconstruct the semantics of missing samples. Subsequently, we introduce a Consistency Learning with External Guidance (CLEG) module, which employs externally guided reconstructed features as anchors to align sample representations with label semantics, thereby effectively mitigating cross-modal bias. Finally, a Semantic-aware Contrastive Hashing (SCH) module is developed to refine the feature distribution by semantic similarity, pulling semantically related samples closer and pushing unrelated ones apart, thus achieving fine-grained discrimination among positive pairs. To the best of our knowledge, this is the first attempt to incorporate external knowledge into incomplete cross-modal hashing. 
Extensive experiments demonstrate that EGICH consistently and significantly outperforms 11 state-of-the-art methods under various modality-missing scenarios. The code is available at https://github.com/chenjiali27/EGICH

Abstract:
Positive and Unlabeled (PU) learning aims to train a suitable classifier based only on a set of positive data and unlabeled data. Existing PU methods usually follow a discriminative framework and yield limited classification performance, because the lack of explicit negative labels poses a great barrier in training a discriminative PU model. To address the challenge of limited supervisory information faced by discriminative PU methods, this paper introduces a generative operation to PU learning in addition to the conventional discriminative operation, and proposes a novel algorithm dubbed “Discriminative-Generative Positive and Unlabeled Learning” (DGPU). Specifically, our proposed DGPU consists of a data generation stage and a discriminative annotation stage, which can benefit from each other in an iterative manner. In the data generation stage, we employ a tailored diffusion model to generate high-quality negative examples and positive examples to efficiently enrich the supervisory information. In the discriminative annotation stage, the classifier is further refined on the initial and generated training data. To the best of our knowledge, this study represents the first attempt to integrate diffusion models into PU learning so that the generative and discriminative models benefit from each other in a collaborative way. Thanks to this, our proposed DGPU significantly outperforms existing PU methods across a wide range of synthetic and real-world benchmark datasets. In particular, our DGPU is almost comparable to the fully supervised counterpart, and improves the test accuracy of existing state-of-the-art methods by 3.89% and 2.56% on CIFAR-10 and CelebA datasets, respectively.

Abstract:
The out-of-distribution (OOD) detection task is crucial for the real-world deployment of machine learning models. In this paper, we propose to study the problem from the perspective of Sharpness-aware Minimization (SAM). Compared with traditional optimizers such as SGD, SAM can better improve the model performance and generalization ability, and this is closely related to OOD detection. Therefore, instead of using SGD, we propose to fine-tune the model with SAM, and observe that the score distributions of in-distribution (ID) data and OOD data are pushed away from each other. Besides, with our carefully designed loss, the fine-tuning process is very time-efficient. The OOD performance improvement can be observed after fine-tuning the model within 1 epoch. Moreover, our method is very flexible and can be used to improve the performance of different OOD detection methods. Extensive experiments have demonstrated that our method achieves state-of-the-art performance on various OOD benchmarks across different architectures. Moreover, comprehensive ablation studies and theoretical analyses are discussed to support the empirical results.
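For readers unfamiliar with Sharpness-aware Minimization, the following is a minimal sketch of one generic SAM update: the weights are first perturbed along the normalized gradient by a radius `rho`, and the descent step then uses the gradient evaluated at that perturbed point. The toy quadratic loss, the values of `lr` and `rho`, and all function names are assumptions for illustration; this does not reproduce the paper's fine-tuning loss or OOD scoring.

```python
import math


def loss(w, a):
    """Toy quadratic loss sum(a_i * w_i^2), standing in for a training loss."""
    return sum(ai * wi * wi for ai, wi in zip(a, w))


def grad(w, a):
    """Analytic gradient of the toy loss."""
    return [2.0 * ai * wi for ai, wi in zip(a, w)]


def sam_step(w, a, lr=0.01, rho=0.05):
    """One SAM update: ascend to w + rho * g / ||g||, then descend with that gradient."""
    g = grad(w, a)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point; nothing to perturb
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv, a)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


a = [1.0, 10.0]   # hypothetical curvatures
w = [1.0, -2.0]   # hypothetical starting weights
initial = loss(w, a)
for _ in range(50):
    w = sam_step(w, a)
final = loss(w, a)
print(final < initial)
```

Compared with a plain gradient step, the only extra cost is the second gradient evaluation at the perturbed weights.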

Affiliations: School of Computing and Artificial Intelligence and the Artificial Intelligence and Digital Finance Key Laboratory of Sichuan Province, Southwestern University of Finance and Economics, Chengdu, China; School of Business Administration, Southwestern University of Finance and Economics, Chengdu, China; Chinese Academy of Sciences, Institute of Optics and Electronics, Chengdu, China; Xiangjiang Laboratory, Changsha, China; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China

Abstract:
Information theory has inspired numerous advancements in multi-view learning. Most multi-view methods incorporating information-theoretic principles rely on an assumption called multi-view redundancy, which states that common information between views is necessary and sufficient for downstream tasks. This assumption emphasizes the importance of common information for prediction, but inherently ignores the potential of unique information in each view that could be predictive for the task. In this paper, we propose a comprehensive information-theoretic multi-view learning framework named CIML, which discards the assumption of multi-view redundancy. Specifically, CIML considers the potential predictive capabilities of both common and unique information based on information theory. First, the common representation learning maximizes Gács-Körner common information to extract shared features and then compresses this information to learn task-relevant representations based on the Information Bottleneck (IB). For unique representation learning, IB is employed to achieve the most compressed unique representation for each view while simultaneously minimizing the mutual information between unique and common representations, as well as among different unique representations. Importantly, we theoretically prove that the learned joint representation is predictively sufficient for the downstream task. Extensive experimental results have demonstrated the superiority of our model over several state-of-the-art methods. The code is released on CIML

Abstract:
Camouflaged Object Detection (COD) is pivotal for segmenting objects that seamlessly blend into their surroundings. While prior endeavors demonstrate impressive performance through training on predefined labels, they heavily rely on labor-intensive data annotation and struggle to adapt to open-world scenarios. In this light, we propose RA-COD, a training-free paradigm that enables COD by retrieving the most similar samples from the prototype repository. The efficacy of RA-COD hinges on 1) capturing the nuanced resemblance between objects and their environments and 2) excelling in dense prediction tasks. To achieve (1), the crux lies in ensuring diversity and discriminability within the prototype repository. In this context, we propose GenPro, an automated pipeline for crafting Generative Prototypes. GenPro integrates a range of foundation models, including the Diffusion Model, Vision-Language Model, Segment Anything Model (SAM), and DINOv2, in a complementary manner that synergistically generates diverse and distinguishable prototype samples. To achieve (2), we propose C2F to retrieve camouflaged objects in a Coarse-to-Fine regime. We commence with pixel-level retrieval in the feature space, which generates a coarse mask that effectively captures class discrimination and object localization. Further refinement is achieved by extracting bounding boxes from this coarse mask to prompt SAM in generating mask proposals for region-level retrieval. Evaluations on four benchmarks showcase that RA-COD achieves state-of-the-art performance compared to existing training-free methods.

Abstract:
Tensor-based multi-view clustering algorithms have attracted considerable attention due to their superior clustering performance. However, these algorithms typically treat each view independently, failing to utilize the complementary information across all views, thus lacking globality. Additionally, employing low-rank tensor constraints to extract consistent information among views may result in the loss of important information due to weak consistency constraints. These limitations significantly hinder the clustering performance. To address these issues, we propose Simple Multi-view Tensor Clustering (SimMTC), which achieves globality and strong consistency. SimMTC first applies Fast Fourier Transform (FFT) to the anchor graphs to obtain high-frequency and low-frequency information, which encode similarities between samples and anchors from all views, thereby capturing global information. Orthogonal tensor factorization is then conducted in the frequency domain. Moreover, a novel strong consistency constraint based on FFT is introduced, which enhances the extraction of consistent information in the frequency domain. What’s more, an efficient alternating optimization algorithm is designed to solve the optimization problem in SimMTC. Finally, extensive experiments on real-world datasets demonstrate that SimMTC achieves state-of-the-art clustering performance. The code has been made publicly available on GitHub at: https://github.com/haonanxin/SimMTC_code

Abstract:
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain by learning domain-invariant representations. Motivated by the recent success of Vision Transformers (ViTs), several UDA approaches have adopted ViT architectures to exploit fine-grained patch-level representations; we refer to these collectively as Transformer-based Domain Adaptation (TransDA), as distinct from CNN-based approaches. However, we have a key observation in TransDA: due to inherent domain shifts, patches (tokens) from different semantic categories across domains may exhibit abnormally high similarities, which can mislead the self-attention mechanism and degrade adaptation performance. To solve that, we propose a novel Patch-Adaptation Transformer (PATrans), which first identifies similarity-anomalous patches and then adaptively suppresses their negative impact on domain alignment, i.e., token calibration. Specifically, we introduce a Patch-Adaptation Attention (PAA) mechanism to replace the standard self-attention mechanism, which consists of a weight-shared triple-branch mixed attention mechanism and a patch-level domain discriminator. The mixed attention integrates self-attention and cross-attention to enhance intra-domain feature modeling and inter-domain similarity estimation. Meanwhile, the patch-level domain discriminator quantifies the anomaly probability of each patch, enabling dynamic reweighting to mitigate the impact of unreliable patch correspondences. Furthermore, we introduce a contrastive attention regularization strategy, which leverages category-level information in a contrastive learning framework to promote class-consistent attention distributions. Extensive experiments on four benchmark datasets demonstrate that PATrans attains significant improvements over existing state-of-the-art UDA methods (e.g., 89.2% on VisDA-2017). Code is available at: https://github.com/YSY145/PATrans

Abstract:
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC³, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC³. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC³ outperforms the octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC³ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; PAMI Research Group, Department of Computer and Information Science, Centre for Artificial Intelligence and Robotics, Institute of Collaborative Innovation, University of Macau, Macau, SAR, China; School of Electro-Mechanical Engineering, Xidian University, Xi’an, China; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; School of Data Science, The Chinese University of Hong Kong, Shenzhen Campus, Shenzhen, China

Abstract:
Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. Taking into account the local minimum of gradient descent, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.
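The cosine annealing mechanism with warm restarts mentioned above can be sketched as a simple learning-rate schedule: within each cycle the rate decays from a maximum to a minimum along half a cosine wave, then jumps back to the maximum. The cycle length and the `lr_max` and `lr_min` values below are placeholder assumptions, not the settings used for CSRNet.

```python
import math


def cosine_warm_restart_lr(step, cycle_len, lr_max=1e-3, lr_min=1e-6):
    """Learning rate at `step` for cosine annealing with fixed-length warm restarts."""
    t_cur = step % cycle_len  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / cycle_len))
    return lr_min + (lr_max - lr_min) * cosine


# Print the start, middle, and end of the first cycle, then the restart.
for step in (0, 50, 99, 100):
    print(step, cosine_warm_restart_lr(step, cycle_len=100))
```

The periodic jump back to `lr_max` is the "warm restart": it lets the optimizer leave a poor local minimum that the decayed rate had settled into.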

Abstract:
Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two features and effectively integrate them, we also introduce a cross-domain feature adapter, which could overcome their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that performing the transformer on the whole feature map is unnecessary because of the similarity of local representations. We design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method’s design. Our code is publicly available at https://github.com/ShineFox/SigMa

Abstract:
Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in much fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide performance better than or equal to that of their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon
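To make the idea of a Padé approximant concrete, here is a scalar sketch of a rational function P(x)/Q(x). It is only an illustration: the actual Paon acts on feature maps with learned kernels, and the pole-free denominator 1 + |b1·x + b2·x² + ...| used here is an assumption of this sketch rather than the paper's formulation. All names are hypothetical.

```python
def poly(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ... with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def rational_unit(x, num_coeffs, den_coeffs):
    """Return P(x) / (1 + |b1*x + b2*x^2 + ...|).

    The denominator is always >= 1, so the function has no poles
    (an assumed safeguard for this sketch).
    """
    q = poly([0.0] + list(den_coeffs), x)
    return poly(num_coeffs, x) / (1.0 + abs(q))


# With an empty denominator the unit reduces to a plain polynomial;
# a degree-1 numerator then gives the usual affine (linear) neuron.
print(rational_unit(2.0, [1.0, 3.0], []))
# A non-trivial denominator bends the response.
print(rational_unit(1.0, [0.0, 1.0], [1.0]))
```

The first call shows how a simpler model falls out as a special case when the denominator coefficients vanish, which mirrors the abstract's remark that earlier neuron models are special cases.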

Abstract:
Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM’s performance on representative 2D blind deblurring and 3D pose-free sparse-view reconstruction tasks, demonstrating its superior efficacy over prior arts. The project page can be found at https://ai4imaging.github.io/latentdem/

Abstract:
Deep unfolding networks (DUNs), combining conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing and super-resolution. A DUN unfolds the iterative optimization steps into a stack of sequentially linked blocks. Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM) which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known level. However, existing DUNs suffer from two critical limitations: 1) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and 2) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting different denoising levels between stages with compressed memory usage for a more efficient DUN. LoRun introduces a novel paradigm where a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step. This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to N times parameter reduction for an N-stage DUN with on-par or better performance. Extensive experiments conducted on three IR tasks validate the efficiency of our method.
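The "up to N times" figure can be made concrete with a back-of-the-envelope parameter count. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: each denoiser is modelled as `layers` square `dim x dim` weight matrices, and each LoRA adapter adds two low-rank factors of shape `dim x rank` and `rank x dim` per layer. The function names and the numbers are hypothetical.

```python
def unfolded_params(num_stages, layers, dim):
    """Parameter count when every stage owns a full denoiser (assumed layout)."""
    return num_stages * layers * dim * dim


def shared_lora_params(num_stages, layers, dim, rank):
    """Parameter count for one shared denoiser plus per-stage low-rank adapters."""
    base = layers * dim * dim                          # shared once across stages
    adapters = num_stages * layers * 2 * dim * rank    # two factors per layer
    return base + adapters


# Hypothetical configuration chosen only to make the arithmetic easy to follow.
full = unfolded_params(num_stages=8, layers=10, dim=256)
shared = shared_lora_params(num_stages=8, layers=10, dim=256, rank=4)
reduction = full / shared  # 5,242,880 / 819,200 = 6.4
```

In this toy setting the reduction is 6.4 times for 8 stages. The ratio approaches the number of stages as the adapter rank becomes small relative to `dim`, which matches the "up to N times" wording.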

Abstract:
Sketch-based Person Retrieval (SBPR) aims to identify and retrieve a target individual across non-overlapping camera views using professional sketches as queries. In practice, sketches drawn by different artists often present diverse painting styles unpredictably. The substantial style variations among sketches pose significant challenges to the stability and generalizability of SBPR models. Prior works attempt to mitigate style variations through style manipulation methods, which inevitably undermine the inherent structural relations among multiple sketch features. This leads to overfitting on existing training styles and struggles with generalizing to new, unseen sketch styles. In this paper, we introduce FreeStyle, an innovative style-inclusive framework for SBPR, built upon the foundational CLIP architecture. FreeStyle explicitly models the relations across diverse sketch styles via style consistency enhancement, enabling dynamic adaptation to both seen and unseen style variations. Specifically, Diverse Style Semantic Unification is first devised to enhance the style consistency of each identity at the semantic level by introducing objective attribute-level semantic constraints. Meanwhile, Diverse Style Feature Squeezing tackles unclear feature boundaries among identities by concentrating the intra-identity space and separating the inter-identity space, thereby strengthening style consistency at the feature representation level. Additionally, considering the feature distribution discrepancy between sketches and photos, an identity-centric cross-modal prototype alignment mechanism is introduced to facilitate identity-aware cross-modal associations and promote a compact joint embedding space. Extensive experiments validate that FreeStyle not only achieves stable performance under seen style variations but also demonstrates strong generalization to unseen sketch styles.

Abstract:
Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking within monocular video. This limitation arises because advancements in 3D multi-object tracking have predominantly relied on sensor-based data (e.g., point clouds, depth sensors) that lacks corresponding language descriptions. Moreover, natural language descriptions in existing VLT literature often suffer from redundancy, impeding the efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking using monocular video. We introduce a comprehensive framework that includes (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, with each sequence accompanied by three freely generated, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model specifically designed for MoMo-3DVLT. This model integrates a multimodal feature extractor, a visual language encoder-decoder, and modules for detection and tracking, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that integrates a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available at https://github.com/hongkai-wei/MoMo-3DVLT.

Affiliations: School of Computer and Computing Science, Hangzhou City University, Hangzhou, Zhejiang, China; School of Environmental and Chemical Engineering, Shanghai University, Shanghai, China; School of Computer Science, the School of Artificial Intelligence, Optics and Electronics (iOPEN), and the Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Electrical and Electronic Engineering, Nanyang Technological University, Jurong West, Singapore

Abstract:
We propose a novel Greedy Graph Cut (GGC) algorithm to address the graph partitioning problem. The algorithm begins by treating each data point as an individual cluster and iteratively merges cluster pairs that maximize the reduction in the global objective function until the desired number of clusters is achieved. We provide a theoretical proof of the monotonic convergence of the objective function values throughout this process. To improve computational efficiency, the algorithm restricts merging operations to adjacent clusters, resulting in a computational complexity that scales nearly linearly with the sample size. A significant advantage of our greedy approach is its deterministic nature, which ensures consistent results across multiple runs. This stands in contrast to many existing algorithms that are sensitive to random initialization effects. We demonstrate the effectiveness of the proposed algorithm by applying it to the Normalized Cut (N-Cut) problem, a well-studied variant of graph partitioning. Extensive experimental results show that GGC consistently outperforms the conventional two-stage optimization approach—which involves eigendecomposition followed by k-means clustering—in solving the N-Cut problem. Furthermore, comparative analyses reveal that GGC achieves superior performance compared to several state-of-the-art clustering algorithms.
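The merge loop described above can be sketched for the Normalized Cut objective as follows. This is a minimal illustration, not the authors' implementation: it rescans every adjacent pair in each round, so it does not reproduce the near-linear complexity claimed in the abstract. The function names, the zero-diagonal assumption on the weight matrix, and the index-order tie-breaking rule are assumptions of this example.

```python
def ncut_term(cut, vol):
    """Contribution cut(C) / vol(C) of one cluster to the N-Cut objective."""
    return cut / vol if vol > 0 else 0.0


def greedy_ncut(weights, num_clusters):
    """Greedily merge adjacent clusters until `num_clusters` remain.

    `weights` is a symmetric, non-negative matrix (list of lists) with a zero
    diagonal. Ties are broken by index order, so the result is deterministic.
    """
    n = len(weights)
    members = {i: [i] for i in range(n)}
    vol = {i: float(sum(weights[i])) for i in range(n)}
    cut = dict(vol)  # zero diagonal: all of a singleton's weight is cut
    link = {
        i: {j: float(weights[i][j]) for j in range(n) if j != i and weights[i][j] > 0}
        for i in range(n)
    }

    while len(members) > num_clusters:
        best = None
        for a in sorted(link):
            for b in sorted(link[a]):
                if a < b:
                    merged = ncut_term(cut[a] + cut[b] - 2 * link[a][b], vol[a] + vol[b])
                    gain = ncut_term(cut[a], vol[a]) + ncut_term(cut[b], vol[b]) - merged
                    if best is None or gain > best[0]:
                        best = (gain, a, b)
        if best is None:
            break  # no adjacent pair left (disconnected graph)

        _, a, b = best
        # Merge cluster b into cluster a and update the bookkeeping.
        cut[a] = cut[a] + cut[b] - 2 * link[a][b]
        vol[a] += vol[b]
        members[a] += members.pop(b)
        del link[a][b]
        for c, w in link.pop(b).items():
            if c == a:
                continue
            link[a][c] = link[a].get(c, 0.0) + w
            link[c][a] = link[c].get(a, 0.0) + w
            del link[c][b]
        del cut[b], vol[b]

    return sorted(sorted(m) for m in members.values())


# Two triangles joined by one weak edge (2-3).
weights = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
result = greedy_ncut(weights, 2)  # [[0, 1, 2], [3, 4, 5]]
```

Because no random initialization is involved, repeated calls on the same graph return the same partition, which is the deterministic behavior the abstract highlights.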

Affiliations: Department of Information Science and Engineering, Ocean University of China, Qingdao, China; Innovation School of Artificial Intelligence, Hefei University of Technology, Hefei, China; Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong; School of Systems and Computing, University of New South Wales, Canberra, Australia; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; School of Computer Science, The University of Adelaide, Adelaide, SA, Australia; School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, China

Abstract:
Self-supervised learning has shown potential in fine-grained visual recognition (FGVR). However, existing self-supervised learning methods are often susceptible to irrelevant patterns during training and lack the ability to capture the critical subtle differences in FGVR, leading to suboptimal performance. Moreover, existing approaches focus primarily on uni-modal visual concepts. Despite the emergence of powerful vision-language models (VLMs) in various high-level vision tasks, their potential in self-supervised FGVR remains largely unexplored. To this end, we propose a novel self-supervised learning (LearnMat) framework that effectively filters out irrelevant feature interference and extracts more important and subtle discriminative features during training. Specifically, LearnMat consists of two key modules: the semantic awareness module (SAM) and the insight extraction module (IEM). In the SAM, we introduce a novel vision–language–grounded semantic distillation strategy using a corpus of generic, category-agnostic textual attributes that injects explicit semantic constraints into self-supervised training and improves robustness to background interference. Complementarily, the IEM exploits gradient-based signals from the input image to highlight subtle differences and localize key discriminative regions, mitigating inter-class similarity and intra-class variation, and enhancing fine-grained discrimination. Extensive experiments across multiple popular FGVR datasets show that LearnMat significantly outperforms recent state-of-the-art methods, highlighting its marked effectiveness. Our code is available at https://github.com/Heng-CHY/LearnMat

Abstract:
Recent advances in learning-based underwater image enhancement have achieved remarkable progress. However, the inherent diversity and complexity of underwater scenes still limit the ability of existing approaches to simultaneously restore fine structural details and global image layouts. To address this challenge, we propose a Resonant Fusion (ReFu) framework that explicitly leverages complementary information in both spatial and frequency domains. Specifically, we design a frequency decomposer and a spatial decomposer to capture high- and low-frequency cues from different perspectives. A resonant fuser is then introduced to adaptively integrate high-frequency resonances for detail refinement and low-frequency resonances for structural consistency. This fine-grained cross-domain fusion significantly improves structural preservation and detail enhancement, thereby generating visually more natural and perceptually friendly underwater images. Extensive quantitative and qualitative evaluations across diverse underwater benchmarks show that ReFu consistently surpasses state-of-the-art methods by a clear margin. Comprehensive ablation studies further validate the effectiveness of each module and prove the necessity of the proposed ReFu mechanism. Our code is available at https://github.com/CircleQa/ReFu-main

Abstract:
Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO’s attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks. The code is available at: https://github.com/PRIS-CV/FakeReasoning

Abstract:
Transformer-based models have demonstrated great promise in single image super-resolution (SISR), but our investigations find significant redundancy in terms of high mutual information across the attention maps, which is associated with reduced efficiency and degraded performance of SOTA models. To address the problem, here we propose a low redundancy attention network (LRAN). First, to mitigate the redundancy among heads, we introduce in the self-attention computation a multi-element mechanism, which allows for the incorporation of various types of self-attention, thus increasing inter-head diversity. Second, to address the redundancy among blocks, we propose the encapsulated architecture, in which an enhanced local perception unit and a gated multi-layer perceptron are designed to capture local information. Specifically, this architecture incorporates a single self-attention layer between several MLP layers. Subsequently, the proposed gated multi-layer perceptron significantly enhances the SR quality. Extensive experiments demonstrate that LRAN outperforms SOTA models in the task of lightweight SR, achieving a better trade-off between quality and speed. For instance, the proposed LRAN-light surpasses SwinIR-light by 0.32 dB PSNR in ×4 SR on Urban100, while running 4× faster.

Abstract:
The out-of-distribution (OOD) property in data is deemed a main challenge hindering the generalization ability of machine learning algorithms. However, the underlying reasons for this property remain an intriguing and open question that has yet to be fully understood. In this paper, we seek to enhance our understanding of the OOD phenomenon by framing it as a problem of distribution shift and addressing it through two complementary causal perspectives. The first is a generative causal view that elucidates the data generation process. We introduce a novel three-dimensional coordinate system to represent three fundamental distribution shifts, illustrating their role in various OOD generalization problems. The second is an anti-causal view that focuses on the model learning process. We develop an effective approach dubbed Counterfactual Risk Minimization (CRM) to address arbitrary distribution shifts in a unified framework. Additionally, we introduce a new multi-domain visual recognition dataset called CONA to facilitate further exploration of OOD generalization. We conduct evaluations of CRM alongside several state-of-the-art competitors on four benchmark datasets under the three distribution shifts. The results not only affirm CRM’s superiority but also shed light on potential future directions. Code and data: https://github.com/muliyangm/CRM

Abstract:
Deep multi-view clustering aims to exploit the rich semantic information contained in heterogeneous multi-view data to uncover the underlying relationships among samples. However, existing deep multi-view clustering models often overlook inter-cluster separability and the effective integration of semantic information across views, resulting in insufficient feature discriminability and consequently limited clustering performance. To address the above issues, this paper proposes a novel deep multi-view clustering method via cluster-semantic guidance. We separate clusters to enhance inter-cluster discriminability, while incorporating a knowledge distillation mechanism to ensure cluster stability and facilitate the learning of clustering-friendly representations. Furthermore, by aggregating sample-level semantic information, the model is guided to follow a cluster-oriented learning strategy that promotes the extraction of discriminative features, thereby strengthening the sample representation capability. Our method effectively learns discriminative and clustering-friendly representations, guiding the model to acquire distinctive feature embeddings from a cluster-oriented perspective. Our comprehensive experiments across datasets of varying scales confirm the model’s effectiveness, showing superior clustering performance over existing state-of-the-art methods.

Abstract:
Precise segmentation of out-of-distribution (OoD) objects, herein referred to as anomalies, is crucial for the reliable deployment of semantic segmentation models in open-set, safety-critical applications, such as autonomous driving. Current anomalous segmentation benchmarks predominantly focus on favorable weather conditions, resulting in untrustworthy evaluations that overlook the risks posed by diverse meteorological conditions in open-set environments, such as low illumination, dense fog, and heavy rain. To bridge this gap, this paper introduces ComSAmy, a Complex Scenarios Anomaly segmentation benchmark. ComSAmy encompasses a wide spectrum of adverse weather conditions, dynamic driving environments, and diverse anomaly types to comprehensively evaluate model performance in realistic open-world scenarios. Our extensive evaluation of several state-of-the-art anomalous segmentation models reveals that existing methods demonstrate significant deficiencies in such challenging scenarios, highlighting their serious safety risks for real-world deployment. To address this, we propose a novel energy-entropy learning (EEL) strategy that integrates the complementary information from energy and entropy to bolster the robustness of anomaly segmentation under complex open-world environments. Additionally, a diffusion-based anomalous training data synthesizer is proposed to generate diverse and high-quality anomalous images to enhance the existing copy-paste training data synthesizer. Extensive experimental results on both public and ComSAmy benchmarks demonstrate that our proposed diffusion-based synthesizer with energy and entropy learning (DiffEEL) framework serves as an effective and generalizable plug-and-play method to enhance existing models, yielding an average improvement of around 4.96% in AUPRC and 9.87% in FPR95.
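For readers unfamiliar with the two quantities named in the energy-entropy strategy, the sketch below computes the commonly used free-energy score (negative log-sum-exp of the logits, with temperature 1) and the Shannon entropy of the softmax distribution for one pixel. How the paper combines or trains on these signals is not stated in the abstract, so this example only computes them side by side; the function name and the example logits are hypothetical.

```python
import math


def energy_and_entropy(logits):
    """Return (energy, entropy) for one pixel's class logits.

    energy  = -log(sum(exp(logit)))      (free-energy score, temperature 1)
    entropy = -sum(p * log(p))           (Shannon entropy of the softmax)
    """
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    energy = -log_sum_exp
    probs = [math.exp(z - log_sum_exp) for z in logits]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return energy, entropy


# A pixel the model is confident about versus one with nearly flat logits.
confident = energy_and_entropy([10.0, 0.0, 0.0])
uncertain = energy_and_entropy([0.1, 0.0, 0.0])
```

In this toy case the confident pixel has both lower energy and lower entropy than the uncertain one. The two scores respond to different aspects of the logits (overall magnitude versus flatness of the distribution), which is one way to read the abstract's reference to complementary information.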

Abstract:
Efficient talking face video coding and control are crucial in modern video communication, reshaping how individuals connect, collaborate, and interact. Coding seeks to reduce transmission costs, while control enables the realization of user-customizable facial expressions and head poses in the transmitted videos. However, the compression efficiency of the common paradigm of applying control algorithms before video coding is not satisfactory. In this paper, we propose an efficient, Controllable Generative Talking Face Video Coding (CoFaCo) framework, which seamlessly integrates control into the coding process. Specifically, CoFaCo projects talking face videos into ultra-compact and semantic feature representations that can be customized by users before compression. To enable independent controls of pose and expression, we design a set of sophisticated losses to accurately decouple the pose and expression direction codes. Once the decoupled direction codes and the semantic face representations are obtained, the pose and expression control modules can be effectively learned to generate decoupled, controlled pose and expression direction codes. The controlled direction codes are subsequently smoothed to enhance temporal consistency in the controlled video output by the generators. Experimental results demonstrate that CoFaCo achieves competitive compression efficiency in ultra-low bit rate video reconstruction and control tasks, providing valuable insights for advancing face video communication with diverse control capabilities.

Abstract:
Few-shot semantic segmentation (FSS) aims to segment unseen-category objects given only a few annotated samples. Although significant progress has been made in the field of FSS, selecting an appropriate feature matching method remains a challenge. Traditional prototype-based methods can preserve high-level semantic features, but they tend to lose detailed information. On the other hand, pixel-level comparison methods retain fine-grained details but are vulnerable to distractors and noise, leading to poor robustness. To address these issues, this paper proposes a target-agnostic object-based method. Specifically, we propose a set of learnable “object queries” to extract object features, which preserve both high-level semantic information and fine-grained details. Additionally, during the training phase, we exploit the prior knowledge of foreground and background embedded in the samples to enhance the model’s performance. In the inference phase, the model utilizes both the support set and the learned prior knowledge to perform segmentation tasks, mitigating the data distribution bias caused by limited samples. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in both accuracy and robustness. Code is available at https://github.com/wenbo456/OTBNet

Abstract:
Fine-grained image analysis is widely recognized as highly challenging, since distinguishing individual differences within a certain category, species, or type often depends on tiny, subtle patterns. However, learning fine-grained semantic categories from these subtle part patterns is inherently fragile, as they can easily be overwhelmed by the dominant patterns resting in the coarse-category information. Therefore, how to enhance the relation between the fine-grained semantics and these subtle patterns is the key. To push this frontier, a novel semantic-part alignment (SPA) learning scheme is proposed in this paper. Its general idea is to firstly measure the relevance of each part to the fine-grained semantics, and then regularize the fine-grained visual representation learning. Specifically, it consists of three key components, namely, joint semantic-part modeling, semantic-part set modeling, and optimal semantic-part transport. The joint semantic-part modeling associates each part in an image with the fine-grained semantics in a latent space. Then, the optimal semantic-part transport component is devised to enhance the relation between fine-grained semantic embeddings and the discriminative part embeddings. Notably, the proposed SPA is plug-in-and-play, easy-to-implement, and insensitive to the latent embedding dimension and loss weight. Experiments show the proposed method can substantially boost performance on multiple fine-grained image analysis tasks across various baselines.

Abstract:
Vector-Quantization (VQ) based discrete generative models are widely used to learn powerful high-quality (HQ) priors for blind image restoration (BIR). In this paper, we diagnose the side effects of the discrete VQ process essential to VQ-based BIR methods: 1) confining the representation capacity of the HQ codebook, 2) being error-prone for code index prediction on low-quality (LQ) images, and 3) under-valuing the importance of the input LQ image. These motivate us to learn a continuous feature representation of the HQ codebook for better restoration performance than using the discrete VQ process. To further improve the restoration fidelity, we propose a new Self-in-Cross-Attention (SinCA) module to augment the HQ codebook with the feature of the input LQ image, and perform cross-attention between the LQ feature and the input-augmented codebook. In this way, our SinCA leverages the input LQ image to enhance the representation of the codebook for restoration fidelity. Experiments on four typical VQ-based BIR methods demonstrate that, by replacing the VQ process with a transformer using our SinCA, they achieve better quantitative and qualitative performance on blind image super-resolution and blind face restoration. The code and pre-trained models are publicly released at https://github.com/lhy-85/SinCA

Affiliations: School of Artificial Intelligence, Xidian University, Xi’an, China; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, Jiangxi, China; Department of Electrical Communication Engineering, Indian Institute of Science, Bengaluru, India; CNRS, CentraleSupelec, Laboratoire des Signaux et Systèmes, Université Paris-Saclay, Gif-sur-Yvette, France; School of Electronic and Information Engineering, Chongqing Three Gorges University, Chongqing, China; School of Artificial Intelligence and State Key Laboratory of Electromechanical Integrated Manufacturing of High-Performance Electronic Equipments, Xidian University, Xi’an, China

Abstract:
Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.

Abstract:
Cross-scene hyperspectral image classification aims to identify a new scene in the target domain using knowledge learned from the source domain with limited training samples. Existing cross-scene alignment approaches focus on aligning the global feature distribution between the source and target domains while overlooking the fine-grained alignment at different levels. Moreover, they mainly use Transformer architectures to model long-range dependencies across different channels but confront efficiency challenges due to their quadratic complexity, which limits classification performance in unsupervised domain adaptation tasks. To address these issues, a new domain-adaptive Mamba (DAMamba) is proposed for cross-scene hyperspectral image classification. First, a spectral-spatial Mamba is developed to extract high-order semantic features from the input data. Then, a domain-invariant prototype alignment method is proposed from three perspectives, i.e., intra-domain, inter-domain, and mini-batch, to produce reliable pseudo-labels and mitigate the spectral shift between the source and target domains. Finally, a fully connected layer is applied to the aligned features in the target domain to obtain the final classification results. Extensive evaluations across diverse cross-scene datasets demonstrate that our DAMamba outperforms existing state-of-the-art methods in classification accuracy and computing time. The code of this paper is available at https://github.com/PuhongDuan/DAMamba

Abstract:
Stereo image restoration in adverse environments, such as low-light conditions, rain, and low resolution, requires effective exploitation of cross-view complementary information to recover degraded visual content. In monocular image restoration, frequency decomposition has proven effective, where high-frequency components aid in recovering fine textures and reducing blur, while low-frequency components facilitate noise suppression and illumination correction. However, existing stereo restoration methods have yet to explore cross-view interactions by frequency decomposition, which is a promising direction for enhancing restoration quality. To address this, we propose a frequency-aware framework comprising a Frequency Decomposition Module (FDM), Detail Interaction Module (DIM), Structural Interaction Module (SIM), and Adaptive Fusion Module (AFM). FDM employs learnable filters to decompose the image into high- and low-frequency components. DIM enhances the high-frequency branch by capturing local detail cues through deformable convolution. SIM processes the low-frequency branch by modeling global structural correlations via a cross-view row-wise attention mechanism. Finally, AFM adaptively fuses the complementary frequency-specific information to generate high-quality restored images. Extensive experiments demonstrate the efficacy and generalizability of our framework across three diverse stereo restoration tasks, where it achieves state-of-the-art performance in low-light enhancement and rain removal, alongside highly competitive results in super-resolution. Our code is available at https://github.com/C2022J/FDIN

Abstract:
In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain an opaque box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyses on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal and verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.

Abstract:
Recently, backdoor attacks on Deep Neural Networks (DNNs) have raised urgent security threats, which can manipulate the behavior of an attacked model by embedding the backdoor trigger into the input. Since triggers can be designed to be stealthy and hard to recognize by the naked eye, segmenting these triggers in backdoor samples becomes a significant challenge. However, finding triggers embedded by the attacker can be crucial for analyzing the attacks and formulating a defense strategy. Therefore, in this paper, we propose the Backdoor Trigger Segmentation (BTS) task with a comprehensive benchmark consisting of 8 attack methods, 8 unique triggers, and 179 attack settings for image or text data. Moreover, we construct a mathematical system for BTS, abstracting various backdoor triggers into a unified theoretical framework. Based on the theoretical guarantees, we propose a unified Trigger Locator (TriLoc) algorithm to segment various triggers in backdoor samples of both image and text modalities, without prior knowledge of triggers. Extensive experimental results on our benchmark demonstrate the superior performance of our algorithm compared to state-of-the-art methods. Our benchmark and code are available at https://github.com/LivXue/Backdoor-Trigger-Segmentation

Abstract:
Instruction tuning has become a widely adopted approach for aligning large multimodal models (LMMs) with human intent. It enables multi-task joint training through unified data formats. However, as new vision-language tasks constantly emerge, exhaustive joint training of all tasks becomes impractical. Continual learning offers a more flexible and resource-efficient alternative, enabling incremental training of LMMs on emerging tasks. This study investigates two fundamental questions when applying continual learning to instruction tuning of LMMs: 1) Do LMMs suffer from catastrophic forgetting during continual instruction tuning? 2) Can existing continual learning methods be effectively applied to continual instruction tuning of LMMs? A comprehensive study was conducted to answer these questions. First, we establish the first benchmark for continual instruction tuning of LMMs and reveal the phenomenon of catastrophic forgetting in this setup. Second, we integrate and adapt traditional continual learning approaches to this setting, demonstrating the effectiveness of these strategies to varying degrees in different scenarios. Third, we explore task-similarity dynamics between pairs of vision-language tasks and propose task-similarity-informed regularization and model expansion methods. Experimental results show that our approach can consistently boost the model’s performance.

Abstract:
Deep anomaly detection aims to provide robust and efficient classifiers for zero-shot (unsupervised, UNS) and few-shot (imbalanced supervised, IMS) settings. However, current models still struggle on edge-case normal samples and are often unable to keep high performance over different scales of anomalies. Additionally, there is a lack of a unified framework that efficiently addresses both UNS and IMS settings. To address these limitations, we present a novel two-stage method which leverages multi-scale normal prototypes during training to compute an anomaly deviation score. First, we employ a novel memory-augmented contrastive learning scheme to jointly learn representations and memory modules across multiple scales. This allows us to effectively capture subtle features of normal data while adapting to varying levels of anomaly complexity. Then, we train an efficient anomaly distance-based detector that computes spatial deviation maps between the learned prototypes and incoming observations. Our model outperforms the SoTA on a wide range of anomalies, including object, style, and local anomalies, as well as industrial inspection and face anti-spoofing, while being on par with SoTA out-of-distribution detectors. Notably, it stands as the first model capable of maintaining exceptional performance across both settings.
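
The idea of a spatial deviation map computed against normal prototypes can be illustrated with a minimal sketch. It assumes a single scale, Euclidean distance, and a max over patches as the image-level score; none of these choices are stated in the abstract, and the function names are hypothetical.

```python
import math


def deviation_score(feature, prototypes):
    """Distance from one feature vector to its nearest normal prototype."""
    return min(math.dist(feature, p) for p in prototypes)


def image_score(patch_features, prototypes):
    """Build a per-patch deviation map and score the image by its worst patch.

    Taking the maximum is an assumption made for illustration; a learned
    detector could aggregate the map differently.
    """
    deviation_map = [deviation_score(f, prototypes) for f in patch_features]
    return max(deviation_map), deviation_map


# Toy data: two normal prototypes and three patch features (hypothetical).
prototypes = [(0.0, 0.0), (1.0, 1.0)]
patches = [(0.1, 0.0), (1.0, 0.9), (3.0, 4.0)]
score, deviation_map = image_score(patches, prototypes)
```

In this toy case the third patch lies far from both prototypes and therefore dominates the image-level score.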

Abstract:
Dataset distillation improves neural network training efficiency by compressing large real datasets into compact synthetic datasets. Existing methods typically optimize matching objectives, such as aligning gradients, features, and trajectories between the synthetic and original datasets to ensure the distilled data retains essential properties for model training. However, many of these approaches rely on predefined distillation pools to streamline the process or treat all real data points equally, overlooking the dynamic nature of the synthetic dataset’s training requirements during optimization. To address these limitations, we propose Active Dataset Distillation via Dual-Space Informative Matching (ACDD), an active learning-based algorithm that dynamically selects the most informative real data subset to align with the synthetic dataset’s evolving needs. By adaptively refining the distillation pool, ACDD enhances training efficiency and generalization while ensuring the synthetic dataset effectively captures the original data’s key characteristics. ACDD operates through two interconnected loops: the dual-space active loop (DAL) and the distillation loop. DAL plays a key role by dynamically selecting samples that balance diversity and uncertainty, adding them to the target distillation pool to meet the evolving informational needs of the current distillation loop. As a result, ACDD enables the synthetic dataset to achieve superior performance compared to SOTA methods across multiple benchmarks, including SVHN, CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet subset. Moreover, ACDD reduces the required real dataset to just 20%–40% of the original, demonstrating its efficiency and effectiveness in data distillation.

Abstract:
Diffusion-based image super-resolution (SR) has shown strong potential in recovering high-fidelity details from low-resolution inputs. However, the need for tens or hundreds of sampling steps leads to substantial inference latency. Recent works attempt to accelerate this process via knowledge distillation, but often rely solely on pixel-level loss or overlook the fact that diffusion models capture different information across time steps. To address this, we propose TAD-SR, a time-aware diffusion distillation framework. Specifically, we introduce a novel score distillation strategy to align the score functions between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy eliminates the inherent bias in score distillation sampling (SDS) and enables the student models to focus more on high-frequency image details by sampling at smaller time steps. We further introduce a time-aware discriminator that exploits the teacher’s knowledge to differentiate real and synthetic samples across different noise scales, using explicit temporal conditioning. Extensive experiments on SR tasks demonstrate that TAD-SR outperforms existing single-step diffusion methods and achieves performance on par with multi-step state-of-the-art models.

Abstract:
Visible-Infrared Person Re-Identification (VI-ReID) that matches pedestrian images across visible and infrared modalities suffers from substantial modality discrepancies and intra-class variations. While existing methods typically address the modality gap via style alignment, they often lose identity-relevant semantics and overlook fine-grained inter-class nuances, such as body part contours and structural cues around the head, shoulders, or feet. To tackle these challenges, we propose an Identity-Compensated Style Distillation (ICSD) network that enforces cross-modality style consistency and enhances the discriminative power of modality-invariant features. Specifically, ICSD comprises two core components: (1) a Style Knowledge Distillation (SKD) module, which integrates Style Discrepancy Reduction (SDR) and Identity Knowledge Compensation (IKC) to align modality styles while preserving identity-relevant semantics; (2) an Identity Discrimination Amplification (IDA) module, which captures and enhances subtle inter-class differences by refining identity-specific cues, thereby facilitating more accurate discrimination between different pedestrians. Extensive experiments on three public benchmarks—SYSU-MM01, RegDB, and LLCM—demonstrate that ICSD consistently outperforms state-of-the-art methods, validating the effectiveness and complementarity of its components.

Abstract:
Multi-view clustering (MVC) based on anchor learning has been proven to be effective in improving clustering accuracy and efficiency. Existing MVC methods are mainly based on single-granularity anchor learning, that is, the number of anchors corresponding to different views is constant and consistent, which will lead to information redundancy or insufficient mining. In addition, aggregating anchors of varying scales from all views to obtain multi-view shared clustering results remains a problem to be further explored. To address the above problems, a novel MVC method named View-adaptive Multi-granularity Anchor Learning (VMAL) is proposed in this paper, where view-adaptive anchor pruning and view-shared sample clustering are jointly optimized. On the one hand, VMAL can dynamically adjust the optimal number of anchors for each view during optimization by exploiting the reconstruction error of samples. On the other hand, an intuitive and effective mapping-aggregation message passing strategy is cleverly designed, which first maps the anchor representations of different views to the cluster space and then transfers the obtained cluster information of anchors to the sample space through an aggregation matrix. As a byproduct, VMAL can directly obtain the discrete cluster distribution of samples without additional partitioning. Finally, an iterative optimization algorithm is developed to solve the proposed VMAL method. Experimental results on multiple datasets have demonstrated the superiority of VMAL in terms of clustering results when compared with other state-of-the-art methods.

Abstract:
3D scene CAD recomposition aims to reconstruct a given scene by retrieving and assembling CAD models from a database, so as to accurately simulate the geometric properties and spatial arrangement of the original environment. Recent methods learn this task through training on limited scan-to-CAD annotation data, which hinders their generalization to diverse real-world scenes. In this paper, we propose POSITION, an open-world 3D scene CAD recomposition method that constructs the 3D scene with CADs retrieved from an open-set database. POSITION is designed following a divide-and-conquer strategy. Firstly, we extract open-world multi-modal object representations from a captured 3D scene. Secondly, on top of the representations, we propose a coarse-to-fine retrieval method to retrieve CADs that visually, geometrically, and semantically match real objects. Thirdly, we present a physically plausible pose alignment method to adjust retrieved CAD models to maintain consistent geometry and layout with the observation. By decomposing the problem into well-defined subtasks, our approach achieves generalization across various scene types and scalable CAD databases without retraining or fine-tuning. Our approach demonstrates superior CAD recomposition performance on both the Scan2CAD and diverse real-world 3D scene datasets. Our project page: https://yangrongkun.github.io/position/

Affiliations: School of Physics and Electronic Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, the National Regional Key Technology Engineering Laboratory for Medical Ultrasound, the Marshall Laboratory of Biomedical Engineering, the School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, China; School of Computing and Mathematical Sciences, University of Leicester, Leicester, U.K.; Institute of Information Technology, University of Klagenfurt, Klagenfurt am Wörthersee, Austria; Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.

Abstract:
Existing stereo image quality assessment (SIQA) methods generally have limitations in binocular fusion and fine-grained perception modeling. To address these issues, we propose a Perception-Inspired Network for SIQA that simulates binocular difference-guided fusion, high-frequency sensitivity, and hierarchical perception mechanisms of the human visual system (HVS). First, a difference-guided binocular fusion (DGBF) module is designed to mimic the binocular difference sensitivity mechanism, which exploits difference information at both the feature level and the image level to optimize binocular fusion. Furthermore, image distortion primarily affects the high-frequency components, which are critical for perceptual quality. To reflect this, we propose a high-frequency enhancement module (HFEM) to simulate the human eye’s sensitivity to edge and texture distortions. Finally, to better achieve fine-grained perception modeling, we propose a hierarchical quality regression strategy that simulates the human perceptual process, from perceiving local details to forming a global quality judgment, thereby achieving a quality prediction more aligned with human subjective evaluation. Experimental results demonstrate that the proposed method outperforms mainstream approaches, achieving a PLCC of 0.9734 on the LIVE I database and a PLCC of 0.9632 on the LIVE II database.
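
For readers unfamiliar with the reported metric, PLCC is the Pearson linear correlation coefficient between predicted quality scores and subjective scores. The sketch below computes it directly from its definition. Quality-assessment protocols often fit a nonlinear mapping to the predictions first; the abstract does not say whether that was done, so the step is omitted here. The function name is illustrative.

```python
import math


def plcc(predicted, subjective):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(predicted)
    if n != len(subjective) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0.0 or var_s == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_s)
```

A value close to 1 means the predictions track the subjective scores almost linearly, which is how figures such as 0.9734 should be read.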

Abstract:
In practical applications of social media and the Internet, deepfake face images involve a plethora of unlabeled samples. To effectively identify unlabeled deepfake images, the domain adaptation technique has gained significant attention. It applies the knowledge learned from labeled samples (source domain) to unlabeled samples (target domain) in a cross-domain manner. However, the existing domain adaptation-based deepfake detection methods primarily focus on intra-type cross-domain scenarios. In this study, we propose an unsupervised domain adaptation-based deepfake face image detection method for extra-type cross-domain scenarios. The core idea of our approach lies in the development of a domain adaptation model that consists of Domain Tag Adversarial (DTA) and Domain Feature Alignment (DFA) algorithms, called DTA-DFA, which empowers the proposed method with strong cross-domain capability. The DTA is utilized to weaken the specificity within each domain, while DFA aligns the distribution between the source and target domains. Compared with the existing deepfake detection methods, the experimental results demonstrate that the proposed method dramatically enhances the extra-type cross-domain detection performance. Moreover, the DTA-DFA model also exhibits a remarkable ability to perform cross-domain detection from large-shot labeled samples to few-shot labeled samples, further verifying its powerful cross-domain capability. Code is released at https://github.com/QinQin741/DTA-DFA-DA-model

Abstract:
The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for Keys and Values. The CLB also incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers, thus enhancing feature interaction at different scales and improving overall efficiency. To further optimize computational efficiency, SCASeg compresses the channels of queries and keys into one dimension, creating strip-like patterns that reduce memory usage and increase inference speed compared to traditional vanilla cross-attention. Experiments show that SCASeg’s adaptable decoder delivers competitive performance across various setups, outperforming leading segmentation architectures on benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under diverse computational constraints.

Abstract:
Open set domain adaptation (OSDA) aims to transfer classification-oriented knowledge from a labeled source domain to an unlabeled target domain, which faces the challenges from unseen knowledge in open-set scenarios, i.e., unknown classes privileged to the target domain. Existing methods usually identify unknown classes from classifier prediction directly, which are sensitive to the intrinsic clustering structure and cluster numbers of the unknown class data. In this paper, inspired by the sample relation characterization ability of Optimal Transport (OT), we propose a new type of OT method for OSDA, namely, Target-relaxed Optimal Transport (TROT). Compared with existing OT with strict marginal constraints, TROT imposes a single-side relaxation to the mass requirement on the open-set target domain. Theoretically, we prove that such a relaxation can reduce mis-matches between known and unknown classes, which indicates the transport plan of TROT is promising to identify unknown classes. Methodologically, TROT can identify unknown classes adaptively and map the cross-domain shared data with a sparse plan assignment, which improves both the effectiveness and robustness of known class alignment; besides, a graph embedding with multi-cluster structure of unknown classes is designed to learn a discriminative metric space for open-set classification. Empirically, extensive evaluations are conducted on several image datasets, where TROT achieves significant performance improvements compared with existing techniques for visual recognition in open-set scenarios.

Abstract:
Face age synthesis (FAS) predicts a person’s future or past facial appearance. In FAS, modifying one facial attribute usually affects the generation of other attributes during face image generation. Current models directly learn entangled representations of age-related features, resulting in insufficient feature disentanglement, which consequently impairs their causal reasoning capability for FAS tasks. To this end, we propose a hierarchical causal learning model for face age synthesis (HCFace), which integrates hierarchical structures and causal relationships into the facial generative model. Specifically, we propose to leverage hierarchical causal relationships to align with facial features for feature disentanglement. Furthermore, we design a novel nonlinear mapping function that captures the true patterns of facial attribute changes with age, enhancing the disentanglement of these attributes. We conduct extensive experiments to validate the superiority of our proposed model. Compared to other advanced baseline methods, HCFace improves overall accuracy by 2.47%, with improvements of 9.75% and 9.69% in certain age-related attributes, such as skin and hair. Our source code is available at https://github.com/SE-hash/HCFace

Abstract:
Deep neural networks enriched with structural information have been widely employed for facial expression recognition tasks. However, these methods often depend on hierarchical information rather than facial properties to perform expression recognition. In this paper, we propose a cross-modal network with strong biological and structural information for facial expression recognition (CMNet). CMNet exploits facial symmetry by learning expression information separately from the whole face and from the left and right half faces, so as to extract complementary facial features. To prevent negative effects from fusing biological and structural information, a salient facial information refinement module extracts salient facial expression information to improve the stability of the resulting facial expression classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism is designed to align the expression information learned from the left and right half faces. Our experimental results demonstrate that CMNet outperforms several recent methods, such as SCN and LAENet-SA, for facial expression recognition. Code is available at https://github.com/hellloxiaotian/CMNet

Abstract:
With the prevalence of pre-trained vision-language models like CLIP, leveraging the generic knowledge embedded in CLIP for domain adaptation has proved to be a promising direction. However, most existing CLIP-based methods are limited to closed-set settings. This is primarily because CLIP needs the semantic labels of unknown classes for inference, thus making it not applicable to Open-Set Domain Adaptation (OSDA). To utilize the complementary roles of CLIP and the source model, our paper proposes a novel Semantic-guided Target Adaptation (SemTA) framework for OSDA in a training-free manner. Specifically, we introduce an unknown semantic discovery module. It uses the cluster centroids of the target data to obtain the semantic labels of unknown classes from the worldwide corpus. Then, the semantic-based inference can be performed with CLIP. Additionally, the dual sample attention mechanism is implemented to output sample-based inference. Representative features from both the source model and CLIP serve as the key to improve task specificity. Compared to previous OSDA methods which reject unknown data by confidence threshold, the proposed approach is more practical and offers better interpretability. Comprehensive evaluations on four benchmarks reveal our method sets a new state-of-the-art even without training. Our code will be publicly available soon.

Abstract:
Underwater image enhancement (UIE) is crucial for robust marine exploration, yet existing methods prioritize perceptual quality while overlooking irreversible semantic corruption that impairs downstream tasks. Unlike terrestrial images, underwater semantics exhibit layer-specific degradations: shallow features suffer from color shifts and edge erosion, while deep features face semantic ambiguity. These distortions entangle with semantic content across feature hierarchies, where direct enhancement amplifies interference in downstream tasks. Even if distortions are removed, the damaged semantic structures cannot be fully recovered, making it imperative to further enhance corrupted content. To address these challenges, we propose a task-driven UIE framework that redefines enhancement as machine-interpretable semantic recovery rather than mere distortion removal. First, we introduce a multi-scale underwater distortion-aware generator to perceive distortions across feature levels and provide a prior for distortion removal. Second, leveraging this prior and the absence of clean underwater references, we propose a stable self-supervised disentanglement strategy to explicitly separate distortions from corrupted content through CLIP-based semantic constraints and identity consistency. Finally, to compensate for the irreversible semantic loss, we design a task-aware hierarchical enhancement module that refines shallow details via spatial-frequency fusion and strengthens deep semantics through multi-scale context aggregation, aligning results with machine vision requirements. Extensive experiments on segmentation, detection, and saliency tasks demonstrate the superiority of our method in restoring machine-friendly semantics from degraded underwater images. Our code is available at https://github.com/gemyumeng/HSRUIE

Abstract:
Unsupervised domain adaptation semantic segmentation (UDASS) aims to perform dense prediction on the unlabeled target domain by training the model on a labeled source domain. In this field, self-training approaches have demonstrated strong competitiveness and advantages. However, existing methods often rely on additional training data (such as reference datasets or depth maps) to rectify the unreliable pseudo-labels, ignoring the cross-domain interaction between the target and source domains. To address this issue, in this paper, we propose a novel method for unsupervised domain adaptation semantic segmentation, termed Unlocking Cross-Domain Synergies (UCDS). Specifically, in the UCDS network, we design a new Dynamic Self-Correction (DSC) module that effectively transfers source domain knowledge and generates high-confidence pseudo-labels without additional training resources. Unlike the existing methods, DSC proposes a Dynamic Noisy Label Detection method for the target domain. To correct the noisy pseudo-labels, we design a Dual Bank mechanism that explores the reliable and unreliable predictions of the source domain, and conducts cross-domain synergy through Weighted Reassignment Self-Correction and Negative Correction Prevention strategies. To enhance the discriminative ability of features and amplify the dissimilarity of different categories, we propose Discrepancy-based Contrastive Learning (DCL). The DCL selects positive and negative samples in the source and target domains based on the semantic discrepancies among different categories, effectively avoiding the numerous false negative samples found in existing methods. Extensive experimental results on three commonly used datasets demonstrate the superiority of the proposed UCDS in comparison with the state-of-the-art methods. The project and code are available at https://github.com/wqh011128/UCDS

Abstract:
Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key to sparsification is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-p with random matrix theory to quantify the uncertainty of tokens (lower trustworthiness), thereby improving the robustness of the model. Our model can run a 4K (3840 × 2160) image in real time (40 fps) with 3.38M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.
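
The thresholding idea behind Min-p can be sketched as follows, borrowing the rule as it is commonly used in language-model sampling: keep an item only if its score is at least a fraction p of the largest score. How TSFormer scores tokens and how random matrix theory enters are not described in the abstract, so the scores below are hypothetical non-negative trust values and the function name is an assumption.

```python
def min_p_filter(scores, p=0.1):
    """Return indices of tokens whose score is at least p times the top score.

    `scores` are assumed to be non-negative trust values (an assumption for
    this sketch); `p` must lie in [0, 1].
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if not scores:
        return []
    threshold = p * max(scores)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical trust scores for four tokens; the threshold is 0.1 * 0.5 = 0.05.
kept = min_p_filter([0.5, 0.04, 0.3, 0.06], p=0.1)  # -> [0, 2, 3]
```

Because the threshold scales with the top score, the rule adapts to how peaked the score distribution is, unlike a fixed top-k cut.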

Abstract:
Hyperspectral image (HSI) classification demands models that can jointly capture long-range spatial relations and high-dimensional spectral structures while remaining scalable to large scenes and robust under limited supervision. Existing CNN-, Transformer-, and state-space-based approaches either suffer from restricted receptive fields, quadratic attention complexity, or directional biases that hinder dense pixel-wise prediction. To address these limitations, we propose Hi-RWKV, a hierarchical recurrent weighted key–value framework tailored for hyperspectral analysis. Hi-RWKV introduces three key innovations: 1) a spatial structure–guided bidirectional propagation mechanism that integrates global spatial context while preserving boundary fidelity via edge-aware gating; 2) a spectral identity–driven channel mixing module that incorporates learnable band embeddings and whitening transforms to enhance cross-band discriminability; and 3) a multi-stage hierarchical encoder that progressively refines spectral–spatial representations with strictly linear complexity. Together, these designs enable efficient, direction-free spectral–spatial reasoning essential for large-scale HSI interpretation. Extensive experiments on four benchmarks demonstrate that Hi-RWKV consistently achieves state-of-the-art accuracy under diverse training regimes. Ablation studies confirm that each proposed module offers complementary gains in boundary preservation, spectral discrimination, and data efficiency. By unifying scalable recurrence with hyperspectral-specific structural modeling, Hi-RWKV establishes a strong and efficient paradigm for high-resolution remote sensing. The logs and source data of this article are available at https://github.com/HSI-Lab/Hi-RWKV

Abstract:
Both classical and learned image transformations such as the discrete wavelet transforms (DWTs) and flow-based generative models provide semantically meaningful representations of images. In this paper, we exploit the expressiveness of these representations to propose a general method for improving the classification robustness of neural networks against real-world corruptions. The key idea is a novel adversarial attack that targets suitable low-dimensional subspaces in the transformed space while at the same time obeying the L^\infty-box in the pixel space. Subsequent training for adversarial robustness with this attack is then used as a proxy for achieving corruption robustness. We apply this approach with the discrete cosine transform (DCT), DWTs, and Glow with attacks that preserve low frequencies or the most relevant features, respectively. The resulting models are significantly more robust against a broad class of unseen common image perturbations compared to using the standard L^\infty-box, with only a minor sacrifice of natural accuracy. We provide an extensive ablation study, which shows that our method applies quite generally across two different color systems and choices of relevant parameters, and which also provides insight into why our method works.
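
To make the geometry of such an attack concrete, the sketch below builds a 1D perturbation that lives only in the high-frequency DCT coefficients (so low frequencies are preserved) and rescales it to fit an L^\infty-box of radius eps in the signal domain. It is an illustration under stated assumptions: a random direction stands in for whatever optimization the attack actually uses, the signal is 1D rather than an image, and the function names are hypothetical.

```python
import math
import random


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * c[k]
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def high_freq_perturbation(n, keep_low, eps, rng):
    """Random perturbation with zero low-frequency content and peak size eps.

    Rescaling (rather than clipping) keeps the perturbation inside the chosen
    DCT subspace, because the subspace is closed under scalar multiplication.
    """
    coeffs = [0.0] * keep_low + [rng.uniform(-1.0, 1.0) for _ in range(n - keep_low)]
    delta = idct(coeffs)
    peak = max(abs(d) for d in delta)
    if peak > 0.0:
        delta = [d * eps / peak for d in delta]
    return delta


delta = high_freq_perturbation(n=16, keep_low=4, eps=0.03, rng=random.Random(0))
```

Adding `delta` to a signal leaves its first `keep_low` DCT coefficients unchanged while moving no sample by more than `eps`.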

Abstract:
The emotional video captioning (EVC) task, which aims to generate factual descriptions based on the perceived subtle visual emotion cues, has received increasing research attention. However, EVC is essentially an objective video captioning task and ignores the subjective emotional reactions of video viewers, so it cannot reflect the personalized affective understandings that different viewers have of the same video. To fill this research gap, we investigate the subjective video captioning (SVC) task in this paper, which aims to generate emotional captions by incorporating viewers’ personalized emotional reactions on top of the EVC task. SVC is extremely challenging in two respects: 1) the correlative emotion perception between subjective and objective emotions and 2) the collaborative generation between emotional and factual information. To this end, we propose the Subjective-Objective Emotion-Correlated Generation Network (SO-ECGN) in this paper. Specifically, our SO-ECGN leverages the proposed dynamic mask attention and emotion domain shifting module to achieve objective emotion incremental learning, and then a subjective-objective emotion correlation module adaptively combines the emotions from the two perspectives to provide accurate emotion guidance (i.e., emotional polarity and intensity) for each generation step. Furthermore, an emotion-correlated decoder is proposed to generate subjective captions by adaptively referring to factual information and emotional information. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module, reaching 79.2% BLEU-1 and 45.1% CIDEr on the EmVidCap-L dataset.

Abstract:
Multi-exposure image fusion (MEF) is the main method for obtaining High Dynamic Range (HDR) images by fusing multiple images taken under various exposure values. In this paper, we propose and develop a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, so that the detail information and the base information interact at the same time. Specifically, we make use of Tikhonov regularization to model the base layer, and we present an efficient design to obtain the detail layer, which is able to capture more detailed information effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply the alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of the solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental examples are presented to demonstrate that the proposed model performs better than the other tested methods in terms of visual quality and several criteria; e.g., the proposed model gives the best Natural Image Quality Evaluator (NIQE) values, with 1%–10% improvement in the real image fusion experiments, and the best PSNR values, with 13%–20% improvement in the synthetic image fusion experiment.
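
A textbook Tikhonov-regularized base layer gives a feel for the base-detail split. The sketch assumes a single 1D signal x and the functional min_b ||b - x||^2 + lam * ||D b||^2, where D takes first differences; the abstract does not state the exact functional, and the coupling across exposures, the detail-layer design, and the ADMM solver are all omitted. The minimizer satisfies the tridiagonal system (I + lam * D^T D) b = x, solved here with the Thomas algorithm.

```python
def tikhonov_base(signal, lam):
    """Smooth base layer b solving (I + lam * D^T D) b = signal in 1D."""
    n = len(signal)
    if n == 1 or lam == 0.0:
        return [float(v) for v in signal]
    # D^T D for first differences: diagonal [1, 2, ..., 2, 1], off-diagonal -1.
    diag = [1.0 + lam * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = -lam
    # Thomas algorithm: forward elimination.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = off / diag[0]
    d_prime[0] = signal[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c_prime[i - 1]
        c_prime[i] = off / denom
        d_prime[i] = (signal[i] - off * d_prime[i - 1]) / denom
    # Back substitution.
    base = [0.0] * n
    base[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        base[i] = d_prime[i] - c_prime[i] * base[i + 1]
    return base


def base_detail_split(signal, lam):
    """Return (base, detail) with base + detail equal to the input signal."""
    base = tikhonov_base(signal, lam)
    detail = [s - b for s, b in zip(signal, base)]
    return base, detail
```

Larger values of `lam` push more of the signal's variation into the detail layer, while a constant signal is returned unchanged as its own base.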

Abstract:
In this study, we introduce EinsPT, an efficient instance-aware pre-training paradigm designed to reduce the transfer gap between vision foundation models and downstream instance-level tasks. Unlike conventional image-level pre-training that relies solely on unlabeled images, EinsPT leverages both image reconstruction and instance annotations to learn representations that are spatially coherent and instance discriminative. To achieve this efficiently, we propose a proxy–foundation architecture that decouples high-resolution and low-resolution learning: the foundation model processes masked low-resolution images for global semantics, while a lightweight proxy model operates on complete high-resolution images to preserve fine-grained details. The two branches are jointly optimized through reconstruction and instance-level prediction losses on fused features. Extensive experiments demonstrate that EinsPT consistently enhances recognition accuracy across various downstream tasks with substantially reduced computational cost, while qualitative results further reveal improved instance perception and completeness in visual representations. Code is available at github.com/feufhd/EinsPT

Abstract:
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
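
To make the low-/high-frequency split concrete, here is a minimal one-level 2D Haar DWT in NumPy. It is only a sketch of the decomposition step: the choice of the Haar wavelet, the sub-band names, and the function names are assumptions, and the differentiated enhancement applied by DWEM is not shown.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform; x must have even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0           # low-frequency (approximation) band
    high_h = (a - b + c - d) / 2.0        # differences along the width
    high_v = (a + b - c - d) / 2.0        # differences along the height
    high_d = (a - b - c + d) / 2.0        # diagonal differences
    return low, (high_h, high_v, high_d)

def haar_idwt2(low, highs):
    """Inverse of haar_dwt2: rebuild the full-resolution array."""
    high_h, high_v, high_d = highs
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (low + high_h + high_v + high_d) / 2.0
    x[0::2, 1::2] = (low - high_h + high_v - high_d) / 2.0
    x[1::2, 0::2] = (low + high_h - high_v - high_d) / 2.0
    x[1::2, 1::2] = (low - high_h - high_v + high_d) / 2.0
    return x
```

Because the transform is invertible, each band can be processed separately and the result mapped back to the original resolution with `haar_idwt2`.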

Abstract:
Prompt tuning has proven to be an effective alternative for fine-tuning the pre-trained vision-language models (VLMs) to downstream tasks. Among existing approaches, class-shared prompts learn a unified prompt shared across all classes, while sample-specific prompts generate distinct prompts tailored to each individual sample. However, both approaches often struggle to adequately capture the unique characteristics of underrepresented classes, particularly in imbalanced scenarios where data for tail classes is scarce. To alleviate this issue, we propose an attribute-aware prompt tuning framework that promotes a more balanced understanding for imbalanced tasks by explicitly modeling critical class-level attributes. The key intuition is that, from the perspective of class, essential attributes tend to be relatively consistent across classes, regardless of sample sizes. Specifically, we build an attribute pool to learn potential semantic attributes of classes based on VLMs. For each input sample, we generate a unique attribute-aware prompt by selecting the relevant class attributes from the pool through a matching mechanism. This design enables the model to capture essential class semantics and generate informative prompts, even for classes with limited data. Additionally, we introduce a ProAdapter module to facilitate the transfer of foundational knowledge from VLMs while enhancing generalization to underrepresented classes in imbalanced settings. Extensive experiments on standard and imbalanced few-shot tasks demonstrate that our model achieves superior performance, especially on tail classes.

Abstract:
Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds the optimization incorporating the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior in hyperspectral image reconstruction. Furthermore, to mitigate the noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN
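
The abstract does not state how ATST selects tokens. Purely as an illustration of filtering weakly correlated tokens before self-attention, the sketch below keeps, for each query, only the `keep` keys with the highest scaled dot-product scores and applies softmax over those. The top-k rule, the `keep` parameter, and the function name are assumptions rather than the paper's design.

```python
import numpy as np

def topk_attention(q, k, v, keep):
    """Single-head attention where each query attends only to its `keep` best keys.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v). Scores below the per-row
    `keep`-th largest value are masked out before the softmax.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    kth_largest = np.sort(scores, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth_largest, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                               # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Setting `keep` equal to the number of keys recovers ordinary dense attention, which gives a convenient baseline for comparison.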

Abstract:
Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, and the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our framework significantly boosts HDR imaging quality for existing methods without altering the network architecture.

Abstract:
In recent years, fusing high-resolution multispectral images (HR-MSIs) and low-resolution hyperspectral images (LR-HSIs) has become a widely used approach for hyperspectral image super-resolution (HSI-SR). The deep unfolding framework has attracted significant attention thanks to its ability to formulate the problem into a data module and a prior module. However, there are still two critical issues that hinder the performance enhancement of the existing methods: 1) Parameters in the data module are fixed (though learnable) at each iteration, i.e., lacking the adaptivity to comprehensive data; 2) The Transformer in the prior module cannot effectively capture high-frequency information. To resolve these issues, we propose a Content-Adaptive Unfolding Wavelet Transformer (CAUWT) for HSI-SR, where the parameters are adaptively learned based on the reconstructed HSI at each iteration. Moreover, we propose a novel Wavelet-Assisted Transformer (WAT), which integrates the Discrete Wavelet Transform (DWT) and the Hybrid Spectral-Spatial Attention Block (HSSAB) to further improve the high-frequency information quality of the HSI without the cost of extra branch structures, where the former is for multi-scale and multi-frequency details and the latter is for correlations between and within sub-band components. Extensive experiments performed on both simulated and real datasets demonstrate the effectiveness of the proposed method. In comparison with mainstream HSI-SR methods, our method exhibits superior performance and lower computational overhead.

Abstract:
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to the target destination by following a natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., using monocular cameras with a limited field of view, which means “Look Less” for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, where we motivate monocular robots to “Think More” by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former by the proposed 3DGS-based panoramic generation to render novel views at each step, based on past observation collections. We achieve the latter by the proposed enhancement of the occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations provide monocular robots with wider environment perception as well as transparent semantic connections with the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.

Abstract:
Recent advances in surgical robotics and computer vision have greatly improved intelligent systems’ autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video–text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.
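
For readers unfamiliar with Dynamic Time Warping, the sketch below computes the classic DTW alignment cost between two feature sequences using dynamic programming. It shows only the underlying alignment idea: how HierSKP turns DTW into a training loss is not described in the abstract, and the function name and Euclidean frame cost are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between sequences a (n, d) and b (m, d).

    1-D inputs are treated as sequences of scalars. The per-step cost is the
    Euclidean distance between frames, and each step may advance either
    sequence or both (the standard step pattern).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a.reshape(len(a), -1)
    b = b.reshape(len(b), -1)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost table
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step_cost = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = step_cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```

Two sequences that show the same content at different speeds receive a low cost, which is the property a temporal-alignment objective relies on.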

Abstract:
Knowledge transfer aims to apply existing knowledge to different tasks or new data, and it has extensive applications in multi-domain and multi-task learning. The key to this task is quickly identifying a fine-grained object for knowledge sharing and efficiently transferring knowledge. Current methods, such as fine-tuning, layer-wise parameter sharing, and task-specific adapters, only offer coarse-grained sharing solutions and struggle to effectively search for shared parameters, thus hindering the performance and efficiency of knowledge transfer. To address these issues, we propose Channel-Wise Parameter Sharing (CWPS), a novel fine-grained parameter-sharing method for knowledge transfer, which is efficient for parameter sharing, comprehensive, and plug-and-play. For the coarse-grained problem, we first achieve fine-grained parameter sharing by refining the granularity of shared parameters from the level of layers to the level of neurons. The knowledge learned from previous tasks can be utilized through the explicit composition of the model neurons. Besides, we propose an effective search strategy to minimize computational costs, simplifying the selection of shared weights. In addition, our CWPS has strong composability and generalization ability, which theoretically can be applied to any network consisting of linear and convolution layers. We introduce several datasets in both Incremental Learning and Multi-Task Learning scenarios. Our method has achieved state-of-the-art precision-to-parameter ratio performance with various backbones, demonstrating its efficiency and versatility.

Abstract:
Intermediate flow estimation is an important part of video frame interpolation (VFI). Most previous works use interpolation to derive the intermediate flow assuming localized linear motion. However, this method is not effective when dealing with extreme motions. In this work, we assume that the motion trajectory of an object is determined by the appearance characteristics of this object. Based on this assumption, we propose a new intermediate flow estimation method, which obtains the motion features of intermediate frames from image appearance and inter-frame motion features. In addition, in order to fully extract the inter-frame features, we rethink the difference of VFI and previous works on using Swin-Transformer and compute the appearance features and motion features within the adaptive neighborhood by cyclically shifting the window. Experimental results show that our method achieves state-of-the-art performance on different datasets for both fixed-time and arbitrary-time interpolation. Moreover, our proposed method outperforms models that require inputting a sequence of four frames when handling videos with extremely large motion. The source code is available from https://github.com/chen12304/IFE-VFI

Abstract:
A few recent works attempt to train an adversarially robust Unsupervised Domain Adaptation (UDA) model, transferring the robustness from a robust source model or other robust pre-trained models to an unlabeled target domain. However, it is usually impractical to assume the availability of robust source models or robust pre-training, and meanwhile, source data are not always accessible or efficient for adaptation training in many real-world scenarios. In this paper, we dive into a more practical and challenging problem of robust source-free domain adaptation: can we train a robust model on an unlabeled target domain given only a non-robust source model (without source data)? Empirically, we find that applying adversarial training (AT) to the self-supervised adaptation process leads to severe model degradation, as it tends to amplify the inevitable errors of UDA models. To tackle this issue, we propose a novel approach called Source-Free Alternating Optimization (SFAO), which employs a non-robust target model to provide better guidance for the AT of the desired robust target model. The two models are trained in an alternating manner to minimize the discrepancy between the clean source domain and the adversarial target domain. Moreover, we propose Softly-Constrained Adversarial Training (SCAT) to further mitigate the adverse effects of incorrect pseudo-labels in AT. Extensive experimental results demonstrate that the proposed method significantly improves the model performance on both clean and adversarial data. Source code is available at: https://github.com/Coxy7/robust-SFDA.

Abstract:
Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.

Abstract:
Cross-modal retrieval facilitates more flexible information access and improves semantic understanding across different modalities. However, traditional cross-modal retrieval models rely on well-aligned datasets, which are often labor-intensive and costly to obtain. In real-world applications, data inevitably includes mismatched pairs, and these semantically inconsistent pairs can significantly degrade retrieval performance. Previous approaches have assumed ideal loss value distributions to optimize models for accurate semantic matching through soft-label estimation. However, the absence of hierarchical semantic correlation learning limits the effectiveness of these models in scenarios involving partial mismatches. To address these challenges, we propose Exploring Hierarchical Cross-Modal Correlation Consistency (EH3C) for cross-modal retrieval under partially mismatched conditions. Specifically, our approach first leverages neighborhood correlation distributions among samples to optimize cross-modal alignment, without assuming ideal distributions. This allows for the measurement of soft matching degrees between cross-modal data pairs and facilitates the effective learning of their positive correlations. Next, we enhance inter-class separability through intra-modal correlation learning by exploiting negative correlations between reliable negative sample pairs, thus enabling a more comprehensive exploration of cross-modal correlations. Finally, to assess the effectiveness and robustness of our approach, we conducted extensive experiments on three benchmark datasets. The results demonstrate that the proposed EH3C significantly improves cross-modal retrieval performance in scenarios involving partial mismatches.

Abstract:
Continual image super-resolution (CISR) aims to efficiently adapt a pre-trained model to a variety of tasks while retaining knowledge from previously learned tasks, minimizing the need for intensive independent training. The primary challenges include catastrophic forgetting due to varying data distributions and degradation types, along with the necessity for high adaptability. While prompt-based continual learning has proven effective in image classification, its direct application to super-resolution (SR) often fails to meet the demands for detailed pixel-level restoration and domain discrimination in low-level characteristics. To address these challenges, we propose Learning Prompt Adapters (LPA), which dynamically generates pixel-wise prompts through a combination of multi-granularity prompt bases and identities. By adaptively integrating these prompts into the Transformer architecture, we effectively improve the model’s performance on fine-grained details in super-resolution tasks, as well as enhancing the model’s adaptability to new tasks and preserving knowledge from previous ones. Through organizing the low-rank prompt bases with specific identities, we set up an effective solution to managing cross-task differences and enhancing prompt richness. Extensive experiments on benchmarks comprising the NYU, RealSR, DIV2K, REDS, and MANGA109 datasets with diverse degradation types demonstrate that LPA significantly outperforms existing continual learning methods. Codes of this paper are available at: https://github.com/dummerchen/LPA

Affiliations: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; College of Artificial Intelligence and the Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China

Abstract:
Functional brain connectivity networks capture complex relationships and temporal evolution between brain regions, which have become increasingly important for diagnosing neurological disorders. However, existing methods, which are primarily based on vector or graph representations, struggle to adequately characterize the intricate spatio-temporal topological architecture of functional brain networks. Additionally, they predominantly rely on data-driven paradigms and lack priors pertaining to cross-windows network interactions. To address these issues, we propose a spatio-temporal hypergraph attention network framework for brain network analysis. Specifically, we first propose a temporal attention network architecture embedded with temporal similarity-driven prior knowledge, which effectively extracts long-range dependency information from fMRI by combining multi-head self-attention mechanisms and cross-window temporal prior knowledge. Second, we design a hierarchical hypergraph generation module that fuses local and global brain topological information to achieve multi-scale modeling of high-order spatio-temporal structures. Additionally, the spatial attention network, developed based on transformer architecture, employs hypergraph message passing mechanisms to effectively construct multi-level spatial interaction relationships between brain regions. Finally, a multi-layer perceptron (MLP) is adopted for classification. Experiments on the ADNI and PD datasets demonstrate that our method outperforms several state-of-the-art approaches in diagnostic performance and provides discriminative graph features for relevant brain disease diagnosis.

Abstract:
Transfer subspace learning plays a critical role in unsupervised domain adaptation by establishing a shared embedding space where source domain data can be linearly reconstructed to match the target domain distribution. While existing methods exploit the low-rank structure of the reconstruction matrix, they frequently overlook the alignment of cross-domain joint probability distributions in the learned low-dimensional subspace. To address these challenges, we propose a novel non-convex transfer learning method named DATSL, which employs embedded distribution alignment. Our DATSL incorporates a non-convex regularizer to approximate low-rank constraints, capturing the complex characteristics of the rank function by minimizing the k smallest singular values of the reconstruction matrix. To align the joint distributions across domains, a category-aware joint distribution alignment mechanism extracts more discriminative representations and enhances subspace discriminability through label-informed covariance matching. Besides, DATSL is extended to a graph-based variant GDATSL, which incorporates manifold-preserving constraints via Laplacian regularization to maintain intrinsic data topology during knowledge transfer. Furthermore, we develop an efficient iterative optimization algorithm with proven convergence to solve the formulated non-convex minimization problems. Extensive experimental results on several public datasets demonstrate the effectiveness of our proposed methods in comparison to other state-of-the-art approaches.
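
The low-rank surrogate described above can be illustrated with a few lines of NumPy. The sketch evaluates the sum of the `k` smallest singular values of a matrix, which is zero whenever the rank is at most min(m, n) - k, whereas the nuclear norm penalizes every singular value. The function name is hypothetical, and the paper's exact regularizer and optimization scheme may differ.

```python
import numpy as np

def smallest_singular_sum(z, k):
    """Sum of the k smallest singular values of z (0 <= k <= min(z.shape))."""
    s = np.linalg.svd(z, compute_uv=False)  # singular values in descending order
    return float(s[len(s) - k:].sum())
```

With `k` equal to min(m, n) the quantity coincides with the nuclear norm, so `k` controls how many of the leading singular values are left unpenalized.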

Abstract:
Occluded person re-identification aims to address the identification challenges posed by pedestrians obscured by other individuals or objects. Existing methods often rely on incorporating pose or semantic information to improve model performance under occlusion. However, such information often depends on external models with inevitable cross-domain gaps, whose stability is limited in complex occlusion environments and which are prone to false results. In this paper, we propose a Transformer-based uncertainty-driven Gaussian model, termed UD-Gaussian. First, to enrich the detailed features of pedestrian images, a high-frequency enhancement module is introduced. The high-frequency components of the pedestrian image are extracted by the Discrete Haar Wavelet Transform, and the Top-K high-frequency patches are used to construct a graph Laplacian matrix to achieve high-frequency graph attention, which is fused with features learned from self-attention to enhance the high-frequency feature representation. Because the uncertainty in pedestrian feature learning induced by occlusion makes it challenging to obtain reliable and stable pedestrian features, we propose a probability distribution learning module. This module establishes a memory bank to build a Gaussian distribution for each pedestrian identity, and entropy is introduced as a loss function to encourage the model to generate more deterministic and relatively independent probability distributions, thereby enhancing the discriminative ability of the model across different pedestrian identities. The high-frequency enhancement module provides a solid foundation for the probability distribution learning module, alleviating uncertainty caused by the pedestrian images themselves. Experimental results on occluded and holistic person re-identification datasets demonstrate the superiority of the proposed method.

Abstract:
Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities on degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space struggles to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE into a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git

Abstract:
Federated learning (FL) enables privacy-preserving collaboration among distributed clients, but practical deployments often face heterogeneous models and non-IID data, leading to degraded communication and personalization. In addition, real-world FL systems frequently encounter newly joined clients that require rapid adaptation and abnormal clients that may upload corrupted updates, further exacerbating instability and hindering global convergence. To address these challenges in image classification, we propose HFedDGHN, a Heterogeneous Federated Dynamic Graph HyperNetwork that jointly models inter-client relations and personalized parameter generation. Specifically, a graph structure learner adaptively captures client correlations to construct a dynamic collaboration graph, while a graph-convolutional hypernetwork generates model parameters for heterogeneous architectures, enabling implicit knowledge transfer without sharing local data or weights. Moreover, the framework naturally supports meta-learning-based generalization, allowing efficient adaptation to newly joined clients. Furthermore, the dynamic graph enhances robustness by isolating abnormal clients, as they tend to be excluded from most neighborhoods during adaptive graph construction. Extensive experiments across multiple benchmarks demonstrate that HFedDGHN achieves superior accuracy compared to state-of-the-art personalized and heterogeneous FL methods, while naturally improving robustness and scalability in real-world deployments.

Abstract:
Remote sensing object detection requires precise identification of multi-scale and multi-directional targets in complex backgrounds, demanding a model that achieves both high accuracy and real-time performance. While knowledge distillation proves effective for compressing natural image models, it exhibits limitations in more realistic remote sensing scenarios, including inadequate adaptability, biases from long-tail data distributions, and the propagation of errors from the teacher model. To address these challenges, we propose a Prompt Driven Knowledge Distillation (PDKD) framework for remote sensing object detection. This framework leverages prompt-based mechanisms to guide the student model in effectively acquiring and assimilating the teacher’s knowledge, and integrates three core components: (1) a Scale-Decoupled Feature Prompting (SDFP) module that dynamically adjusts feature representation capabilities through scale decoupling, enabling differentiated distillation for targets of varying scales; (2) a Semantic Visual Co-Prompting (SVCP) module that, based on CLIP’s multimodal prior knowledge, constructs category-specific semantic prompt vectors to enhance the focus on features of long-tail categories; (3) a Self-Correcting Prompting (SCP) module that suppresses error propagation through a cross self-distillation mechanism. The experiments on the DOTA dataset show that with a 1x training schedule, the model achieves 49.0% mAP. Source codes are available at https://github.com/Ningsui/PDKD.git

Abstract:
Image restoration represents a promising approach for addressing the inherent defects of image content distortion. Standard image restoration approaches suffer from high storage cost and the requirement of a known degradation pattern, including type and degree, which can barely be satisfied in dynamic practical scenarios. In contrast, all-in-one image restoration (AiOIR) eliminates multiple degradations within a unified model to circumvent the aforementioned issues. However, our causal analysis reveals that two significant defects still undermine the effectiveness and generalization of AiOIR models: 1) the spurious correlation between non-degradation semantic features and degradation patterns; 2) the biased estimation of degradation patterns. To obtain the true causation between degraded images and restored images, we propose the Causal-deconfounding Wavelet-disentangled Prompt Network (CWP-Net) to perform effective AiOIR. CWP-Net introduces two modules for decoupling, i.e., a wavelet attention module in the encoder and a wavelet attention module in the decoder. These modules explicitly disentangle the degradation and semantic features to tackle the issue of spurious correlation. To address the issue stemming from the biased estimation of degradation patterns, CWP-Net leverages a wavelet prompt block to generate the alternative variable for causal deconfounding. Extensive experiments on two all-in-one settings demonstrate the effectiveness and superior performance of our proposed CWP-Net over the state-of-the-art AiOIR methods.

Abstract:
Deep unfolding networks have gained significant attention for magnetic resonance imaging super-resolution (MRI SR) due to their performance and interpretability. However, 1) existing methods predominantly focus on cross-contrast correlations while neglecting high-order correlations embedded within spatially adjacent slices in volumetric MRI data; 2) their degradation models are optimized via the proximal gradient algorithm (PGA), which relies on manually designed hyperparameters (e.g., step size), often leading to overshooting or suboptimal solutions. To address these limitations, we propose HocMRI, a deep unfolding multi-contrast MRI SR framework, which seamlessly integrates dual-prior modeling and hyperparameter-free PGA for enhanced reconstruction. Specifically, we first design a novel degradation model based on the dual-prior mechanism: an explicit prior based on low-rank tensor factorization to capture intra- and inter-slice dependencies, and an implicit prior leveraging a Mamba-based network with a novel 3D scanning strategy to further exploit high-order correlations across slices. Then, we derive a hyperparameter-free PGA to boost the traditional PGA, which employs a hyperbolic tangent function to dynamically control the gradient descent step, eliminating manual tuning while ensuring stable convergence with theoretical proofs. Based on the hyperparameter-free PGA, we develop an efficient iterative optimization algorithm to solve the degradation model and unfold it into a multi-stage deep network. Extensive experimental results on widely used MRI datasets demonstrate that our HocMRI achieves superior performance with enhanced efficiency compared to the state-of-the-art methods.
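
To make the idea of a tanh-controlled step concrete, the following is a minimal sketch on a toy problem and not the HocMRI algorithm itself. It assumes a hypothetical step rule t = tanh(||g||) / ||g||, which keeps every gradient move shorter than 1 without a hand-tuned step size, and applies it to proximal gradient descent on 0.5 * ||x - b||^2 + lam * ||x||_1. The names `tanh_step` and `tanh_pga`, the objective, and the step rule are all illustrative assumptions.

```python
import math


def soft_threshold(v, tau):
    """Proximal operator of tau * |v| (scalar soft-thresholding)."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def tanh_step(grad, eps=1e-12):
    """Hypothetical step rule: t = tanh(||g||) / ||g||, which lies in (0, 1]."""
    norm = math.sqrt(sum(g * g for g in grad))
    return 1.0 if norm < eps else math.tanh(norm) / norm


def tanh_pga(b, lam, iters=200):
    """Minimise 0.5 * ||x - b||^2 + lam * ||x||_1 with a tanh-bounded step."""
    x = [0.0] * len(b)
    for _ in range(iters):
        grad = [xi - bi for xi, bi in zip(x, b)]  # gradient of the smooth term
        t = tanh_step(grad)                       # step chosen from the gradient norm
        x = [soft_threshold(xi - t * gi, t * lam) for xi, gi in zip(x, grad)]
    return x


solution = tanh_pga([3.0, -0.2, 1.5], lam=0.5)
```

For this toy objective the minimiser is the soft-thresholded vector [2.5, 0.0, 1.0], so `solution` can be checked against it. Because tanh is bounded, a large gradient cannot produce an arbitrarily large move, which is the overshooting behaviour the abstract attributes to manually chosen step sizes.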

Abstract:
Recent advances in LiDAR representation learning with limited annotations show strong promise. Existing well-performing methods mainly focus on distilling the 2D representation into the 3D representation via superpixels. Superpixels are used to construct the cross-modal contrastive learning, leading to semantic ambiguity of 3D features belonging to the same object and impairing the performance. To this end, we aim to leverage unlabeled LiDAR-camera pairs to design a novel pre-training pipeline, which learns from category space directly and pulls the 3D features belonging to the same object close. Specifically, we obtain autolabeled 2D object boxes with a fixed 2D open-vocabulary object detector and transform the labeled 2D object boxes into high-quality pixel-wise label maps with a box-to-label-maps generation algorithm. Based on the pseudo labels, we present a dual-space pre-training 3D network that recognizes accurate categories from the semantic priors of paired 3D points and segments complete objects. Furthermore, we propose a module named AdaptPro to improve performance further when fine-tuning the 3D network under limited annotations, aiming to explore the unpaired 3D features that lack 2D correspondences via category prototypes. The experimental results show that our method achieves state-of-the-art performance on both the nuScenes and SemanticKITTI benchmark datasets. Code is available at https://github.com/dengq7/Box4Scene

Affiliations: School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China; New Laboratory of Pattern Recognition (NLPR) and the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China

Abstract:
Gesture recognition is an important research area in the field of computer vision. Most existing efforts focus on closed-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay framework for data-free class-incremental learning. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay for old classes, Truncated Cross-Entropy for new classes, and Continual Classifier Re-Training. To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the Variational Prototype Replay enforces consistency between the classifier’s weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The Truncated Cross-Entropy mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the Continual Classifier Re-Training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available at https://github.com/sunao-101/PGPFR-3/

Abstract:
Spectral reconstruction (SR) aims to recover high-quality hyperspectral images (HSIs) from more readily available RGB or multispectral images (MSIs). While supervised SR has shown promising results, it is hindered by the difficulty of collecting abundant, well-registered RGB-HSI or MSI-HSI pairs. Semi-supervised SR (Semi-SR) offers a more practical solution by exploiting plentiful RGBs/MSIs together with limited HSIs. However, existing Semi-SR approaches still suffer from cross-domain discrepancies, cross-modality inconsistency, and unreliable pseudo-labels. To tackle these challenges, we propose a Manifold-aware Teacher-Student Semi-SR (MTSSR) framework, which seamlessly integrates labeled and unlabeled domains through a teacher-student paradigm and memory-efficient consistency learning. At its core, a Flexible Cross-attention Spectral Reconstruction (FCSR) network extracts scene-related spatial cues via customized self-attention and models scene-agnostic priors through dynamic quantization, thereby enhancing spectral fidelity. Furthermore, a manifold-aware dimensionality analysis derives a latent space that jointly captures spatial and spectral structures across modalities. This enables a manifold-aware alignment loss to enforce cross-modality consistency and a manifold-aware contrastive loss to progressively refine pseudo-label reliability. In addition, we develop a Threshold-adjusted Memory Bank Update (TMBU) strategy, which generates reliable negative samples by storing network-driven representations instead of memory-consuming HSIs, significantly reducing memory consumption. Extensive experiments on three visual and two remote sensing benchmarks demonstrate that MTSSR consistently outperforms state-of-the-art SR methods, achieving robust and memory-efficient spectral reconstruction.

Abstract:
Previous multimodal visual-tactile image representation learning (VTL) methods have achieved significant success in object understanding through large-scale training data. However, obtaining sufficient training data is often infeasible, and the above methods struggle to effectively focus on discriminative visual and tactile features with limited data, resulting in degraded performance. To solve the above issue, we introduce a new task called visual-tactile image representation learning with limited data (VTL-L), which better facilitates real-world applications. To address the challenges of limited data and modality discrepancy in the VTL-L task, we propose a novel multi-order feature enhancement-based, alignment-free fusion network (MOA-Net). First, we introduce a multi-order feature enhancement (MFE) module to hierarchically strengthen the detailed and structural representation by aggregating the low- and high-order topological information. This approach can effectively reduce the attention noise and obtain discriminative features with limited data. Then, we propose the alignment-free visual-tactile fusion (AVTF) module to achieve representative spatial and channel features and perform the cross-modality fusion without alignment, which efficiently mitigates the modality discrepancy. Finally, we develop a dual counterfactual intervention (DCI) loss to jointly optimize fused visual-tactile feature and probability distributions, thereby improving the performance of the MOA-Net in the VTL-L task. Extensive experiments demonstrate the superiority of the proposed method across three types of tasks on four datasets under diverse limited-data settings (source code available at: https://github.com/liuxiangqiu007/MOA-Net).

Abstract:
Due to the complexity and diversity of practical environments, real-world image dehazing remains an unresolved problem, with one of the key challenges being how to bridge the distribution gap between synthetic and real domains. This paper proposes a Prompt-driven Domain Adaptation (PDDA) framework from a bi-level optimization perspective. Specifically, we introduce hyperparameter optimization-based bi-level modeling: the lower-level optimization emphasizes prior learning within the synthetic domain to stabilize dehazing performance, while the upper-level optimization focuses on enhancing cross-domain adaptability to ensure that the model can generalize across different domains. Given the scarcity of paired real haze images, we train learnable haze prompts by jointly optimizing the text-image similarity between positive/negative prompts and corresponding clear/haze images in the CLIP latent space to more effectively capture real-world haze characteristics. Based on the learned haze prompts, we construct an unsupervised cross-domain loss function that enhances the adaptability to complex real-world scenarios by integrating prompt learning with a bi-level optimization strategy. Furthermore, we conduct a comprehensive exploration to uncover the inherent properties of PDDA, including architecture-irrelevant flexibility and domain-agnostic robustness. Extensive experiments across a wide range of benchmark datasets demonstrate that our method achieves both quantitative and qualitative improvements across diverse scenarios, showing robust performance not only in real-world daytime conditions but also exhibiting superior cross-domain adaptation capabilities in nighttime scenarios. Code is available at https://github.com/YanZhang-zy/PDDA.git

Abstract:
Ship object detection faces the challenge that localization becomes more difficult in hazy environments. Additionally, the latest convolutional neural networks (CNNs) cannot obtain satisfactory detection results. Therefore, we propose ERDNet, a dual-branch-driven end-to-end network, to improve ship detection accuracy during hazy weather. Specifically, we first design a two-branch feature extraction network with complementary attentional fusion to enhance the object feature information of low-quality images. Second, we design a feature pyramid fusion structure called ERPSA-PAN to aggregate context information effectively. ERPSA-PAN improves the feature fusion capability of the model by suppressing background interference and enhancing useful information. In addition, the spatial-frequency fusion block (SFFB) module with expanded receptive fields is added to the ERDNet detection head to improve the detection ability for multiscale targets. More importantly, we design a robust haze loss to handle different degrees of haze. We introduce two new haze ship datasets, Hazy-SeaShips and Hazy-Boats, which include 17,000 synthetic haze images and 2,898 real haze images, respectively, to address the lack of hazy ship image datasets. The images cover variations such as haze thicknesses, ship types, and scales, along with complex backgrounds and occlusions. The experimental results show that the proposed method is superior to other state-of-the-art (SOTA) methods and achieves relatively competitive results. The source code and datasets are available at https://github.com/ZikHH/ERDNet

Affiliations: Laboratory of Advanced Theranostic Materials and Technology, Ningbo Institute of Materials Technology and Engineering, and Ningbo Cixi Institute of Biomedical Engineering, Chinese Academy of Sciences, Ningbo, China; Department of Computer Science, Edge Hill University, Ormskirk, U.K.; School of Software, Shandong University, Jinan, China; Department of Eye and Vision Sciences, University of Liverpool, Liverpool, U.K.; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; Institute of High Performance Computing, A*STAR, Singapore

Abstract:
Quantitative analysis of retinal vascular morphology is vital for clinical decision-making and the investigation of systemic diseases. Central to this process is the accurate segmentation of retinal arteries and veins (A/V) from the background, a task challenged by substantial variations in vessel calibers and the presence of low-contrast or ambiguous structures in fundus images, especially in ultra-wide field imaging where peripheral distortions and large-scale anatomical variability are pronounced. These factors often lead to fragmented semantic representations and topological inconsistencies in automated segmentation outputs. To address these limitations, we propose Ultra, a multi-granularity topological reasoning network designed for precise A/V segmentation. Ultra adopts a cascaded two-stage architecture: PriorNet generates coarse, multi-scale vascular priors that provide structural guidance, while RefineNet performs topology-aware segmentation refinement. To further enforce topological coherence, we propose the neighboring pixel connectivity regularization (NICER) layer, which selectively integrates local connectivity information predicted by the proposed connectivity prediction union (CPU) module. This connectivity is employed as auxiliary supervision through a pixel-wise local connectivity loss, reinforcing structural reasoning and promoting anatomically consistent vascular topology inference. Extensive experiments on ultra-wide field fundus imaging (UWF) datasets demonstrate that Ultra achieves state-of-the-art performance in A/V segmentation and topological preservation. Moreover, Ultra generalizes well to conventional color fundus photography (CFP) datasets, underscoring its robustness and broad applicability. Code is publicly available at: https://github.com/iMED-Lab/Ultra

Abstract:
Recently, Vision Transformers (ViTs) have become the state-of-the-art architecture on various computer vision tasks including image classification, object detection and semantic segmentation. However, such success in high-accuracy performance comes at the price of high computational complexity, with typically tens of millions of parameters or more in a Vision Transformer (ViT) model. Such a large volume of parameters makes it very difficult to deploy ViT models on mobile devices and hinders their application. In this paper, we present a novel post-training quantization approach that is able to quantize ViT models to very low bit widths, without the need for re-training. Prior works on post-training quantization for ViTs optimize the quantization of each layer separately, thus leading to sub-optimal results. In contrast, we propose a unified learning framework that jointly optimizes the quantization of all layers to directly reduce the overall output error of the network. Moreover, we explore an important property of ViTs, i.e., the additivity property, revealing that the output error caused by the quantization of multiple layers equals the sum of the output error due to the quantization of each layer. Utilizing this property, we present a very efficient algorithm to solve the joint optimization problem with linear time complexity. We performed extensive experiments on the large-scale ImageNet dataset to evaluate the effectiveness of our approach. Empirical results show that our approach noticeably improves on the state of the art on various ViT models and lowers the bit width from 8-bit to 6-bit without hurting the accuracy. Specifically, at 4 bits, our approach significantly outperforms existing works by 1.72%, 11.49%, 6.15%, and 3.54% on ViT-S, ViT-B, DeiT-S, and DeiT-B, respectively. Finally, we evaluate the performance when deploying our quantized models on hardware. Our approach achieves 1.5× to 1.7× speedups for inference on an NVIDIA A100 GPU.
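
The additivity idea can be illustrated on a toy model; the sketch below is an assumption-laden illustration and does not reproduce the paper's ViT analysis or its optimization algorithm. It uses a hypothetical two-layer linear model with element-wise weights and a plain symmetric uniform quantizer (`quantize` and `forward` are illustrative names), then compares the output error from quantizing both layers with the sum of the errors from quantizing each layer alone. In this toy setting the two agree up to a small cross term.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def forward(w1, w2, x):
    """Toy two-layer linear model: y = sum_i w2[i] * w1[i] * x[i]."""
    return sum(b * a * xi for a, b, xi in zip(w1, w2, x))


w1 = [0.31, -0.52, 0.77, 0.12]
w2 = [0.45, 0.83, -0.26, 0.64]
x = [1.0, 2.0, -1.0, 0.5]

y_ref = forward(w1, w2, x)
q1, q2 = quantize(w1, 4), quantize(w2, 4)

err_layer1 = forward(q1, w2, x) - y_ref      # only layer 1 quantized
err_layer2 = forward(w1, q2, x) - y_ref      # only layer 2 quantized
err_both = forward(q1, q2, x) - y_ref        # both layers quantized
gap = err_both - (err_layer1 + err_layer2)   # product of the two perturbations
```

When per-layer errors add up in this way, the effect of a bit-width choice for one layer can be estimated independently of the others, which is what makes a joint search over all layers tractable.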

Affiliations: Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates; School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Department of Computer Science and the G Research Center, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Tongji University, Shanghai, China

Abstract:
Action Quality Assessment (AQA) has gained significant attention due to its potential real-world applications, which require a fine-grained understanding of action sequences. Recent works have attempted to utilize multimodal video features and address some existing challenges. However, these approaches primarily focus on leveraging textual information from language models only, leading to instability and suboptimal performance due to directional bias in a vision-language joint embedding space. To tackle these issues, we propose a Vision-Language Collaboration Representation Learning approach (VLC-Net) to understand fine-grained action sequences and create a unified feature representation along with their temporal dependencies for accurate AQA score prediction. Specifically, we design a bidirectional knowledge distillation operation to perform collaboration learning between vision-language pre-trained knowledge and visual action knowledge for fine-grained action feature learning. Furthermore, we design vision-language alignment guidance to explicitly align action features with the same action semantics across modalities, thereby unifying their joint representation. Leveraging these aligned features, we propose multimodal contrastive learning to relate modalities and align subactions with textual descriptions, ensuring accurate action representation. We conduct experiments on the FineDiving, MTL-AQA, FineFS, and Fis-V datasets, demonstrating the effectiveness of our approach, which outperforms state-of-the-art methods.

Abstract:
Recent advances in zero-shot text-to-3D generation have revolutionised 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I models. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
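
One way to read "aligning cosine similarities with azimuth relationships" is as an ordering constraint: a view that is closer in azimuth to an anchor should be at least as similar to it as a view that is farther away. The sketch below implements that reading as a hinge penalty. It is a hypothetical illustration, not ConsDreamer's loss: the function `partial_order_loss`, the 2D toy features, and the margin are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def azimuth_gap(a, b):
    """Smallest angle in degrees between two azimuths."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def partial_order_loss(azimuths, feats, margin=0.0):
    """Hinge penalty whenever a farther view is more similar than a nearer one."""
    total, count = 0.0, 0
    n = len(azimuths)
    for a in range(n):            # anchor view
        for j in range(n):        # candidate nearer view
            for k in range(n):    # candidate farther view
                if a in (j, k) or j == k:
                    continue
                if azimuth_gap(azimuths[a], azimuths[j]) < azimuth_gap(azimuths[a], azimuths[k]):
                    near = cosine(feats[a], feats[j])
                    far = cosine(feats[a], feats[k])
                    total += max(0.0, far - near + margin)
                    count += 1
    return total / count if count else 0.0


azimuths = [0.0, 90.0, 180.0]
consistent = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # features rotate with the view
janus = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]        # back view repeats the front view

loss_consistent = partial_order_loss(azimuths, consistent)
loss_janus = partial_order_loss(azimuths, janus)
```

The Janus-like feature set, where the back view looks like the front view, violates the ordering and is penalised, while the consistent set incurs no penalty.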

Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA), State Key Laboratory of Multimodal Artificial Intelligence Systems, Beijing, China; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence and the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China; School of Mathematical Sciences and the Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China (UESTC), Chengdu, Sichuan, China

Abstract:
Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: 1) spatial and spectral kernels are derived from their respective image sources and 2) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves state-of-the-art (SOTA) performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D

Affiliations: Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada; NKIARI, Shenzhen, Futian, China; eBay Inc., San Jose, CA, USA; East China Normal University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Research Center for Space Computing System, Zhejiang Lab, Zhejiang, China; School of Computer Science and Technology, Tianjin University, Tianjin, China; CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Fusionopolis Way, Singapore; The University of Tokyo, Tokyo, Bunkyo, Japan

Abstract:
In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images capture the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under the TAMP setup differs significantly in content from the target image and may itself be damaged. Such a situation arises frequently in daily life, where a damaged image is restored by referring to another image whose source and quality cannot be guaranteed. In particular, our study finds that even SOTA reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module, which interactively complements the time-variant images with appropriate semantics, thus facilitating the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for the TAMP task, we assemble a new dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.

Abstract:
Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.

Abstract:
Camouflaged object detection involves identifying camouflaged objects visually blended into the surroundings, holding crucial significance in various visual applications. Existing methods primarily focus on leveraging boundary information to enhance camouflaged object detection. However, they often overlook the background interference near the object boundaries, which leads to coarse boundary predictions and results in suboptimal detection performance. In this paper, to address this problem, we propose GBNet, a gated boundary-aware network designed to enhance boundary precision and improve overall detection performance. Specifically, GBNet incorporates a boundary-enhanced module that selectively filters extraneous background information through a boundary gate block, ensuring the generation of high-quality boundary information. Additionally, a boundary-aware decoder is designed to enrich the representation ability of the decoder by injecting high-quality boundary features and aggregating contextual features. With meticulous design, GBNet excels in accurately segmenting camouflaged objects in challenging scenarios. Extensive experiments demonstrate that GBNet outperforms 19 state-of-the-art methods significantly across four widely-used benchmark datasets. The source code is publicly available at https://github.com/wooownn/GBNet

Abstract:
Mainstream 3D masked point modeling methods for representation learning typically employ predefined, fixed-ratio random or block masking strategies, aiming to obtain optimal representations and achieve high downstream performance. However, these empirical designs overlook the significant differences in geometric information and structural importance that are inherent among different 3D points, leading to a suboptimal trade-off between the representation capture capabilities and reconstruction difficulty of such masking strategies. To address this issue, we are the first to present this decision-making problem to a reinforcement learning agent and propose a Reinforcement Masked Autoencoder for 3D representation learning, named Point-RMAE. Guided by geometric features as state factors, this method leverages the Masking Strategy Analyzer and the Dynamic Masking Generator to adaptively decide and apply the masking strategy during pretraining. The Masking Ratio Scheduling module dynamically adjusts the masking ratio based on the optimal strategy. Subsequently, the analyzer is updated by multiscale rewards derived from reconstruction quality level, distribution-aware feedback, and policy exploration. Notably, to enrich the reward function with distribution-aware signals and avoid the decision collapse issue, we propose a Flow Matching Point Cloud Fast Generator that guides the selected masking decisions. Our method achieves outstanding performance across downstream tasks such as shape classification, medical diagnosis, object detection, action recognition, denoising and multiscale scene segmentation on ten popular 3D and 4D datasets. More importantly, Point-RMAE pioneers the application of reinforcement learning in 3D self-supervised representation learning.

Abstract:
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model’s ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID

Abstract:
Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior that characterizes the sparsity of physical elements establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for regularization parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios. The source code is available at: https://github.com/LinsongShan/BFCTN

Abstract:
Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts—particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.

Abstract:
Feature reconstruction networks have achieved remarkable performance in few-shot fine-grained classification tasks. Nonetheless, traditional feature reconstruction networks rely on linear regression. This linearity may cause the loss of subtle discriminative cues, ultimately resulting in less precise reconstructed features. Moreover, in situations where the background predominantly occupies the image, the background reconstruction errors tend to overshadow foreground reconstruction errors, resulting in inaccurate reconstruction errors. In order to address the two key issues, a novel approach called the Foreground-Aware Kernelized Feature Reconstruction Network (FKFRN) is proposed. Specifically, to address the problem of imprecise reconstructed features, we introduce kernel methods into linear feature reconstruction, extending it to nonlinear feature reconstruction, thus enabling the reconstruction of richer, finer-grained discriminative features. To tackle the issue of inaccurate reconstruction errors, the foreground-aware reconstruction error is proposed. Specifically, the model assigns higher weights to features containing more foreground information and lower weights to those dominated by background content, which reduces the impact of background errors on the overall reconstruction. To estimate these weights accurately, we design two complementary strategies: an explicit probabilistic graphical model and an implicit neural network–based approach. Extensive experimental results on eight datasets validate the effectiveness of the proposed approach for few-shot fine-grained classification.
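
The foreground-aware weighting idea can be illustrated with a minimal sketch. It assumes per-feature squared reconstruction errors combined as a weighted average, with weights normalized by their sum; the function name, the normalization, and the toy values are illustrative assumptions, not the FKFRN formulation.

```python
def weighted_reconstruction_error(originals, reconstructions, fg_weights):
    """Weighted average of per-feature squared reconstruction errors.

    originals, reconstructions: lists of equal-length feature vectors.
    fg_weights: one non-negative weight per feature; larger means "more foreground".
    The normalization by the weight sum is an assumption made for this sketch.
    """
    total_weight = sum(fg_weights)
    weighted_sum = 0.0
    for original, recon, weight in zip(originals, reconstructions, fg_weights):
        squared_error = sum((a - b) ** 2 for a, b in zip(original, recon))
        weighted_sum += weight * squared_error
    return weighted_sum / total_weight


# Toy case: the first feature (foreground) is reconstructed perfectly,
# the second (background) is reconstructed badly.
originals = [[1.0, 0.0], [0.0, 1.0]]
reconstructions = [[1.0, 0.0], [0.0, 0.0]]

uniform = weighted_reconstruction_error(originals, reconstructions, [0.5, 0.5])
foreground_aware = weighted_reconstruction_error(originals, reconstructions, [0.9, 0.1])
```

With uniform weights the background error dominates the total; down-weighting the background feature shrinks its contribution, which is the effect the abstract describes.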

Abstract:
Existing methods for learning 3D point cloud representation often use a single dataset-specific training and testing approach, leading to performance drops due to significant domain shifts between training and testing data. While recent cross-domain methods have made promising progress, the lack of inherent semantic information in point clouds makes models prone to overfitting specific datasets. As such, we introduce 3D-CFA, a simple yet effective cross-modality feature aggregation method for cross-domain 3D point cloud representation learning. 3D-CFA aggregates the geometry tokens with semantic tokens derived from multi-view images, which are projected from the point cloud, thus generating more transferable features for cross-domain 3D point cloud representation learning. Specifically, 3D-CFA consists of two main components: a cross-modality feature aggregation module and an elastic domain alignment module. The cross-modality feature aggregation module converts unordered points into multi-view images using the modality transformation module. Then, the geometry tokens and semantic tokens extracted from the geometry encoder and semantic encoder are fed into the cross-modal projector to get the transferable 3D tokens. A key insight of this design is that the semantic tokens can serve as a bridge between the 3D point cloud model and the 2D foundation model, greatly promoting the generalization of cross-domain models facing the severe domain shift. Finally, the elastic domain alignment module learns the hierarchical domain-invariant features of different training domains for either domain adaptation or domain generalization protocols. 3D-CFA finds a better way to transfer the knowledge of the 2D foundation model pre-trained at scale, meanwhile only introducing a few extra trainable parameters. Comprehensive experiments on several cross-domain point cloud benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods.

Abstract:
One-shot Text-to-Image Person Re-Identification (One-shot TIReID) aims to construct a TIReID model using only a single labeled image-text pair per identity, along with a large pool of unlabeled person images. While supervised learning in text-to-image person re-identification has demonstrated high effectiveness, the requirement for extensive annotated data, both in terms of identities and corresponding textual descriptions, makes it impractical for large-scale camera networks. One-shot TIReID presents a promising approach to reduce the annotation burden. The primary challenge in one-shot TIReID lies in establishing consistent visual-textual correspondences across diverse viewing conditions, particularly in the absence of cross-view paired data. To address this challenge, we propose a novel progressive discrepancy learning framework, termed P-CLIP, which aims to establish a shared embedding space that is robust to view-specific biases. To achieve this goal, we dynamically construct multi-view image-text pairs based on a single labeled pair and simultaneously project the multi-view data into a unified embedding space. Specifically, we propose a Progressive Multi-View Generation method (MVG) to generate multiple noisy views from a single labeled instance for training. To mitigate cross-view ambiguities, we introduce a Cross-View Discrepancy Learning module (CDL) that leverages the discrepancies among different views to guide the learning of cross-view visual-textual correspondences. This approach effectively integrates multimodal error correction into the person re-identification domain. Furthermore, to enhance the effectiveness of visual-textual correspondence learning, we propose a Compact Cross-Modal Matching Loss (CCM), which suppresses unmatched pairs while emphasizing matched ones. Extensive experiments were conducted on three benchmark datasets, and the experimental results demonstrate the effectiveness of our proposed method. 
The data and codes are available at https://github.com/Itachjw/P-CLIP/tree/main

Abstract:
Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today’s vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a cross-joint entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at https://github.com/niuchuangnn/IMSVD

Abstract:
CLIP (Contrastive Language–Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model’s comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
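
The retrieve-then-rank flow described above can be sketched roughly as follows. The toy embeddings, the `rank_with_mllm` stub, and every name here are hypothetical placeholders rather than the RAR implementation; in the paper the embeddings come from a CLIP-based retriever and the ranking from an MLLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, memory, k):
    """Return the k category names whose stored embeddings best match the query."""
    scored = sorted(memory.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in scored[:k]]


def rank_with_mllm(query, candidates):
    # Hypothetical stand-in for the MLLM ranking step.
    # This stub simply keeps the retrieval order.
    return list(candidates)


def predict(query, memory, k=3):
    """Retrieve k candidates from memory, rank them, and return the top one."""
    candidates = retrieve_top_k(query, memory, k)
    return rank_with_mllm(query, candidates)[0]


# Toy external memory: category name -> embedding (made-up values).
memory = {
    "sparrow": [0.9, 0.1, 0.0],
    "finch": [0.8, 0.3, 0.1],
    "truck": [0.0, 0.2, 0.9],
}
top2 = retrieve_top_k([1.0, 0.0, 0.0], memory, 2)
best = predict([1.0, 0.0, 0.0], memory, k=2)
```

The point of the split is that the memory can hold far more categories than fit in a context window, while the ranker only ever sees the k retrieved candidates.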

Affiliations: School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China; Department of Emergency, Guangdong Provincial Corps Hospital of Chinese People’s Armed Police Force, Guangzhou, China; The Fourth Affiliated Hospital, Guangzhou Medical University, Guangzhou, China; School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; Department of Ultrasound, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou, China; Department of Endocrinology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

Abstract:
Ultrasound imaging and biochemical examinations are the primary methods for diagnosing Hashimoto’s thyroiditis (HT). However, neither of them alone is sufficient to accurately diagnose HT. Most existing multimodal models for HT diagnosis focus primarily on extracting and concatenating features from different modalities, an approach that is ineffective due to the dimensional imbalance of the features between the textual and image data. To address this issue, we propose a novel Multimodal Collaborative Fusion Learning (MCFL) approach, which can enhance and recalibrate the biochemical indicators using ultrasound images, effectively improving the significance and specificity of biochemical indicators for the diagnosis of HT. Specifically, MCFL first constructs a novel INNet to convert the image-level characteristics of the HT ultrasound image into two numerical indicators, i.e., the Local prominent inflammatory (Lpi) and the Global diffuse lesion (Gdl), unifying image data and textual data into a single representation space. Then, a decision tree-based optimization strategy is employed to supervise the training of INNet, interactively recalibrating biochemical indicators with the guidance of the two numerical indicators mentioned above and obtaining a more accurate feature representation of HT. Finally, based on the deep Q-learning framework, a reward mechanism is established to guide the HT diagnostic process, in which the experience replay mechanism and the $\epsilon$-greedy strategy are utilized collaboratively to improve the accuracy and robustness of the model. Extensive experiments are conducted on a multimodal dataset from multiple medical centers, and the results demonstrate that MCFL achieves state-of-the-art performance, setting a new benchmark.
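
For readers unfamiliar with the two deep Q-learning ingredients named above, the following is a generic sketch of an experience replay buffer and epsilon-greedy action selection. It shows the standard techniques only; the class and function names, the transition format, and the capacity are assumptions and do not reflect how MCFL implements them.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the current buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


# Toy usage: transitions are (state, action, reward, next_state) tuples (assumed format).
buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.push((step, 0, 1.0, step + 1))
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
```

Replay decorrelates consecutive training samples, while epsilon controls the trade-off between exploring new actions and exploiting the current value estimates.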

Abstract:
Color distortion and structural degradation in underwater images are classic challenges in underwater image enhancement. The core goal is to restore degraded images to high-quality images with both color and structure that conform to visual perception. However, in the traditional RGB space, these two issues are highly coupled, resulting in existing enhancement methods often neglecting one over the other. To address this challenge, we propose a guided diffusion model based on the principle of decoupling. Our key insight is that in perceptual color spaces such as HSV, color (H, S) and structure (V) are naturally separated. To exploit this property, we first design an adaptive perceptual guidance module, which analyzes the degraded HSV image and generates two orthogonal guidance signals: a color guide and a structure guide, which guide the denoising process of the diffusion model. To ensure that this decoupled guidance is faithfully implemented, we propose a corresponding decoupled loss optimization module, which uses independent loss functions to supervise the final output color and structure. By combining the forward decoupled guidance with the backward decoupled supervision, we construct a closed-loop optimization framework. This framework enables the model to collaboratively optimize color and structure under various degradation scenarios. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art approaches in a variety of underwater scenes, particularly those degraded by color casts and haze. Furthermore, it exhibits superior performance on no-reference image quality assessment metrics. The source code is available at https://github.com/zy-world/DCD-UIE
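
The decoupling property the abstract relies on can be seen in a small sketch using Python's standard `colorsys` module: hue and saturation are treated as the color part and value as the structure part, following the abstract's description. The helper names and the single-pixel example are illustrative assumptions and are unrelated to the released code.

```python
import colorsys


def split_color_structure(rgb):
    """Split an RGB pixel (floats in [0, 1]) into color (H, S) and structure (V)."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return (h, s), v


def recombine(color, structure):
    """Rebuild an RGB pixel from a color pair (H, S) and a structure value V."""
    h, s = color
    return colorsys.hsv_to_rgb(h, s, structure)


# A dim cyan-tinted pixel: brighten only the structure channel.
color, structure = split_color_structure((0.2, 0.4, 0.4))
brighter_rgb = recombine(color, 0.8)
color_after, structure_after = split_color_structure(brighter_rgb)
```

Changing V alters brightness while H and S stay fixed, which is why separate guidance and loss terms for color and structure are possible in this space but not directly in RGB.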

Abstract:
Magnetic particle imaging technology, a novel medical imaging technology, offers rapid imaging and high penetration depth and is free from ionizing radiation. However, the system point spread function causes imaging blurring, which can be further exacerbated by external environmental interferences. Although hardware improvements and system optimization can mitigate blurring, these approaches are often expensive and time-consuming, particularly for low-field imaging in large-scale systems. This article proposes a Fast Context-aware Saliency-enhanced Deblurring Network, FCS-edNET, to solve this challenging issue by deblurring the reconstructed images. The network introduces the Multi-scale Global module to enhance the multi-scale feature perception ability. The Multi-scale Denoising Prior algorithm, which employs a low-frequency filter operator to restrict image noise and offers priors for each layer of subnetworks, is designed to improve the model robustness. Finally, a Multi-level Joint loss is proposed to optimize the model parameters, improving the model's convergence speed and spatial distribution simulation capability. Extensive experiments on multiple public and private datasets demonstrate that FCS-edNET outperforms state-of-the-art methods in MPI image deblurring while remaining efficient, suggesting its potential to support future research toward clinical imaging applications. The code is available at https://github.com/ydz1118/FCS-edNET

Abstract:
Vision Language Models (VLMs), pre-trained on large-scale image-text datasets, enable zero-shot predictions for unseen data but may underperform on specific unseen tasks. Continual learning (CL) can help VLMs effectively adapt to new data distributions without joint training, but faces challenges of catastrophic forgetting and generalization forgetting. Although significant progress has been achieved by distillation-based methods, they exhibit two severe limitations. First, the widely adopted single-teacher paradigm fails to impart comprehensive knowledge. Second, existing methods inadequately leverage the multimodal information in the original training dataset and instead rely on additional data for distillation, which increases computational and storage overhead. To mitigate both limitations, by drawing on Knowledge Integration Theory (KIT), we propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages: Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. During the four stages, we first leverage prototypes to align across modalities, eliciting cross-modal knowledge, and then add new knowledge by constructing fine-grained intra- and inter-modality relationships with prototypes. After that, knowledge from two teacher models is adaptively distinguished and re-weighted. Finally, we connect intra- and inter-task models, integrating preceding and new knowledge. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks, showcasing its potential in adapting VLMs to evolving data distributions.

Abstract:
Image super-resolution (SR) aims to recover high-resolution images from low-resolution ones, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling the convolution theorem through token mixing, we propose a Fourier token-based plug-in called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mixing technologies when applied as plug-ins. Furthermore, compared to convolutions and window-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34 dB for existing efficient SR methods on the Manga109 test set at the scale of ×4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.

Abstract:
Face detection accuracy significantly decreases under rotational variations, including in-plane (RIP) and out-of-plane (ROP) rotations. ROP is particularly problematic due to its impact on landmark distortion, which leads to inaccurate face center localization. Meanwhile, many existing rotation-invariant models are primarily designed to handle RIP and often fail under ROP because they lack the ability to capture semantic and topological relationships. Moreover, existing datasets frequently suffer from unreliable landmark annotations caused by imperfect ground truth labeling, the absence of precise center annotations, and imbalanced data across different rotation angles. To address these challenges, we propose a topology-guided semantic face center estimation method that leverages graph-based landmark relationships to preserve structural integrity under both RIP and ROP. Additionally, we construct a rotation-aware face dataset with accurate face center annotations and balanced rotational diversity to support training under extreme pose conditions. Next, we introduce a Hybrid-ViT model that fuses CNN spatial features with transformer-based global context and employs a center-guided module for robust landmark localization under extreme rotations. In order to evaluate center quality, we further design a hybrid metric that combines topological geometry with semantic perception for a more comprehensive evaluation of face center accuracy. Finally, experimental results demonstrate that our method outperforms state-of-the-art models in cross-dataset evaluations. Code: https://github.com/Catster111/TCE_RIFD.

Abstract:
Small object detection (SOD) constitutes a notable yet immensely arduous task, stemming from the restricted informative regions inherent in size-limited instances, which further sparks off heightened uncertainty beyond the capacity of current two-stage detectors. Specifically, the intrinsic ambiguity in small objects undermines the prevailing sampling paradigms and may mislead the model to devote futile effort to those unrecognizable targets, while the inconsistency of features utilized for the detection at two stages further exposes the hierarchical uncertainty. In this paper, we develop an Uncertainty learning framework for Small Object Detection, dubbed as Unc-SOD. By incorporating an auxiliary uncertainty branch to conventional Region Proposal Network (RPN), we model the indeterminacy at instance-level which later on serves as a surrogate criterion for sampling, thereby unearthing adequate candidates dynamically based on the varying degrees of uncertainty and facilitating the learning of proposal networks. In parallel, a Perception-and-Interaction strategy is devised to capture rich and discriminative representations, through optimizing the intrinsic properties from the regional features at the original pyramid and the assigned one, in which the perceptual process unfolds in a mutual paradigm. As the seminal attempt to model uncertainty in SOD task, our Unc-SOD yields state-of-the-art performance on two large-scale small object detection benchmarks, SODA-D and SODA-A, and the results on several SOD-oriented datasets including COCO, VisDrone, and Tsinghua-Tencent 100K also exhibit the promotion to baseline detector. This underscores the efficacy of our approach and its superiority over prevailing detectors when dealing with small instances.

Abstract:
Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM’s perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM’s low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model’s segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM/

Abstract:
Directly reconstructing 3D CT volume from few-view 2D X-rays using an end-to-end deep learning network is a challenging task, as X-ray images are merely projection views of the 3D CT volume. In this work, we facilitate complex 2D X-ray image to 3D CT mapping by incorporating new view synthesis, and reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view parameter-guided encoder captures features from X-rays that are spatially aligned with CT. Next, we concatenate the extracted dual-view features as conditions for the latent diffusion model to learn and refine the CT latent representation. Finally, the CT latent representation is decoded into a CT volume in pixel space. By incorporating view parameter guided encoding and dual-view guided CT reconstruction, our DVG-Diffusion can achieve an effective balance between high fidelity and perceptual quality for CT reconstruction. Experimental results demonstrate our method outperforms state-of-the-art methods. Based on experiments, the comprehensive analysis and discussions for views and reconstruction are also presented. The model and code are available at https://github.com/xiexing0916/DVG-Diffusion

Abstract:
Compared to traditional computed tomography (CT), photon-counting detector (PCD)-based CT provides significant advantages, including enhanced CT image contrast and reduced radiation dose. However, owing to the current immaturity of PCD technology, scanned PCD data often contain stripe artifacts resulting from non-functional or defective detector units, which subsequently introduce ring artifacts in reconstructed CT images. The presence of ring artifacts may compromise the accuracy of CT values and even introduce pseudo-structures, thereby reducing the application value of CT images. In this paper, we propose a dual-domain optimization model that takes advantage of the distribution characteristics of the stripe artifacts in 3D projection data and the prior features of reconstructed 3D CT images. Specifically, we demonstrate that stripe artifacts in 3D projection data exhibit both group sparsity and low-rank properties. Building on this observation, we propose a TLT (TV-$\ell_{2,1}$-Tucker) model to eliminate ring artifacts in PCD-based cone beam CT (CBCT). In addition, an efficient iterative algorithm is designed to solve the proposed model. The effectiveness of both the model and the algorithm is evaluated through simulated and real data experiments. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches.
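
To make the group-sparsity term concrete, here is a small sketch of an l2,1 norm on a 2D array, computed as the sum of the Euclidean norms of the columns. Treating each detector column as a group, and the toy 4x4 arrays, are assumptions made for illustration; the paper applies the idea to 3D projection data and its exact grouping is not stated in the abstract.

```python
import math


def l21_norm(matrix):
    """Sum over columns of each column's Euclidean (l2) norm."""
    n_cols = len(matrix[0])
    return sum(math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols))


# A stripe: one defective detector column is offset at every projection angle (rows).
stripe = [[0.0, 3.0, 0.0, 0.0] for _ in range(4)]

# The same total energy (sum of squares = 36) spread evenly over all entries.
spread = [[1.5, 1.5, 1.5, 1.5] for _ in range(4)]

stripe_norm = l21_norm(stripe)
spread_norm = l21_norm(spread)
```

The stripe pattern has a much smaller l2,1 norm than the spread pattern with equal energy, so penalizing this norm favors artifact estimates concentrated in a few columns.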

Affiliations: Hubei Key Laboratory of Applied Mathematics, the Faculty of Mathematics and Statistics, and the Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Hubei University, Wuhan, China; Department of Geography and Spatial Information Techniques, Ningbo Key Laboratory of Remote Sensing and Ecological Security of Coastal Zone and Zhejiang-Germany Joint Laboratory on Remote Sensing of Coastal Ecosystem, Ningbo University, Ningbo, China; Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, Macau, China; Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS, USA

Abstract:
Classifying hyperspectral remote sensing images across different scenes has recently emerged as a significant challenge. When only historical labeled images (source domain, SD) are available, it is crucial to leverage these images effectively to train a model with strong generalization ability that can be directly applied to classify unseen samples (target domain, TD). To address these challenges, this paper proposes a novel single-domain generalization (SDG) network, termed the domain-aware adversarial domain augmentation network (DADAnet) for cross-scene hyperspectral image classification (HSIC). DADAnet involves two stages: adversarial domain augmentation (ADA) and task-specific training. ADA employs a progressive adversarial generation strategy to construct an augmented domain (AD). To enhance variability in both spatial and spectral dimensions, a domain-aware spatial-spectral mask (DSSM) encoder is constructed to increase the diversity of the generated adversarial samples. Furthermore, a two-level contrastive loss (TCC) is designed and incorporated into the ADA to ensure both the diversity and effectiveness of AD samples. Finally, DADAnet performs supervised learning jointly on the SD and AD during the task-specific training stage. Experimental results on two public hyperspectral image datasets and a new Hangzhouwan (HZW) dataset demonstrate that the proposed DADAnet outperforms existing domain adaptation (DA) and domain generalization (DG) methods, achieving overall accuracies of 80.69%, 63.75%, and 87.61% on three datasets, respectively.

Abstract:
Multi-focus image fusion (MFIF) addresses the challenge of partial focus by integrating multiple source images taken at different focal depths. Unlike most existing methods that rely on complex loss functions or large-scale synthetic datasets, this study approaches MFIF from a novel perspective: optimizing the input space. The core idea is to construct a high-quality MFIF input space in a cost-effective manner by using intermediate features from well-trained, non-MFIF networks. To this end, we propose a cascaded framework comprising two feature extractors, a Feature Distillation and Fusion Module (FDFM), and a focus segmentation network Y^UNet. Based on our observation that discrepancy and edge features are essential for MFIF, we select an image deblurring network and a salient object detection network as feature extractors. To transform these extracted features into an MFIF-suitable input space, we propose FDFM as a training-free feature adapter. To make FDFM compatible with high-dimensional feature maps, we extend the manifold theory from the edge-preserving field and design a novel isometric domain transformation. Extensive experiments on six benchmark datasets show that 1) our model consistently outperforms 13 state-of-the-art methods in both qualitative and quantitative evaluations, and 2) the constructed input space can directly enhance the performance of many MFIF models without additional requirements.

Abstract:
Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions; 2) preserving the identity throughout the makeup process; and 3) enabling precise control over fine details. To address these requirements, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multi-view effects. Coherent Duplication optimizes a global UV map by recording the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

Abstract:
Hyperspectral image (HSI) change detection is a technique that identifies the changes occurring between bitemporal HSIs covering the same geographic area. Numerous change detection methods have been proposed and successfully implemented. However, a majority of these approaches adhere to the centralized learning paradigm, which requires data transmission to a central server for training. The sensitivity of remote sensing data generally prohibits their sharing across different clients. Furthermore, manual labeling is costly in practice. In this paper, we propose a spatial-spectral-temporal collaborative Mamba-based active federated hyperspectral change detection (MambaFedCD) framework, which utilizes the limited labeled samples from multiple clients to achieve change detection while ensuring the data privacy of each client. Specifically, there are three key characteristics: 1) a spatial-spectral-temporal collaborative Mamba-based change detection ($\text{S}^2\text{TMamba}$) model is proposed to efficiently synergize the temporal and global spatial-spectral information of the bitemporal HSIs for change detection; 2) a difference feature diversity correction-based model aggregation (DFDCMA) strategy is devised to incorporate the diversity of difference features for rational allocation of weight factors among clients and to facilitate effective aggregation of the global model; 3) a multi-decision federated active learning (MDFAL) strategy is proposed that selects both error-prone and valuable samples for model training to alleviate the burden of sample labeling. Comprehensive experiments conducted on commonly utilized datasets demonstrate that the proposed method outperforms other state-of-the-art methods. The code is available at https://github.com/Jiahuiqu/MambaFedCD
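
As a rough illustration of the aggregation step described above, the following sketch shows a generic weighted average of client model parameters in the style of federated averaging. It is not the DFDCMA strategy itself: how the abstract's "weight factors" are derived from difference feature diversity is not specified here, so the `weights` argument, the function name `aggregate`, and the dictionary layout of the parameters are all assumptions made for illustration.

```python
from typing import Dict, List


def aggregate(client_params: List[Dict[str, List[float]]],
              weights: List[float]) -> Dict[str, List[float]]:
    """Return the weighted average of per-client parameter vectors.

    `weights` stands in for whatever per-client weight factors the server
    computes (hypothetical here); they are normalized to sum to one.
    """
    if not client_params or len(client_params) != len(weights):
        raise ValueError("need one weight per client and at least one client")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    merged: Dict[str, List[float]] = {}
    for name, reference in client_params[0].items():
        # Average each coordinate of this parameter across all clients.
        merged[name] = [
            sum(nw * params[name][i] for nw, params in zip(norm, client_params))
            for i in range(len(reference))
        ]
    return merged


# Two hypothetical clients sharing one parameter vector named "w".
clients = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
global_model = aggregate(clients, weights=[1.0, 3.0])
```

Only parameters leave the clients in this scheme; the raw images stay local, which is the privacy property the abstract relies on.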

Abstract:
In video-text cross-domain retrieval tasks, the generalization ability of the retrieval models is key to improving their performance and is crucial for enhancing their practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On one hand, models tend to overfit specific training domain data, resulting in poor cross-domain matching and significantly reduced retrieval accuracy when dealing with data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing model generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model’s ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes an innovative collaborative augmentation approach named MDA-MAA, which includes two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module applies masking to the original video frame features and uses an attention mechanism to predict the masked features, effectively reducing overfitting to training data and enhancing model generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, combined with the original video frames, are integrated into a diffusion model for joint learning, ultimately generating semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution. Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.

Abstract:
Denoising is one of the fundamental steps of the processing pipeline that converts data captured by a camera sensor into a display-ready image or video. It is generally performed early in the pipeline, usually before demosaicking, although studies swapping their order or even conducting them jointly have been proposed. With the advent of deep learning, the quality of denoising algorithms has steadily increased. Even so, modern neural networks still have a hard time adapting to new noise levels and scenes, an ability that is indispensable for real-world applications. With these challenges in mind, we propose a self-similarity-based denoising scheme that weights both a pre- and a post-demosaicking denoiser for Bayer-patterned CFA video data. We show that a balance between the two leads to better image quality, and we empirically find that higher noise levels benefit from a higher influence of pre-demosaicking denoising. We also integrate temporal trajectory prefiltering steps before each denoiser, which further improve texture reconstruction. The proposed method only requires an estimation of the noise model at the sensor, accurately adapts to any noise level, and is competitive with the state of the art, making it suitable for real-world videography.

Abstract:
Vision-based 3D object detection (3DOD) has gained considerable attention due to its low deployment cost compared to LiDAR-based approaches, but it suffers from labor-intensive data annotation. At the same time, active learning (AL) has shown great potential for reducing annotation costs in related tasks by maximizing model performance with very limited labeled data. In this paper, we explore active learning for vision-based 3DOD for the first time. Inspired by entropy analysis, we consider three aspects to characterize sample informativeness: sample diversity in input space, feature informativeness in BEV space, and result distribution in prediction space. Based on these aspects, we propose a novel AL framework named HMAD, which utilizes Height Modeling and Adaptive Diversity-based sampling for comprehensive informativeness characterization. In HMAD, we first propose a novel height-guided adversarial module in BEV space, which measures the informativeness of height modeling for 2D-to-3D mapping in an adversarial manner. Furthermore, Budget-aware SpatioTemporal diversity Sampling (BSTS) and Class Balance Sampling (CBS) are proposed to adaptively measure the sample informativeness in input and prediction space, respectively. Finally, the three components are integrated into a two-stage sampling strategy, with which the most informative samples can be selected and annotated for the next iteration. Experiments show that HMAD achieves comparable performance using only 50% of the annotated training data and generalizes well under different conditions.

Abstract:
With the development of real-time video conferences, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which brings a new challenge to Computer Vision (CV) for communications. In addition, many directions for CV in video, like recognition, understanding, saliency segmentation, coding, and so on, do not satisfy the demands of the multiple tasks of interactivity without integration. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to handle them. Therefore, we propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the requirement of interactivity in the multimedia service. Firstly, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos, with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA) and Segment Anything Model 2 (SAM2), to accomplish video semantic segmentation. Secondly, in the transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which is also suitable for the weights of asymmetric semantic information in real-time video, so that we achieve a low bit rate and high semantic fidelity in the video transmission. Thirdly, at the receiver, RTVCFM provides multidimensional fusion with the whole semantic segmentation by using the Diffusion Model for Foreground Background Fusion (DMFBF), and then we reconstruct the video streams. Finally, simulation results demonstrate that RTVCFM can achieve a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), which shows that the reconstructed video is relatively similar to the original video.

Abstract:
Generic deep learning (DL) networks for image restoration like denoising and interpolation lack mathematical interpretability, require voluminous training data to tune large parameter sets, and are fragile in the face of covariate shift. To address these shortcomings, we build interpretable networks by unrolling variants of a graph-based optimization algorithm of different complexities. Specifically, for a general linear image formation model, we first formulate a convex quadratic programming (QP) problem with a new $\ell_2$-norm graph smoothness prior called gradient graph Laplacian regularizer (GGLR) that promotes piecewise planar (PWP) signal reconstruction. To solve the posed unconstrained QP problem, instead of computing a linear system solution straightforwardly, we introduce a variable number of auxiliary variables and correspondingly design a family of ADMM algorithms. We then unroll them into variable-complexity feedforward networks, amenable to parameter tuning via back-propagation. More complex unrolled networks require more labeled data to train more parameters, but have better overall performance. The unrolled networks have periodic insertions of a graph learning module, akin to a self-attention mechanism in a transformer architecture, to learn pairwise similarity structure inherent in data. Experimental results show that our unrolled networks perform competitively to generic DL networks in image restoration quality while using only a fraction of parameters, and demonstrate improved robustness to covariate shift.
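
To make the "unconstrained QP with a graph smoothness prior" concrete, here is a minimal sketch of the simpler, standard graph Laplacian regularizer (GLR) rather than the paper's GGLR. It assumes a pure denoising model (identity observation operator), a 1D signal on a path graph with unit edge weights, and a hand-picked weight `mu`; the function name `glr_denoise` is hypothetical. Minimizing $\|y - x\|_2^2 + \mu\, x^\top L x$ gives the linear system $(I + \mu L)\,x = y$, which is tridiagonal here and is solved directly with the Thomas algorithm. This is the straightforward linear-system route that the abstract contrasts with its ADMM-based unrolling.

```python
from typing import List


def glr_denoise(y: List[float], mu: float) -> List[float]:
    """Solve (I + mu * L) x = y for a path graph with unit edge weights.

    L is the combinatorial graph Laplacian of the path: node degrees are
    1 at both ends and 2 elsewhere, and every off-diagonal edge entry is -1.
    """
    n = len(y)
    if n == 0:
        return []
    if n == 1:
        return [float(y[0])]

    # Diagonal of I + mu * L, and the constant sub/super-diagonal value.
    diag = [1.0 + mu * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = -mu

    # Thomas algorithm: forward elimination.
    c = [0.0] * n  # modified super-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (y[i] - off * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# A single spike is spread over its neighbours; larger mu smooths more.
smoothed = glr_denoise([0.0, 0.0, 10.0, 0.0, 0.0], mu=1.0)
```

Because the Laplacian annihilates constant signals, a flat input passes through unchanged and the sum of the signal is preserved, which is a quick sanity check on any implementation of this prior.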

Abstract:
Multimodal perception and fusion play a vital role in uncrewed aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at https://github.com/lingf5877/AFFNet

Abstract:
Self-supervised monocular depth estimation for fisheye cameras has attracted much attention in recent years due to their large view range. However, the performances of existing methods in this field are generally limited due to the inevitable severe distortions in fisheye images. To address this problem, we propose a distortion-aware depth self-updating network for self-supervised fisheye monocular depth estimation called DDS-Net. The proposed DDS-Net method employs a coarse-to-fine learning strategy, in which an explored fine depth predictor for predicting final depth is optimized with the predicted scene depths by a pretrained coarse depth predictor. The fine depth predictor contains a distortion-aware fisheye cost volume construction module and a depth self-updating module. The distortion-aware fisheye cost volume construction module is designed to construct a fisheye cost volume by learning the corresponding feature matching cost between continuous fisheye frames, which enables more accurate pixel-level depth cues to be captured under severe distortions. Based on the constructed cost volume and the initial depth estimated by the pretrained coarse depth predictor, the depth self-updating module is designed to self-update the depth map in an iterative manner. Extensive experimental results on 3 fisheye datasets demonstrate that the proposed method significantly outperforms 14 state-of-the-art methods for fisheye monocular depth estimation.

Abstract:
Text removal is an important task in processing both scene and document images. However, existing scene text removal (STR) methods primarily focus on scene text images. STR models trained on scene text images perform poorly on document images with dense, complex textured backgrounds. We discover that the limitations of existing methods can be attributed to the difficulties in estimating background features in the regions to be erased, which relies on knowledge from neighboring regions in the input images and priors learned from the training data. Background feature estimation degrades in cross-domain scenarios and compromises the quality of STR results. To address these issues, we introduce DiffEraser, a novel text removal framework that leverages prior knowledge from the Latent Diffusion Model (LDM) for removing text in both scene and document images. Our DiffEraser incorporates two key innovations to fully exploit the prior knowledge of LDM. First, we replace the conventional Variational Auto-Encoder (VAE) encoder with a Diffusion-Prior (DP) encoder, aiming to integrate the heterogeneous information from the LDM prior knowledge in latent space with the multi-level encoded features of the input image. Second, we introduce a Latent-Fusion (LF) decoder that integrates the heterogeneous features from both the LDM and DP encoders to generate high-quality text-erased results. To evaluate the generalization performance of our DiffEraser, we focus on the cross-domain protocols and construct a document image dataset, NPID295, which contains 295 types of passports and identity cards. Notably, when trained on a scene text dataset, DiffEraser significantly outperforms existing STR methods on the challenging NPID295 dataset. The resources of this work will be available online upon acceptance.

Abstract:
Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at https://github.com/deng-ai-lab/MTRAG

Abstract:
Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from images of multiple modalities, including visible, depth, and thermal images, with a few annotated samples. However, some efforts treat the three modalities equally and do not account for the inherent differences among them. Besides, the objects vary greatly in size, and the cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (i.e., SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. Firstly, in the encoder part, after the extraction of multi-level initial features, we fuse each level’s RGB feature and thermal feature, yielding the support features and the query features. Secondly, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module is deployed to explore the correlation between each level’s support feature and the query feature, where the pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating the coarse mask for the query image. Thirdly, in the feature elevation block, we employ the prior-related fusion (PF) module to integrate the depth image with a coarse mask via the cross-attention mechanism, yielding the enhanced coarse prediction result, which is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object internal regions and spatial details, and further generate the final segmentation results via conventional convolution layers. Extensive experiments are conducted on the VDT-2048-$5^i$ dataset, and the results show that our model outperforms the state-of-the-art methods by a large margin.

Abstract:
RGB-Thermal (RGB-T) tracking enhances visual tracking robustness by combining RGB and thermal infrared (TIR) modalities, addressing limitations of RGB-only trackers under challenging conditions such as low light and appearance variations. However, most existing RGB-T trackers rely on complex fusion modules or modality-specific architectures, sacrificing efficiency for performance. In this paper, we propose a novel Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to the RGB-T setting without modifying the network architecture or adding any extra parameters. RGB and TIR inputs are jointly processed through a shared backbone, and training is guided by a combination of self-supervised and supervised objectives to enhance cross-modal feature representation. The self-supervised component includes a contrastive loss that aligns semantically consistent regions across template-search pairs, as well as a modality-gap alignment loss that reduces discrepancies between RGB and TIR features. These internal signals complement task-driven supervision, including an intermediate focal loss that strengthens early localization by enhancing shallow and mid-level features, modality-specific losses that preserve distinctive cues under partial modality degradation, and a fused tracking loss that drives final bounding box prediction. Comprehensive evaluations on LasHeR, RGBT234, and GTOT benchmarks demonstrate that MSD achieves state-of-the-art tracking accuracy while maintaining the computational efficiency of the original RGB tracker. Our work establishes a new paradigm in multi-modal tracking by demonstrating that optimized training strategies can outperform complex architectural modifications, offering significant practical advantages for real-world deployment.

Abstract:
Partially supervised Compositional Zero-Shot Learning (pCZSL) recognizes new compositions of states and objects, where for every image in the training set either the state or the object annotation is available. In pCZSL, features of a state vary depending on the object in the composition (e.g., the features of the state ripe are different for ripe banana and ripe apple). Understanding the variation in features across scales of objects is also a key challenge. In the proposed architecture, a Swin Transformer-based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. The Discriminative Context Aggregation module utilizes features from the intermediate layers of the HFE to understand the features of objects at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass strongly and weakly augmented versions of the input image to the proposed architecture. The predicted class probabilities for strongly and weakly augmented images are encouraged to be similar by minimizing a distribution alignment loss. This loss incorporates a class-specific re-weighting approach to alleviate the effect of data imbalance in pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.
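
The weak/strong consistency idea above can be sketched as follows. The abstract does not give the exact form of the distribution alignment loss or of the re-weighting, so this example makes two assumptions: the loss is a cross-entropy that treats the weak-view prediction as the target for the strong view, and `class_weights` is a hypothetical per-class weight vector (for example, larger for rare classes). The function names are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_alignment_loss(weak_logits: List[float],
                            strong_logits: List[float],
                            class_weights: List[float]) -> float:
    """Class-weighted cross-entropy between weak-view and strong-view predictions.

    The weak-view distribution acts as the target; each class term is scaled
    by its (assumed) class weight before summation.
    """
    target = softmax(weak_logits)
    pred = softmax(strong_logits)
    eps = 1e-12  # guards against log(0) for extremely confident predictions
    return -sum(w * t * math.log(max(p, eps))
                for w, t, p in zip(class_weights, target, pred))


# Hypothetical logits for one image under the two augmentations.
weak = [2.0, 0.5, -1.0]
strong = [0.3, 1.2, -0.4]
loss = weighted_alignment_loss(weak, strong, class_weights=[1.0, 1.0, 2.0])
```

In a training loop the weak-view prediction would normally be treated as a fixed target (no gradient), so that the strong view is pulled toward it rather than the reverse; that detail is also an assumption here.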

Abstract:
The goal of compositional zero-shot learning (CZSL) is to train a model to recognize images containing known attribute-object pairs. This reduces the reliance on extensive training data and enables the model to identify unseen combinations. Current CZSL methods face several challenges, including multiple attributes for a single object, disconnected training and test sets, long-tailed distribution of visual categories, and substantial differences in state representation between different objects. These factors collectively impede the precise identification of new combinations. In response to these challenges, we propose a Semi-Negative Contrastive Subclass Discriminative Network (SN-CSDN) based on contrastive learning. Firstly, we propose a semi-negative sampling strategy that incorporates carefully selected negative samples into the training process. This approach enables the model to effectively distinguish between different classes while enhancing its ability to capture fine-grained subclass features. By improving the model’s sensitivity to inter-class differences and refining its recognition of subtle intra-class variations, this strategy significantly boosts overall discrimination performance. Additionally, we introduce a decoupled network branch designed to capture the intricate relationships between attributes and objects by generating more representative compositional embeddings. This branch leverages subclass information to ensure an accurate classification of synthesized embeddings while preserving the inherent visual distinctions of the original decoupled embeddings across different combinations. By improving feature representation capacity and mitigating sample imbalance, this design effectively improves model performance in long-tailed distributions. Our method has been comprehensively evaluated on three benchmark datasets, with results showing significant performance improvements that demonstrate the method’s effectiveness and reliability.

Abstract:
Lensless cameras, innovatively replacing traditional lenses for ultra-thin, flat optics, encode light directly onto sensors, producing images that are not immediately recognizable. This compact, lightweight, and cost-effective imaging solution offers inherent privacy advantages, making it attractive for privacy-sensitive applications like face verification. Typical lensless face verification adopts a two-stage process of reconstruction followed by verification, incurring privacy risks from reconstructed faces and high computational costs. This paper presents an end-to-end optimization approach for privacy-preserving face verification directly on encoded lensless captures, ensuring that the entire software pipeline remains encoded with no visible faces as intermediate results. To achieve this, we propose several techniques to address unique challenges from the lensless setup which precludes traditional face detection and alignment. Specifically, we propose a face center alignment scheme, an augmentation curriculum to build robustness against variations, and a knowledge distillation method to smooth optimization and enhance performance. Evaluations in both simulation and real environments demonstrate that our method outperforms two-stage lensless verification while enhancing privacy and efficiency.

Abstract:
Deep learning is the mainstream method for medical image segmentation, and neural architecture search (NAS) has also been developed for this task. However, existing NAS methods remain limited in their ability to search for high-performance yet lightweight network architectures due to the high computational cost of NAS and the low fidelity of performance evaluation during the search process. In this paper, we propose a novel once-for-all NAS method for medical image segmentation to address these challenges. A carefully designed search space (supernet) incorporating key components of U-shaped networks is constructed specifically for medical image segmentation. An effective and efficient hybrid two-stage supernet training scheme is then designed to enhance supernet training while maintaining a balance between performance and computational cost. A multi-objective evolutionary algorithm is leveraged to search for sets of network architectures, producing high-performing architectures with varying computational complexities, optimized under multiple objectives. We conduct experiments on six widely used medical image segmentation datasets. Compared with existing methods, the proposed method achieves state-of-the-art performance on all six datasets. The searched architectures exhibit an excellent trade-off between performance and computational complexity, which is attributed to the effective multi-objective search. Our source code is available at https://github.com/jiahongwei21-lang/MOOFA4MIS.

Affiliations: School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, China; School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; Mindray Bio-Medical Electronics Company, Shenzhen, China; School of Electronic Engineering and Computer Science, Queen Mary University of London, London, U.K.; Department of Magnetic Resonance Imaging, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China; State Key Laboratory of Phytochemistry and Natural Medicines, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, China

Abstract:
Accurate cross-modality cardiac image segmentation is essential for effectively diagnosing and treating heart disease. Different imaging modalities help to determine suitable pre-procedure planning. However, most methods face the difficulty of spatial-temporal confounding, where the anatomy element and modality element of cardiac images are intertwined across both spatial and temporal dimensions. This confounding arises from the imaging diversity and structural diversity of cardiac images, and it hinders knowledge transfer between cardiac images of different modalities. In this paper, we propose a novel dynamic causal learning (DCL) method to address spatial-temporal confounding. The DCL explores multi-dimensional causal intervention to consider not only the causal relationship between images and labels, but also the causality in the time and space dimensions. It integrates historical optimal interventions and facilitates the transfer of this knowledge across temporal contexts. In addition, the DCL utilizes the diffusion mechanism to further ensure that the extracted anatomy element remains causally invariant, improving model performance across multiple imaging modalities. Extensive experiments on cross-modality cardiac images (MR, CT, and US) demonstrate the effectiveness of the DCL (mean Dice = 0.951), outperforming other advanced segmentation methods. DCL is freely accessible at https://github.com/asdww0721ww/DCL

Abstract:
Fine-grained cross-view localization seeks to estimate precise camera poses by matching ground images with GPS-tagged aerial imagery. Existing methods typically employ first-order iterative optimization to progressively update the camera pose based on cross-view feature correspondences. However, they rely on local features and neglect global and complementary contextual information, making them prone to local optima and slow convergence under large initial errors or strong disturbances. To overcome these limitations, we propose a second-order robust iterative pose estimation framework for fine-grained cross-view localization. Firstly, we devise a second-order deep iterative optimization module to capture complementary forward and backward motion cues, leading to a bidirectional correlation volume. A motion aggregator uses the volume to approximate the dynamics of second-order iterators, substantially facilitating convergence and robustness. In addition, a bidirectional motion-aware robust regularization module mitigates geometric distortions and outlier interference by leveraging bidirectional motion cues to generate fine-grained confidence maps, adaptively suppressing unreliable regions and enhancing the stability of iterative optimization and pose estimation accuracy. Extensive experiments demonstrate that the proposed framework achieves faster convergence and higher pose estimation accuracy than state-of-the-art methods, particularly under large initial errors and challenging conditions.

Abstract:
Open-vocabulary multiple object tracking (MOT) aims to track arbitrary objects in the real world. Although significant progress has been achieved in object classification by leveraging the knowledge from large vision-language models, advances in data association for open-vocabulary MOT remain limited. Existing methods primarily rely on appearance cues to establish associations. However, these cues are often unreliable in the face of occlusions and ambiguous object appearances, resulting in suboptimal tracking performance in complex scenarios. In this paper, we propose a novel open-vocabulary MOT method, Spatial-temporal Scene Graph Tracker (SSGTrack), which introduces a fundamentally different approach to data association by building a Spatial-temporal Scene Graph (SSG) that captures rich semantic and spatial relationships between objects across adjacent frames. Specifically, SSGTrack constructs proposal-level relationships by extracting diverse contextual information from the multi-head self-attention layers of the Transformer decoder. These relationships, derived from the keyframe and reference frame, are compressed into the compact SSG, where nodes represent detected objects, and edge weights denote frame-level connectivity. Furthermore, to address the challenge of differentiating visually similar objects and background distractors, we propose a Context-aware Contrastive Learning (CCL) strategy. By identifying background features that significantly differ from positive samples and incorporating them as negative samples, CCL enhances the ability of the model to learn discriminative representations, thus improving tracking robustness. Extensive experiments conducted on several challenging MOT benchmarks demonstrate the effectiveness of our method, which achieves superior tracking performance.

Affiliations: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China; College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China; Public Computer Education and Research Center, Jilin University, Changchun, China; College of Computer Science and Technology, Changchun University, Changchun, China; Thoughtworks, Cross Street, Singapore; School of Computer Science and Technology, Zhejiang University, Hangzhou, China

Abstract:
Action recognition has long been a fundamental and compelling problem in the field of computer vision. However, one aspect that has been overlooked so far is that current action recognition approaches often produce an unfavourable multi-peaked distribution when identifying the action class of a given motion sequence, which is ambiguous and hard to learn for neural networks. Moreover, current methods heavily rely on neural networks to extract action features for differentiating actions, lacking theoretical constraints ensuring that action-specific features are selectively extracted and ambiguous features common to multiple actions are effectively reduced. These shortcomings culminate in inadequate action recognition accuracy. Motivated by this, in this paper we seek to tackle the problem from three aspects: 1) We try to eliminate ambiguity by enforcing a smooth single-peaked distribution instead of a multi-peaked one for action-class prediction. 2) We theoretically analyze the lower bound of the label prediction log-likelihood and derive a training objective, which focuses on the extraction of action-specific features and the reduction of ambiguous features. 3) We further advocate feeding the model with richer information, including positive information like body-part structures and negative information like masked inputs. Empirically, our approach sets the new state-of-the-art performance on five large-scale benchmarks. Our code is released at https://github.com/ActionR-Group/DPM to facilitate future research.

Abstract:
Existing meta-learning based few-shot object detection methods suffer from limitations in learning representative prototypes. Specifically, directly aggregating bounding box contents from support images into prototypes renders these methods vulnerable to background noise and the morphological intricacies of objects. Furthermore, these methods neglect the varied contributions of intra-class image-specific prototypes and fail to leverage semantic information effectively during prototype generation, resulting in suboptimal class representations due to naive average aggregation. To address these issues, we propose a Dual Prototype-Enhancement Network (DPENet), designed to optimize prototypes by improving support feature representation and enhancing prototype discriminability. Specifically, we introduce an Object Enhancement Module (OEM) based on dynamic hypergraph construction. This module employs hypergraph convolution to adaptively capture complex high-order semantic interactions among highly similar regions within support features, thereby highlighting salient features of target regions, suppressing background noise, and enhancing support feature representation. Moreover, we propose a Semantic Fusion Perception Module (SFPM) that generates more discriminative class-specific prototypes by integrating weighted intra-class prototype representations with text-based semantic embeddings. Experimental results demonstrate that DPENet significantly outperforms existing methods on the PASCAL VOC and MS COCO datasets.

Abstract:
Unsupervised visible infrared person re-identification (USVI-ReID) is a challenging retrieval task that retrieves cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations also provides additional difficulties for learning modality-invariant features. To facilitate this unsupervised cross-modal learning, we begin by leveraging the information contained in the cross-modality input and its predicted label. Aiming to minimize information loss, we optimize the model by incorporating entropy minimization, uniform label distribution, and cross-modality matching. In our approach, we design a loop iterative training strategy alternating between model training and cross-modality matching, where a uniform prior guided optimal transport assignment is proposed to select matched visible and infrared prototypes. This matching information is then utilized to minimize the intra- and cross-modality entropy. As a result, our model can gradually self-learn useful information, enabling it to generate discriminative representations for unlabeled cross-modal data. Extensive experimental results on benchmarks demonstrate the effectiveness of our method, e.g., 69.4% and 89.4% of Rank-1 accuracy on SYSU-MM01 and RegDB without any annotations. The code will be released soon.

Abstract:
Recently, new paradigms of camouflaged object detection (COD), such as referring COD (Ref-COD) and collaborative COD (Co-COD), have been proposed to enhance task performance. However, there remains a lack of in-depth exploration of how to utilize reference information more effectively. In this paper, we introduce in-context learning camouflaged object detection (ICL-COD) as a novel paradigm of COD, which leverages camouflaged image samples and their corresponding annotations as visual examples to guide the model in better perceiving camouflage and recognizing camouflaged objects. We propose the ICL-Camo network, with the design of a context mining module (CMM) to mine fine-grained contextual information contained in the visual examples, and a context guiding module (CGM) that utilizes the contextual information mined from the examples as guidance to shift the attention of the target image features on potential camouflaged regions, thus enhancing its perception of camouflaged objects. Extensive experiments conducted on the COD benchmarks and other relevant tasks demonstrate the effectiveness of our proposed ICL-COD paradigm and ICL-Camo network. Code and results are available at: https://github.com/h0t-zer0/ICL-Camo

Abstract:
In this paper, we tackle the open-set temporal action segmentation task, which aims to identify unknown frames while ensuring accurate segmentation of known actions in the temporal domain. Existing open-set methods struggle with identifying unknown frames due to their indistinguishability against ambiguous known frames during action transitions, resulting in significant performance degradation. To address this, we propose the action distribution flow, which models transitions between action sequences to capture the inherent feature discrepancies between unknown and known frames. Specifically, our method first models the distributions of known actions using the training data, and then interpolates these distributions along the optimal transport path for consecutive actions in the testing videos. By evaluating the likelihood of testing frames against the modeled action distribution flow, our approach effectively identifies unknown frames without requiring additional training or prior knowledge of the unknown data. Extensive experiments on open-set versions of the GTEA, 50Salads, and Breakfast datasets demonstrate the superiority of the proposed method across all evaluation metrics.
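The idea of interpolating known-action distributions along an optimal transport path and scoring test frames against them can be sketched as follows. This is only an illustration under strong assumptions not stated in the abstract: each action is modelled as a diagonal Gaussian (for which the Wasserstein-2 geodesic interpolates means and standard deviations linearly), and the function names and toy values are hypothetical.

```python
import numpy as np

def interpolate_gaussian(mean_a, std_a, mean_b, std_b, t):
    """Point at position t in [0, 1] on the W2 geodesic between two
    diagonal Gaussians (means and standard deviations interpolate linearly)."""
    mean_t = (1.0 - t) * mean_a + t * mean_b
    std_t = (1.0 - t) * std_a + t * std_b
    return mean_t, std_t

def log_likelihood(x, mean, std):
    """Log-density of x under a diagonal Gaussian."""
    z = (x - mean) / std
    return float(-0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi * std ** 2)))

def flow_score(x, mean_a, std_a, mean_b, std_b, n_steps=11):
    """Best log-likelihood of x over sampled points of the interpolation path."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return max(
        log_likelihood(x, *interpolate_gaussian(mean_a, std_a, mean_b, std_b, t))
        for t in ts
    )

# Two consecutive known actions (assumed toy parameters).
mean_a, std_a = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mean_b, std_b = np.array([4.0, 0.0]), np.array([1.0, 1.0])

transition_frame = np.array([2.0, 0.0])  # lies on the path between the actions
unknown_frame = np.array([2.0, 6.0])     # lies far from the path

score_transition = flow_score(transition_frame, mean_a, std_a, mean_b, std_b)
score_unknown = flow_score(unknown_frame, mean_a, std_a, mean_b, std_b)
```

A frame whose score falls below a chosen threshold would be flagged as unknown. An ambiguous transition frame keeps a high score because it lies close to the interpolated path.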

Abstract:
Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks drawn from various sources and formats leads to inevitable task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization ability of MoCLE on novel instructions. Extensive experiments on InstructBLIP and LLaVA demonstrate the effectiveness of MoCLE.

Abstract:
This paper focuses on domain incremental learning (DIL) for multiple vision tasks, including object detection, instance segmentation, and image classification. DIL aims to adapt a model to new domains over time without forgetting previously acquired knowledge. Recent DIL methods append learnable prompts to input embeddings of a frozen base model to learn from new domains. However, due to prompts’ limited representation ability, they struggle to adapt the feature space to new domain data distributions. To overcome this limitation, we propose a novel DIL method named Domain Difference Adapters (DD-Adapters). Through feature visualization and singular value analysis, we identify the cross-domain clustering ability of the base model and the low-rank property of domain difference. Based on these insights, our method imposes low-rank constraints on the base model to capture the principal components of domain differences, while freezing the base model to maintain its cross-domain clustering ability, thereby adapting to new domains effectively. Additionally, we introduce a prototype-guided domain selector (PDS) to dynamically select the appropriate DD-Adapters during inference, mitigating catastrophic forgetting in DIL. Extensive experimental evaluations on eight benchmark datasets demonstrate the performance superiority of the proposed method on three vision tasks, with minimal extra parameter usage.
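The combination of a frozen base model, a low-rank domain-difference term, and a prototype-guided selector can be sketched as follows. The sketch makes several assumptions that the abstract does not confirm: the adapter is written as an additive low-rank update `up @ down` to a single linear layer, the selector picks the domain whose prototype has the highest cosine similarity to the input feature, and all names and shapes are hypothetical.

```python
import numpy as np

def select_domain(feature, prototypes):
    """Return the index of the domain prototype closest to the feature
    (cosine similarity); a stand-in for a prototype-guided selector."""
    f = feature / np.linalg.norm(feature)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(p @ f))

def adapted_linear(x, frozen_weight, down, up):
    """Frozen base layer plus a low-rank domain-specific correction."""
    return frozen_weight @ x + up @ (down @ x)

rng = np.random.default_rng(0)
d, rank = 8, 2
frozen_weight = rng.normal(size=(d, d))  # never updated

# One (down, up) pair per domain; `up` starts at zero so that a new
# adapter initially leaves the base model's output unchanged.
adapters = [(rng.normal(size=(rank, d)), np.zeros((d, rank))) for _ in range(2)]

# Toy domain prototypes in the base model's feature space (assumed values).
prototypes = np.array([[1.0] + [0.0] * (d - 1),
                       [0.0, 1.0] + [0.0] * (d - 2)])

x = np.array([0.1, 0.9] + [0.0] * (d - 2))
k = select_domain(x, prototypes)
y = adapted_linear(x, frozen_weight, *adapters[k])
```

Each domain adds only `2 * d * rank` parameters here, which is in line with the abstract's claim of minimal extra parameter usage.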

Abstract:
Most existing self-supervised learning methods for skeleton-based temporal action segmentation (TAS) fail to capture the short-term motion semantics essential for dense frame-level prediction, as they typically learn representations that are either too coarse or motion-insensitive. This issue is reflected in local dimension collapse, which highlights the limitations of current approaches and suggests directions for improvement. Specifically, to address the issue of local dimension collapse for self-supervised learning in TAS, we propose the Local Dimension Enhancement (LoDE) framework, which introduces the local effective rank (LER) as a metric to measure and a learning objective to reduce this collapse. A new fine-grained representation scale, termed a motion unit, is defined as a temporal clip of consecutive skeleton frames to model skeleton data. Centered on this representation scale, we analyze existing methods (sequence-scale and frame-scale learning) with the tool of LER and theoretically demonstrate that introducing motion unit-scale learning is essential to alleviate local dimension collapse. Inspired by our theoretical insights, we design a multi-scale semantics module that integrates frame-, sequence-, and motion unit-scale learning, with LER-based regularization to enrich local representation diversity. These designs effectively alleviate local dimension collapse and lead to significant improvements in TAS, as evidenced by LoDE’s superior performance over state-of-the-art methods on three large-scale untrimmed datasets: PKUMMD, TSU, and BABEL. Our project website is available at https://carefreesun.github.io/LoDE_TIP_2026/
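To give a rough idea of how a local effective rank could be measured, the sketch below uses the standard effective-rank definition (the exponential of the Shannon entropy of the normalised singular values) and averages it over sliding temporal windows. The windowing scheme, the function names, and the toy inputs are assumptions; the exact LER definition used by LoDE may differ.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Effective rank: exp of the entropy of the normalised singular values."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

def local_effective_rank(features, window):
    """Mean effective rank over sliding windows of consecutive frames.

    features has shape (num_frames, feature_dim); `window` is a
    hypothetical motion-unit length in frames.
    """
    num_frames = features.shape[0]
    ranks = [
        effective_rank(features[t:t + window])
        for t in range(num_frames - window + 1)
    ]
    return float(np.mean(ranks))

# Collapsed representation: every frame points in the same direction.
collapsed = np.outer(np.arange(1.0, 9.0), np.ones(3))
# Diverse representation: frames span all feature dimensions.
diverse = np.tile(np.eye(4), (2, 1))

ler_collapsed = local_effective_rank(collapsed, window=4)
ler_diverse = local_effective_rank(diverse, window=4)
```

A value close to 1 indicates that the frames within a window occupy a single direction (local dimension collapse), whereas a value close to the feature dimension indicates diverse local representations.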

Abstract:
Specialized image restoration methods have been extensively explored, each targeting a specific type of degradation. However, real-world images often suffer from composite degradations, prompting growing interest in unified restoration approaches. While recent unified models have shown promising results, many are hindered by high computational complexity, limiting their deployment in resource-constrained settings. Motivated by the parameter-efficient design of Low-Rank Adaptation (LoRA), we propose an efficient attention module specifically designed for composite degradation image restoration. The proposed method adopts a dual-branch architecture, where one branch processes features at full resolution, and the other operates with reduced spatial and channel dimensions to improve efficiency. To better adapt to diverse degradation patterns, the latter branch is further divided into two sub-branches, each incorporating dynamic operations guided by local and contextual priors. These context priors are iteratively updated within each module, drawing inspiration from feedback mechanisms in reinforcement learning, thereby enabling the model to effectively perceive and handle multiple degradation types within a unified structure. Additionally, we introduce a multi-scale feed-forward network to further enhance both performance and computational efficiency. Extensive experiments on two composite degradation benchmarks demonstrate that our proposed network, CDIR, achieves state-of-the-art performance with significantly reduced complexity and fast inference speed. In addition, CDIR shows strong adaptability to various task-specific image restoration scenarios, such as dehazing, desnowing, and deraining. It also performs robustly on domain-specific applications such as ultra-high-definition (UHD), remote sensing, and medical image restoration, highlighting its versatility and practical applicability.

Abstract:
Recent advancements in Text-to-3D generation are significantly limited by the capabilities of current 2D vision-language models. When these models attempt to distill complex multi-object descriptions, they often produce 3D outputs that suffer from issues like 3D geometric confusion and the Janus problem. To overcome these challenges, we introduce DreamAssemble, a novel framework that views 3D scenes as compositional assemblies of multiple objects. Specifically, our framework enables the simultaneous optimization of various 3D assets using Multi-Density Neural Field for the first time, which helps maintain a consistent structure and greatly enhances the fidelity of the generated scenes. Furthermore, our method reduces the variance in the latent space during the distillation process by decomposing prompts, showing an improved ability to handle abstract textual descriptions and significantly alleviating the Janus problem commonly encountered in Text-to-3D generation. We provide comprehensive experimental results and visualizations that demonstrate the effectiveness of our proposed method, along with the corresponding theoretical analysis. This approach demonstrates significant potential for advancing the field of 3D generation. Our source code and more results are available at: https://github.com/bingozju/DreamAssemble

Abstract:
Accurate 3D reconstruction in real-world environments remains a significant challenge due to the coexistence of reflective and non-reflective surfaces, which pose distinct modeling demands. Existing methods often treat these surface types separately, limiting their generalizability and physical plausibility. To bridge this gap, we propose MicroSDF, a novel neural implicit framework that facilitates geometry and reflectance modeling through microfacet theory. Our approach incorporates three core innovations: 1) a microfacet-guided geometry model that extracts multi-scale surface normals (macroscopic and microfacet) from a signed distance field (SDF), regularized by a proposed microfacet normal consistency loss to enforce physically plausible surface orientations; 2) an enhanced dual-branch color model, where the specular branch leverages the microfacet normals to model high-frequency reflectance, and the vanilla branch, unlike prior works, uses reflection direction (instead of viewing direction) to better model diffuse and low-frequency specular components; and 3) a detection-guided color blending strategy that adaptively fuses the color outputs based on reflection priors, providing more physically intuitive blending than implicitly learned blending weights. Combined with a tailored multi-stage optimization scheme, the proposed MicroSDF achieves robust and high-fidelity reconstruction across reflective and non-reflective surfaces. Extensive experiments on DTU, Shiny Blender, Ref-NeRF, and DeepVoxels datasets demonstrate state-of-the-art performance, establishing a new direction for physically grounded neural reconstruction.

Abstract:
Text-Based Person Retrieval (TBPR), which is a pivotal technology in the intelligent surveillance field, is aimed at retrieving target pedestrians based on free-form textual descriptions. While the existing methods attempt to align cross-modal features via multigranular interactions, their performance remains fundamentally limited by two core challenges: cross-modal semantic inconsistency and cross-modal semantic discriminability. To address these issues, we propose DSEE (Diversity Semantic Embedding Expansion), a novel framework for semantically enhanced representation learning. Unlike approaches that rely on constructing larger or more detailed datasets, DSEE establishes identity-centric cross-modal consistency through contrastive learning and generative synergy. The framework consists of two key modules: Bidirectional-guided Semantic Modeling (BSM) and Generative-driven Semantic Enhancement (GSE) modules. The BSM module constructs novel semantic embeddings by modeling similarity-based interactions between the image and text modalities. Specifically, it emphasizes identity-level similarity to guide the generation of enriched, discriminative semantic representations, thereby enhancing their semantic expressiveness and cross-modal alignment. The GSE module provides enriched semantic diversity through a generative text augmentation scheme based on visual inputs, while refining the semantic precision of the method via a dual-path attention mechanism that performs both intramodal refinement and cross-modal alignment. Extensive experiments demonstrate that DSEE achieves state-of-the-art performance on major benchmarks across diverse scenarios. Our work provides an effective paradigm for advancing TBPR applications in real-world settings.

Abstract:
Recently, tensor network (TN) decompositions have gained prominence in computer vision and have produced promising results in tensor recovery, owing to their ability to represent high-order tensors compactly and efficiently. However, current TN topologies are being developed towards increasingly intricate structures in pursuit of incremental improvements, resulting in a drastically increased number of TN ranks, which requires laborious hyper-parameter selection, especially for higher-order cases. In this paper, we propose a novel TN decomposition, dubbed tensor wheel (TW) decomposition, in which a high-order tensor is represented by a set of latent factors mapped into a specific wheel topology. The decomposition is constructed by first analyzing the graph structure, with the aim of characterizing the complex interactions within the data more accurately while maintaining a smaller number of hyper-parameters, thereby theoretically alleviating the above deficiencies. A comprehensive analysis of its mathematical properties demonstrates that TW decomposition has greater representation capability and is more flexible in controlling both parameter storage and computational cost. To compute the TW-format decomposition, sequential singular value decomposition (SVD)-based and alternating least squares (ALS)-based learning algorithms are developed. Furthermore, to investigate the validity of TW decomposition, we apply it to one numerical application, i.e., tensor completion (TC), and develop an efficient proximal alternating minimization-based solving algorithm with guaranteed convergence. Experimental results on both synthetic and real-world data reveal that TW decomposition significantly outperforms other state-of-the-art tensor decompositions for incomplete-tensor inference, especially when only a few observations are available, thus substantiating the superiority and reliability of TW decomposition.

Abstract:
Few-shot medical image segmentation (FSMIS) has attracted increasing attention as a promising technique for solving medical image segmentation tasks by relying on only a small amount of labeled data from new classes. Current FSMIS methods typically employ pixel-level semantic correlations between support-query image pairs to guide the segmentation of query images. However, the class information gap between support and query images may induce severe mismatches, leading to semantic ambiguity between foreground and background pixels. To address this issue, we propose a novel mask-guided proxy mining network (MPMNet), which mines a set of representative reference features (termed proxies) from support and query images to rectify foreground-background ambiguity. Specifically, to eliminate false pairwise matches caused by excessive intra-class variations, we design a mask-guided proxy mining module to adaptively learn representative proxies that can perceive visual differences between objects with different scales and shapes. Moreover, we integrate a hierarchical prior generation module and a context-aware feature enrichment module into MPMNet to obtain multi-scale information and enhance the discriminability of features. With these well-designed components and structures, our MPMNet can effectively overcome the adverse effects of false pixel matches by establishing proxy-level semantic correlations. Extensive experiments on three standard medical segmentation benchmarks demonstrate that our MPMNet significantly outperforms previous state-of-the-art methods, with a mean gain of 2.71% in DSC across all datasets. The code is available at: https://github.com/donglongzi/MPMNet

Abstract:
Transformers have been very successful in various computer vision tasks, and understanding their working mechanism is important. As touchstones, weakly-supervised semantic segmentation (WSSS) and class activation map (CAM) generation are useful tasks for analyzing vision transformers (ViT). Based on the plain ViT pre-trained with ImageNet classification, we find that multi-layer, multi-head self-attention maps can provide rich and diverse information for WSSS and CAM generation, e.g., different attention heads of ViT focus on different image areas and object categories. We therefore propose a novel method that estimates the importance of attention heads end-to-end, so that the self-attention maps are adaptively fused into high-quality CAM results that tend to cover objects more completely. In addition, we propose a ViT-based gradient clipping decoder for efficient and effective online retraining with the CAM results. The gradient clipping decoder also makes good use of the knowledge in large-scale pre-trained ViT and scales well. The proposed plain Transformer-based Weakly-supervised learning method (WeakTr) obtains superior WSSS performance on standard benchmarks, i.e., 78.5% mIoU on the val set of PASCAL VOC 2012 and 51.1% mIoU on the val set of COCO 2014. Source code and checkpoints are available at https://github.com/hustvl/WeakTr

Abstract:
Controllable image structure editing has attracted increasing attention. While recent interactive point-based methods are convenient and realistic, they often lack fine-grained control over localized content. Partial sketches provide a simple yet expressive interface for local structure manipulation. However, existing partial-sketch-based manipulation methods relying on generative adversarial networks (GANs) suffer from limited generalization and fidelity. Moreover, although diffusion-based adapters excel at global conditioning (e.g., edge maps), localized editing with partial strokes remains challenging due to two key issues: effectively injecting sparse stroke conditions during denoising and preserving non-edited regions to avoid unintended changes. To address these challenges, we propose DiffStroke, a mask-free framework for localized image manipulation with partial sketches. We introduce trainable Image-Stroke Fusion (ISF) blocks to fuse source images and strokes at the feature level, enabling precise local shape control while maintaining appearance consistency. We further develop a self-supervised mask estimator to protect irrelevant regions without manual input. Specifically, we leverage Tweedie’s formula to estimate a clean latent image from noisy latents, blend the denoised result with the source, and train the mask estimator by minimizing the error between the blended latent and the target latent. Experiments on natural and facial images demonstrate that DiffStroke outperforms state-of-the-art methods on both simple and complex stroke-based editing tasks. DiffStroke can also be combined with text prompts to produce diverse and creative results. Code is available at https://github.com/CMACH508/DiffStroke
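The mask-estimator training signal described above (estimate a clean latent, blend it with the source, compare with the target) can be sketched numerically. The sketch assumes the common DDPM noise parameterisation, under which Tweedie's formula reduces to `x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t)`. The function names, tensor shapes, and the use of the true noise in place of a network prediction are illustrative assumptions, not DiffStroke's actual code.

```python
import numpy as np

def estimate_clean_latent(noisy_latent, predicted_noise, alpha_bar_t):
    """Posterior-mean estimate of the clean latent from a noisy latent
    (Tweedie's formula under the DDPM parameterisation)."""
    return (noisy_latent - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)

def blend_with_source(clean_estimate, source_latent, mask):
    """Keep the edited estimate inside the mask and the source outside it."""
    return mask * clean_estimate + (1.0 - mask) * source_latent

rng = np.random.default_rng(0)
target_latent = rng.normal(size=(4, 8, 8))
source_latent = rng.normal(size=(4, 8, 8))
noise = rng.normal(size=(4, 8, 8))
alpha_bar_t = 0.6  # cumulative noise-schedule product at step t (assumed value)

# Forward diffusion of the target latent.
noisy_latent = np.sqrt(alpha_bar_t) * target_latent + np.sqrt(1.0 - alpha_bar_t) * noise

# With a perfect noise prediction, the estimate recovers the target exactly.
clean_estimate = estimate_clean_latent(noisy_latent, noise, alpha_bar_t)

# A hypothetical soft mask; in training this would come from the mask estimator.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

blended = blend_with_source(clean_estimate, source_latent, mask)
mask_loss = float(np.mean((blended - target_latent) ** 2))
```

Minimising `mask_loss` with respect to the mask pushes the mask to cover exactly those regions where the source and the target differ, which is what makes the supervision self-contained.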

Abstract:
The rapid progress of generative models has made detecting realistic forgeries a critical challenge for security and trust. Existing image and frequency-based methods depend on dataset-specific artifacts with poor generalization, while Vision-Language Model (VLM)-based methods remain limited by coarse prompts and underused cross-modal alignment. To address these issues, we propose a Fine-grained Text-driven Generative Image Detection (FTGID) framework, which enables comprehensive detection through multi-modal cues. First, we design a Layer-wise Adaptive Global Extractor (LAGE) that stabilizes multi-level global representations through adaptive CLS token fusion with lightweight calibration and parameter-efficient tuning. Second, we propose a Fine-grained Text-guided Local Enhancer (FTLE) that performs patch-level text–visual interaction to enhance the localization of forgery-relevant regions. Third, we introduce a High-frequency Artifact Feature Extractor (HAFE) that adaptively captures discriminative high-frequency cues, enabling more reliable detection of subtle generative artifacts. Extensive experiments demonstrate that FTGID consistently outperforms state-of-the-art GID methods across diverse generative models and unseen datasets, achieving superior performance, thereby enhancing both robustness and interpretability in open-world generative image detection. Our codes will be made publicly available after the peer review process.

Abstract:
Recently, the block-term decomposition with rank-(L_r, L_r, 1) (termed as LL1 decomposition), which decomposes a third-order tensor into the sum of the outer products between vector and matrix factors, has received increasing attention for high-dimensional image reconstruction. However, the fixed low-rank matrix decomposition in LL1 is restricted to third-order tensors, which hinders its development for higher-order tensor data (i.e., order N > 3). To address this, we propose a Block Customized Topology Term Decomposition (BCTD), which represents an Nth-order tensor as a sum of outer products of basis vectors and customized (N-1)th-order coefficient tensors with flexible internal topological structures. The proposed BCTD enjoys two advantages: Firstly, it allows tackling higher-order tensors beyond the third-order tensor setting of LL1, which can better preserve the high-dimensional structure of the tensor. Secondly, it allows each term to have a customized topological structure beyond the fixed topological structure (i.e., low-rank matrix decomposition) in LL1, which can better explore the intrinsic high-dimensional low-rank structures of the tensor. To evaluate the performance of the proposed BCTD, we build the corresponding high-dimensional image reconstruction model and provide a theoretical generalization error bound between the recovered tensor of the proposed model and the underlying tensor. To solve the resulting optimization problem, we apply a proximal alternating minimization (PAM)-based algorithm with a theoretical convergence guarantee. Extensive experimental results on high-dimensional image completion and compression tasks using real-world datasets (color videos and light field images) demonstrate the superiority of the proposed model over other baseline models.
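For readers unfamiliar with the baseline, the third-order LL1 model that BCTD generalises can be written in a few lines. The sketch below reconstructs a tensor as a sum of outer products between rank-L matrices `A_r @ B_r.T` and vectors `c_r`. The shapes, names, and the shared rank `L` across terms are illustrative assumptions, and the sketch does not implement BCTD itself.

```python
import numpy as np

def ll1_reconstruct(A_blocks, B_blocks, C):
    """Reconstruct a third-order tensor from LL1 factors.

    A_blocks[r]: (I, L) and B_blocks[r]: (J, L) give the rank-L matrix of term r;
    C[:, r]: (K,) is the vector of term r.
    """
    I, J, K = A_blocks[0].shape[0], B_blocks[0].shape[0], C.shape[0]
    tensor = np.zeros((I, J, K))
    for r, (A_r, B_r) in enumerate(zip(A_blocks, B_blocks)):
        # Outer product of the low-rank matrix with the mode-3 vector.
        tensor += np.einsum('ij,k->ijk', A_r @ B_r.T, C[:, r])
    return tensor

rng = np.random.default_rng(0)
I, J, K, R, L = 6, 5, 4, 3, 2
A_blocks = [rng.normal(size=(I, L)) for _ in range(R)]
B_blocks = [rng.normal(size=(J, L)) for _ in range(R)]
C = rng.normal(size=(K, R))
tensor = ll1_reconstruct(A_blocks, B_blocks, C)
```

In BCTD, as described above, the matrix `A_r @ B_r.T` would be replaced by an (N-1)th-order coefficient tensor with its own internal topology, which is what lifts the model beyond third-order data.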

Abstract:
Images captured under real-world nighttime haze conditions often suffer from severe degradations, including low visibility, color distortion, and reduced contrast, which not only impair visual perception but also degrade the performance of vision-based tasks. However, existing dehazing methods are mainly designed for daytime scenarios and struggle to cope with the complex illumination and scattering characteristics of nighttime hazy images. In this paper, we propose a novel Bayesian-based variational framework with fractional-order constraints for real-world nighttime image dehazing. First, a simplified physical model is constructed to characterize nighttime hazy images, accounting for haze, low-light conditions, Poisson noise, and glow degradations. An anisotropic pre-processing strategy is iteratively applied in the Lab color space to remove glow effects. Subsequently, illumination and reflectance estimation within our constructed physical model is formulated as a maximum a-posteriori (MAP) problem, which is then approximated as a unified variational optimization function. To impose prior constraints, two fractional-order terms are introduced as priors to regulate the illumination and reflectance, promoting piecewise smoothness in illumination and preserving sharp edges and fine textures in reflectance. The resulting variational model is efficiently solved using the alternating direction minimization method. Finally, the estimated illumination and reflectance are enhanced via spatial-domain gamma correction for brightness adjustment and frequency-domain processing for texture detail enhancement. Extensive experiments on real-world datasets demonstrate that the proposed framework outperforms state-of-the-art dehazing methods in both qualitative and quantitative evaluations. Besides, our algorithm generalizes effectively to both other degraded scenes and high-level vision tasks.
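The final brightness-adjustment step can be illustrated with a small sketch. It assumes a Retinex-style model in which the image is the element-wise product of reflectance and illumination, and it applies gamma correction to the illumination only. The function names and the value `gamma=2.2` are assumptions, and the frequency-domain texture enhancement is not covered.

```python
import numpy as np

def gamma_correct(illumination, gamma=2.2):
    """Brighten an illumination map in [0, 1]; gamma > 1 lifts dark regions."""
    return np.clip(illumination, 0.0, 1.0) ** (1.0 / gamma)

def recompose(reflectance, illumination, gamma=2.2):
    """Recombine reflectance with the gamma-corrected illumination."""
    return np.clip(reflectance * gamma_correct(illumination, gamma), 0.0, 1.0)

# Toy 2x2 example (assumed values): a dark illumination map.
illumination = np.array([[0.04, 0.25],
                         [0.50, 1.00]])
reflectance = np.full((2, 2), 0.8)
enhanced = recompose(reflectance, illumination, gamma=2.0)
```

Because the exponent `1 / gamma` is below 1, dark illumination values are raised much more than bright ones, while 0 and 1 are left unchanged.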

Abstract:
Unsupervised reconstruction networks have shown promise for unified vision anomaly detection, i.e., image-level anomaly classification and pixel-level anomaly segmentation, where a single model trained on multi-class normal images can detect various anomalies. This is more challenging than most existing separate methods, i.e., one model for one class, as it requires handling a more complex data distribution. Notably, pure reconstruction networks often suffer from overfitting due to “identity shortcut”, where both normal and anomaly images may be well recovered and thus fail in detecting anomalies. Recent efforts have focused on developing specific modules for different network architectures, e.g., Convolutions and Transformers. However, it is still unclear how to essentially and effectively prevent learning from this shortcut in a simpler and more general manner. Furthermore, most existing methods consider anomaly detection solely as unsupervised classification, resulting in inaccurate anomaly segmentation due to “weak discrimination”, where normal and anomaly features may be entangled. To address these challenges, we propose a simple yet general Dual-masked and Discriminative Reconstruction (D2Rec) for unified vision anomaly detection. First, we propose a general dual-masked reconstruction, i.e., using a pair of complementary masks, resolving the “identity shortcut” so that all masked positions are reconstructed by unmasked original features. Second, we propose a self-supervised discriminator, which refines reconstruction errors with synthesized anomaly images to enhance the discrimination ability between normal and abnormal features. The dual-masked reconstruction and self-supervised discriminator can serve as universal plugins, easily integrated into reconstruction-based anomaly detection methods of any architecture. Despite its simplicity, D2Rec outperforms previous methods on three industrial benchmarks (MVTec, BTAD, and VisA), and three medical datasets (Brain MRI, Liver CT and Retinal OCT). The code for D2Rec is available at https://github.com/gaobb/D2Rec
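The effect of reconstructing with a pair of complementary masks can be shown with a toy example. The sketch below is an assumption-laden illustration rather than D2Rec's architecture: it uses a checkerboard mask, two forward passes, and a hypothetical `reconstruct` callable that stands in for the reconstruction network.

```python
import numpy as np

def checkerboard_mask(height, width):
    """Binary mask whose complement is the shifted checkerboard."""
    rows, cols = np.indices((height, width))
    return ((rows + cols) % 2).astype(float)

def dual_masked_reconstruct(features, mask, reconstruct):
    """Reconstruct every position only from positions hidden from it.

    Pass A hides positions where mask == 1; pass B hides the complement.
    Each output is kept only at the positions hidden in its own pass.
    """
    out_a = reconstruct(features * (1.0 - mask))
    out_b = reconstruct(features * mask)
    return mask * out_a + (1.0 - mask) * out_b

def neighbour_mean(x):
    """Toy reconstructor: average of the four neighbours (with wrap-around)."""
    return (np.roll(x, 1, 0) + np.roll(x, -1, 0) + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 4.0

mask = checkerboard_mask(4, 4)
features = np.full((4, 4), 3.0)

# An identity "network" cannot copy its input: hidden positions come out as zero.
shortcut_output = dual_masked_reconstruct(features, mask, lambda z: z)

# A reconstructor that uses visible neighbours recovers the normal pattern.
context_output = dual_masked_reconstruct(features, mask, neighbour_mean)
```

Since the identity mapping yields an all-zero output here, a network trained this way gains nothing from copying its input, which is the intuition behind resolving the identity shortcut.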

Abstract:
Multi-view learning aims to integrate multi-source information for a comprehensive data representation, which has gained widespread attention in image processing. Each view contains view-specific noise and joint features associated with other views, and thus exploring the specificity and consistency among views is a typical solution to deal with multi-view data for learning discriminative representations. In this paper, we present a theory-induced model, termed Adversarial Distribution Alignment Network (ADAN), which learns view-invariant features and alleviates the negative impact of view-specific noise. Inspired by the theory of view generalization, we first demonstrate the necessity of suppressing view-specific noise and capturing view-invariant features, and then derive two collaborative modules: a feature disentangler and an adversarial alignment module. In detail, the feature disentangler separates view-specific noise and view-invariant features by minimizing the mutual information between them. Following this, a negative entropy is proposed to suppress the negative impact of view-specific noise. Meanwhile, the adversarial module uses an adversarial technique, which can fit complex data drawn from different distributions, to adaptively align cross-view features so that features encoded from different views converge. Extensive experiments on multi-view datasets demonstrate that ADAN achieves better performance than other competitive methods. Code is available at https://github.com/huangsuj/ADANet

Abstract:
Semantic communication targets reliable task execution at the receiver under stringent bandwidth and channel constraints. However, existing communication paradigms either focus on bit-level signal reconstruction, impeding the balance between task efficacy and bandwidth efficiency, or are limited by fixed vocabularies and lack generalization when facing unknown categories and open scenarios. To this end, we propose Universal Semantic Communication (UniSC), an open-vocabulary semantic communication framework that formulates transmission as a Matchable Semantic Subspace Transmission (MSST) problem. In this work, “universal” refers to the ability to handle arbitrary text-defined semantic categories beyond fixed vocabularies, rather than universality across all vision tasks. The transmitted representation is explicitly constrained to preserve cross-modal matchability after noisy transmission, rather than merely supporting latent recovery or closed-set inference. Concretely, UniSC comprises a Visual Semantic Engine (VSE), a Semantic Squeeze Network (SSN), a Noise-Adaptive Semantic Re-expansion (NASR) module, and a VLM-based Decoder. VSE and SSN project images into a compact semantic subspace for transmission. This subspace is optimized to preserve both robustness and cross-modal matchability under channel corruption. NASR denoises and lifts the received features back into a semantically complete visual space, from which the VLM-based Decoder performs open-category inference by matching arbitrary text queries rather than relying on a fixed classifier head. The VLM-based Decoder employs a Text Semantic Engine (TSE) to map natural language to text embeddings and, via a learnable Text-Visual Bridge (TVB), aligns them with the reconstructed visual structure for cross-modal matching. To improve cross-modal alignment and transmission robustness, a two-stage training strategy first establishes cross-modal anchors and then optimizes end-to-end robustness and compactness. 
Extensive experiments on semantic segmentation benchmarks demonstrate that UniSC achieves strong generalization and state-of-the-art performance under harsh channel conditions, outperforming existing methods in both low-SNR and extreme-compression regimes.

Affiliations: Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, School of Computer Science, and the Academy of Frontier Interdisciplinary Research, Central China Normal University, Wuhan, China; School of Electrical and Information Engineering, Tianjin University, Tianjin, China; West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; School of Computer Science and Artificial Intelligence, Hubei University of Technology, Wuhan, China; College of Physics and Information Engineering, Fuzhou University, Fuzhou, China

Abstract:
Although numerous semi-supervised learning methods have been elaborately designed for hyperspectral image (HSI) classification, most existing semi-supervised learning paradigms still rely on a closed-set assumption. These methods implicitly assume that the category spaces of labeled and unlabeled samples are completely aligned, that is, all unlabeled samples must belong to a pre-defined known category set. However, the closed-set assumption is particularly problematic in practical remote sensing scenarios because partial unlabeled data inevitably belong to unknown categories. To address this challenge, this paper proposes a reconstruction-contrast coupling learning (ReCo2L) method for open-set semi-supervised HSI classification, fully leveraging the complementarity between masked feature reconstruction learning and contrastive learning to enhance the encoder’s local detail sensitivity and global discriminative ability. Specifically, we first apply a masked feature reconstruction learning with an adaptive masking strategy to enhance the encoder’s ability to capture local details by high-quality spectral-spatial feature reconstruction. Then, we employ contrastive learning to strengthen the encoder’s capability to extract global characteristics by pulling semantically similar samples closer and pushing dissimilar ones farther apart in the feature space. Finally, a pixel-prototype deviation loss is proposed to further improve both inter-category distinguishability and intra-category compactness by reducing the distances between labeled sample features and their corresponding class anchors. Extensive experiments on three benchmark datasets demonstrate that our proposed ReCo2L achieves superior classification performance in both known and unknown categories and significantly surpasses 10 state-of-the-art HSI classification methods. The code will be available at https://github.com/repository-AI-chen/ReCo2L

Affiliations: College of Information Science and Engineering, Institute of Interdisciplinary Studies, Key Laboratory of Educational Informatization and Intelligence of Higher Education Institutions in Hunan Province, Hunan Normal University, Changsha, China; School of Artificial Intelligence and Robotics and the National Engineering Laboratory for Robot Visual Perception and Control Technology, Hunan University, Changsha, China; Sogang University, Seoul, Republic of Korea; School of Educational Science, Institute of Interdisciplinary Studies, Key Laboratory of Educational Informatization and Intelligence of Higher Education Institutions in Hunan Province, Hunan Normal University, Changsha, China

Abstract:
Unsupervised visible-infrared person re-identification (VI-ReID) is challenging due to the significant modality gap between visible and infrared images. Most existing methods rely on one-hot clustering pseudo-labels as supervision signals, which often fail to capture the full semantic relationships among samples and are highly susceptible to noise. To address these limitations, we propose a Semantic-aware Multimodal Collaborative Learning (SAMCL) framework for unsupervised VI-ReID. Specifically, a Modality-aware Semantic Fusion (MSF) module is designed to bridge the inter-modality gap by integrating complementary semantic details from both visible and infrared modalities, generating enriched cross-modal supervision signals, for cross-modal collaborative learning. Meanwhile, we present a Dynamic Contrastive Learning (DCL) module to refine intra-modality feature learning by dynamically aligning samples with their neighboring centroids in the feature space, improving clustering reliability and intra-modality feature discrimination. By combining the two modules, SAMCL harnesses multimodal collaboration, minimizes dependence on noisy pseudo-labels, and provides a robust approach to unsupervised VI-ReID. Extensive experiments demonstrate the superiority of our proposed method. For instance, on the SYSU-MM01 dataset, our model achieves a Rank-1 accuracy of 68.68% in the All Search setting, surpassing the state-of-the-art (SOTA) by 3.48%. On the RegDB dataset, it achieves a Rank-1 accuracy of 94.47% in the Visible-to-Infrared setting, outperforming the SOTA by 3.57%. On the LLCM dataset, it achieves a Rank-1 accuracy of 50.6% in the Visible-to-Infrared setting, outperforming the SOTA by 3.7%. The code is available at https://github.com/luoshixi123/SAMCL

Abstract:
Composed Image Retrieval (CIR) represents a challenging retrieval task that targets locating specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle the above obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which is capable of disentangling attribute features based on multimodal primitive features. Secondly, we propose a Unified Prototype-based Composition module, which can construct cross-modal unified prototypes (CUP) and facilitate multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which can mine pairwise and neighbor relations based on attribute similarity. Compared to traditional CIR methods that model neighbor relations, COMBINER represents the first study addressing the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at https://github.com/Lee-zixu/COMBINER

Abstract:
To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public
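To make the invertibility argument above concrete, here is a minimal sketch of a temporal wavelet split and its exact inverse. It assumes a plain two-frame Haar transform, which is a stand-in and not necessarily the Multi-Input Multi-Output transform used in TVRN; the function names and the flat-list frame representation are hypothetical.

```python
def haar_temporal_forward(frames):
    """Split an even-length list of frames into low/high temporal bands.

    Each frame is a flat list of floats. Assumption: a two-tap Haar
    wavelet, used here only to illustrate an invertible temporal split.
    """
    if len(frames) % 2:
        raise ValueError("expected an even number of frames")
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) / 2 for x, y in zip(a, b)])   # temporal average
        high.append([(x - y) / 2 for x, y in zip(a, b)])  # temporal detail
    return low, high

def haar_temporal_inverse(low, high):
    """Recover the original frames from the low and high bands."""
    frames = []
    for l, h in zip(low, high):
        frames.append([x + y for x, y in zip(l, h)])
        frames.append([x - y for x, y in zip(l, h)])
    return frames
```

In this toy setting the low band plays the role of the low-frame-rate video, and the high band is the information a reconstruction module would have to estimate when it is not transmitted.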

Abstract:
As a key preprocessing technique in medical image analysis, deformable image registration has remained a research focus over the past decade. Recently, deep learning-based registration methods have become mainstream. Nevertheless, simultaneously handling large-scale deformations and accurate feature matching remains a persistent challenge. While pyramid architectures are widely employed to mitigate large-scale deformations, existing methods often exhibit an unbalanced focus. One group emphasizes iterative refinement to handle large deformations but relies on implicit, coarse feature interactions. Conversely, the other group concentrates on explicit matching techniques, but such static matching is often unreliable in regions with significant anatomical discrepancies. To bridge this gap, we propose a novel Correlation-Guided Recursive Pyramid Network (CRPNet). Unlike previous approaches, CRPNet addresses these challenges in a unified manner by embedding explicit correlation modeling directly into the recursive optimization. Specifically, we propose a Correlation-Guided Intra-layer Recursive Strategy (CGIRS), which enables the network to continuously refine matching accuracy through recursive feedback while preventing cross-scale error propagation. To facilitate this, we design a Spatial Correlation Module (SPCM) for accurate spatial correspondence and a Semantic Correlation Module (SECM) for high-level semantic alignment. Extensive experiments on three brain imaging datasets demonstrate that our method achieves state-of-the-art performance, particularly exhibiting exceptional robustness under extreme deformations, proving the efficacy of our method for deformable brain MRI registration. The code is available at https://github.com/ZhangWH0129/CRPNet

Abstract:
Hyperspectral image (HSI) data possess complex spatial structures and high-dimensional spectral information. Mamba has been applied to address the limitations of general methods in HSI classification, including restricted receptive fields and high computational complexity. However, the scan mechanism of traditional Mamba unreasonably constructs spatial distance relationships between neighboring row pixels and fails to adaptively construct the optimal scanning path based on the spectral similarity of pixels. Additionally, the characteristic of traditional Mamba scanning each channel independently overlooks the feature extraction from high-dimensional spectral information. This work proposes a Spectral State Fusion Tree Mamba (SSFTM) architecture to resolve these limitations. The Tree Scan (TS) mechanism computes cosine distances among spatial neighboring pixels and spectral channels to construct adaptive minimum spanning trees in both the spatial and spectral domains, thereby establishing reasonable spatial–spectral relationships and enabling efficient joint feature extraction. The Spectral State Fusion (SSF) mechanism applies multi-layer one-dimensional dilated convolutions along the spectral dimension to the state space vectors, enabling inter-channel interaction and promoting multi-scale spectral feature extraction. The proposed SSFTM demonstrates superior classification accuracy across multiple datasets compared to SOTA methods and exhibits acceptable computational complexity. The code is available at https://github.com/copawloroous/SSFTM
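The spatial half of the tree construction described above can be sketched as follows. This is an illustrative reading, not the SSFTM code: it assumes a 4-connected pixel grid, uses one minus cosine similarity as the edge weight, and builds the minimum spanning tree with Kruskal's algorithm. The function names are hypothetical, and the spectral-domain tree is not covered.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two spectra."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def spatial_mst(image):
    """Return MST edges over a 4-connected grid; image[r][c] is a spectrum."""
    h, w = len(image), len(image[0])
    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # right neighbor
                edges.append((cosine_distance(image[r][c], image[r][c + 1]),
                              (r, c), (r, c + 1)))
            if r + 1 < h:  # bottom neighbor
                edges.append((cosine_distance(image[r][c], image[r + 1][c]),
                              (r, c), (r + 1, c)))
    edges.sort()

    # Union-find with path halving for Kruskal's algorithm.
    parent = {(r, c): (r, c) for r in range(h) for c in range(w)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

On a grid whose rows have distinct spectra, the tree links pixels within each row first and crosses rows only once, so a scan along the tree visits spectrally similar pixels consecutively.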

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; School of Automation Science and Engineering and the School of Future Technology, South China University of Technology, Guangzhou, China; School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; School of Geospatial Artificial Intelligence, the Key Laboratory of Geographic Information Science (Ministry of Education), and the Key Laboratory of Spatial-temporal Big Data Analysis and Application of Natural Resources in Megacities (Ministry of Natural Resources), East China Normal University, Shanghai, China; College of Computing and Data Science, Nanyang Technological University, Jurong West, Singapore

Abstract:
Giving machines the ability to infer the complete 3D geometry and semantics of complex scenes is crucial for many downstream tasks, such as decision-making and planning. Vision-centric Semantic Scene Completion (SSC) has emerged as a trendy 3D perception paradigm due to its compatibility with task properties, low cost, and rich visual cues. Despite impressive results, current approaches inevitably suffer from problems such as depth errors or depth ambiguities during the 2D-to-3D transformation process. To overcome these limitations, in this paper, we first introduce an Optical Flow-Guided (OFG) DepthNet that leverages the strengths of pretrained depth estimation models, while incorporating optical flow images to improve depth prediction accuracy in regions with significant depth changes. Then, we propose a depth ambiguity-mitigated feature lifting strategy that implements deformable cross-attention in 3D pixel space to avoid depth ambiguities caused by the projection process from 3D to 2D and further enhances the effectiveness of feature updating through the utilization of prior mask indices. Moreover, we customize two subnetworks: a residual voxel network and a sparse UNet, to enhance the network’s geometric prediction capabilities and ensure consistent semantic reasoning across varying scales. By doing so, our method achieves performance improvements over state-of-the-art methods on the SemanticKITTI, SSCBench-KITTI-360 and Occ3D-nuScene benchmarks.

Abstract:
Generating 3D body movements from speech shows great potential in extensive downstream applications, yet imitating realistic human movements remains challenging. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. In this paper, we argue that, for this ill-posed problem, these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e., head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human nature from realistic distributions while incorporating non-periodic ones from current latent states for instance-level diversity. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our proposed approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.

Abstract:
This paper presents a robust, decoupled approach to camera distortion correction using a rational function model (RFM), designed to address challenges in accuracy and flexibility within precision-critical applications. Camera distortion is a pervasive issue in fields such as medical imaging, robotics, and 3D reconstruction, where high fidelity and geometric accuracy are crucial. Traditional distortion correction methods rely on radial-symmetry-based models, which have limited precision under tangential distortion and require nonlinear optimization. In contrast, general models do not rely on radial symmetry geometry and are theoretically generalizable to various sources of distortion. There exists a gap between the theoretical precision advantage of the Rational Function Model (RFM) and its practical applicability in real-world scenarios. This gap arises from uncertainties regarding the model’s robustness to noise, the impact of sparse sample distributions, and its generalizability out of the training sample range. In this paper, we provide a mathematical interpretation of how RFM is suitable for the distortion correction problem through sensitivity analysis. The precision and robustness of RFM are evaluated through synthetic and real-world experiments, considering distortion level, noise level, and sample distribution. Moreover, a practical and accurate decoupled distortion correction method is proposed using just a single captured image of a chessboard pattern. The correction performance is compared with the current state-of-the-art using camera calibration, and experimental results indicate that more precise distortion correction can enhance the overall accuracy of camera calibration. 
In summary, this decoupled RFM-based distortion correction approach provides a flexible, high-precision solution for applications requiring minimal calibration steps and reliable geometric accuracy, establishing a foundation for distortion-free imaging and simplified camera models in precision-driven computer vision tasks.

Abstract:
Underwater salient object detection (USOD) faces two major challenges that hinder accurate detection: substantial image noise owing to water turbidity and low foreground-background contrast caused by high visual similarity. In this study, a dual-model architecture based on mutual learning is proposed to address these issues. First, DenoisedNet, which focuses on addressing water turbidity issues, is developed. Using a separation–denoising–enhancement processing framework, it suppresses noise while maintaining target feature integrity through domain separation and cleaning enhancement modules. Second, SearchNet is designed to address the foreground–background similarity issue. It achieves precise localization through pseudo-label generation and layer-by-layer search mechanisms. To enable both networks to address these challenges collaboratively, a feature-consistent mutual-learning strategy is proposed, which aligns encoded features and prediction results, via evaluation and cross modes, respectively. This strategy enables their respective strengths to be complemented and the challenges of USOD to be solved more comprehensively. Our DenoisedNet and SearchNet outperform the best existing methods on the USOD10K and USOD benchmarks, achieving MAE improvements of 4.52%/5.52% and 1.61%/8.94%, respectively. The source code is available at https://github.com/BeibeiIsFreshman/DSNet_CL

Abstract:
Federated Domain Generalization (FedDG) aims to train a globally generalizable model on data from decentralized, heterogeneous clients. While recent work has adapted vision-language models for FedDG using prompt learning, the prevailing “one-prompt-fits-all” paradigm struggles with sample diversity, causing a marked performance decline on personalized samples. The Mixture of Experts (MoE) architecture offers a promising solution for specialization. However, existing MoE-based prompt learning methods suffer from two key limitations: coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level pRompt mIxture with Parameter-free routing framework for FedDG. TRIP treats prompts as multiple experts and assigns individual tokens within an image to distinct experts, facilitating the capture of fine-grained visual patterns. To ensure communication efficiency, TRIP introduces a parameter-free routing mechanism based on capacity-aware clustering and Optimal Transport (OT). First, tokens are grouped into capacity-aware clusters to ensure balanced workloads. These clusters are then assigned to experts via OT, stabilized by mapping cluster centroids to static, non-learnable keys. The final instance-specific prompt is synthesized by aggregating experts, weighted by the number of tokens assigned to each. Extensive experiments across four benchmarks demonstrate that TRIP achieves the best generalization results while communicating as few as 1K parameters. Our code is available at https://github.com/GongShuai8210/TRIP
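The last routing step described above, aggregating expert prompts weighted by how many tokens each expert received, can be sketched as follows. This is a simplified illustration rather than the TRIP implementation: prompts are flat float vectors, the token-to-expert assignment is given as input instead of being produced by clustering and OT, and normalizing the counts to sum to one is an assumption.

```python
def aggregate_prompts(expert_prompts, token_assignments):
    """Mix expert prompts using token counts as weights.

    expert_prompts: list of equal-length float vectors, one per expert.
    token_assignments: list of expert indices, one per image token.
    Hypothetical helper; dividing by the total token count is an assumption.
    """
    counts = [0] * len(expert_prompts)
    for expert in token_assignments:
        counts[expert] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no tokens were assigned")
    dim = len(expert_prompts[0])
    return [
        sum(counts[k] / total * expert_prompts[k][d]
            for k in range(len(expert_prompts)))
        for d in range(dim)
    ]
```

Since the weights come from counts alone, this step introduces no learnable router parameters, which is consistent with the communication-efficiency argument in the abstract.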

Abstract:
Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Task Incremental Learning (MTIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously seen tasks and unseen tasks, memory-constrained MTIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning (PEFT) techniques, such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MTIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) strategy enhances adaptation to new tasks while mitigating forgetting by adaptively assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. The source code is available at https://github.com/FerdinandZJU/IAP

Abstract:
Incremental Few-shot Semantic Segmentation (iFSS) aims to learn novel classes with limited samples while preserving segmentation capability for base classes, addressing the challenge of continual learning of novel classes and catastrophic forgetting of previously seen classes. Existing methods mainly rely on techniques such as knowledge distillation and background learning, which, while partially effective, still suffer from issues such as feature drift and limited generalization to real-world novel classes, primarily due to a bidirectional coupling bottleneck between the learning of base classes and novel classes. To address these challenges, we propose, for the first time, a diffusion-based generative framework for iFSS. Specifically, we bridge the gap between generative and discriminative tasks through an innovative binary-to-RGB mask mapping mechanism, enabling pre-trained diffusion models to focus on target regions via class-specific semantic embedding optimization while sharpening foreground-background contrast with color embeddings. A lightweight post-processor then refines the generated images into high-quality binary masks. Crucially, by leveraging diffusion priors, our framework avoids complex training strategies. The optimization of class-specific semantic embeddings decouples the embedding spaces of base and novel classes, inherently preventing feature drift, mitigating catastrophic forgetting, and enabling rapid novel-class adaptation. Experimental results show that our method achieves state-of-the-art performance on the PASCAL-5^i and COCO-20^i datasets using much less data than other methods, and exhibits competitive results in cross-domain few-shot segmentation tasks. Project page: https://ifss-diff.github.io/

Abstract:
Scene Graph Generation (SGG) is a challenging cross-modal task, which aims to identify entities and relationships in a scene simultaneously. Due to the highly skewed long-tailed distribution, the generated scene graphs are dominated by relation categories of head samples. Current works address this problem by designing re-balancing strategies at the data level or refining relation representations at the feature level. Different from them, we attribute this impact to catastrophic interference, that is, the subsequent learning of dominant relations tends to overwrite the earlier learning of rare relations. To address it at the modeling level, a Hippocampal Memory-Like Separation-Completion Collaborative Network (HMSC2) is proposed here, which imitates the hippocampal encoding and retrieval process. Inspired by the pattern separation of dentate gyrus during memory encoding, a Gradient Separation Classifier and a Prototype Separation Learning module are proposed to relieve the catastrophic interference of tail categories by modeling the separated classifier and prototypes. In addition, inspired by the pattern completion of area CA3 of the hippocampus during memory retrieval, a Prototype Completion Module is designed to supplement the incomplete information of prototypes by introducing relation representations as cues. Finally, the completed prototype and relation representations are connected within a hypersphere space by a Contrastive Connected Module. Experimental results on the Visual Genome and GQA datasets show our HMSC2 achieves state-of-the-art performance on the unbiased SGG task, effectively relieving the long-tailed problem. The source codes are released on GitHub: https://github.com/Nora-Zhang98/HMSC2

Abstract:
Accurate and efficient volumetric medical image segmentation is vital for clinical diagnosis, pre-operative planning, and disease-progression monitoring. Conventional convolutional neural networks (CNNs) struggle to capture long-range contextual information, whereas Transformer-based methods suffer from quadratic computational complexity, making it challenging to couple global modeling with high efficiency. To address these limitations, we explore an efficient yet accurate segmentation model for volumetric data. Specifically, we introduce a novel linear-complexity sequence modeling technique, RWKV, and leverage it to design a Tri-directional Spatial Enhancement RWKV (TSE-R) block. This module performs global modeling via RWKV and incorporates two optimizations tailored to three-dimensional data: 1) a spatial-shift strategy that enlarges the local receptive field and facilitates inter-block interaction, thereby alleviating the structural information loss caused by sequence serialization; and 2) a tri-directional scanning mechanism that constructs sequences along three distinct directions, applies global modeling via WKV, and fuses them with learnable weights to preserve the inherent 3D spatial structure. Building upon the TSE-R block, we develop an end-to-end 3D segmentation network, termed U-RWKV. Extensive experiments on three public 3D medical segmentation benchmarks demonstrate that U-RWKV outperforms state-of-the-art CNN-, Transformer-, and Mamba-based counterparts, achieving a Dice score of 87.21% on the Synapse multi-organ abdominal dataset while reducing parameter count by a factor of 16.08 compared with leading methods.

Abstract:
The geometry-based point cloud compression algorithm achieves efficient compression and transmission for LiDAR point clouds with high sparsity. However, the low-bitrate mode results in severe geometry compression artifacts, which involve both point reduction and coordinate offset. To the best of our knowledge, this is the first attempt to directly enhance the geometry quality of compressed LiDAR point clouds (CLGE) in a post-processing manner. Our proposed method consists of two branches: cylindrical densification and adaptive refinement. The former adopts a multi-scale sparse convolution framework to effectively extract spatial features in the cylindrical coordinate system and generate dense candidate points quickly. Large asymmetric sparse convolution kernels are also designed to capture the shapes of different regions and objects. The latter branch refines the candidate points through several MLP layers, which take the neighborhood features between the candidate points and the input points into account. Finally, the designed ring-based farthest point resampling serves as an effective alternative for achieving the target number of points while maintaining the geometry distribution. Extensive experiments conducted on several datasets verify the effectiveness of our approach under different compression artifact levels. Furthermore, our method is easily extended to upsampling and is robust to noise. In addition to improving the geometry signal quality, the point cloud enhanced by our proposed method alleviates the performance degradation in the object detection task caused by compression distortion.
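For readers unfamiliar with the resampling step mentioned above, the following is a sketch of plain greedy farthest point sampling, which reduces a dense candidate set to a target number of points while keeping them spread out. It does not reproduce the ring-based variant proposed in the paper, whose details are not given in the abstract; the function name and the choice of the first point are assumptions.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are mutually far apart.

    points: list of (x, y, z) tuples. Plain FPS, shown as a baseline;
    the paper's ring-based grouping is not modeled here.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # Squared distance from every point to its nearest chosen point.
    min_d2 = [dist2(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_d2.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d2 = dist2(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return chosen
```

Each iteration costs one pass over all points, so the sketch runs in O(k·n) time; a per-ring scheme would presumably apply a similar selection within each LiDAR ring, but that is speculation on our part.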

Abstract:
Existing mosaic-based snapshot hyperspectral imaging systems struggle to capture high-resolution (HR) hyperspectral images (HSIs), limiting their application. Fusing a low-resolution (LR) mosaiced image with an HR panchromatic (PAN) image serves as a feasible solution to obtain the HR HSI. Therefore, we propose a dual-sensor HSI imaging system, combining a 4×4 spectral filter array (SFA) mosaiced image sensor with a co-aligned PAN image sensor to provide complementary spatial-spectral information. To reconstruct the HR HSI, we propose an unsupervised equivariant imaging (EI)-based training framework with a learnable degradation function, overcoming the inaccessibility of the ground truth and the spectral response function (SRF). Specifically, we formulate the degradation process as a combination of 8×8 mosaicing and 2×2 average downsampling for the LR mosaiced image, while modeling the PAN image as a linear projection of the HR HSI using the SRF. Since the parameters of the SRF are inaccessible, we make them learnable to obtain an accurate estimate. By enforcing transformation equivariance between the input-output pair of the fusion network, the proposed framework ensures that the reconstructed HSI preserves spatial-spectral consistency without relying on paired supervision. Furthermore, we instantiate the proposed HSI imaging system and collect a real-world dataset of 60 paired mosaiced/PAN images. The mosaiced image exhibits 16 spectral bands ranging from 722 to 896 nm and 1020×1104 spatial pixels, while the PAN image exhibits 2040×2208 spatial pixels. Comprehensive experiments demonstrate that the proposed method exhibits high spatial consistency and spectral fidelity while maintaining computational efficiency.
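Two building blocks of the degradation model described above can be sketched in a few lines: the PAN image as a per-pixel linear projection of the HR HSI through SRF weights, and 2×2 average downsampling. This is only an illustrative reading; the nested-list layout `hsi[band][row][col]`, the function names, and the fixed SRF weights are assumptions (in the paper the SRF parameters are learned), and the mosaicing step is omitted.

```python
def pan_from_hsi(hsi, srf):
    """Project an HSI cube to a PAN image with one SRF weight per band.

    hsi[b][r][c] is the value of band b at pixel (r, c); srf[b] is the
    weight of band b. Hypothetical helper with fixed, non-learned weights.
    """
    bands, h, w = len(hsi), len(hsi[0]), len(hsi[0][0])
    if len(srf) != bands:
        raise ValueError("need one SRF weight per band")
    return [[sum(srf[b] * hsi[b][r][c] for b in range(bands))
             for c in range(w)] for r in range(h)]

def average_downsample_2x2(img):
    """Average non-overlapping 2x2 blocks of a single-channel image."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1]
              + img[r + 1][c] + img[r + 1][c + 1]) / 4
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]
```

Both operations are linear in the HSI, which is what makes it plausible to treat the SRF weights as learnable parameters of the degradation function.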

Abstract:
Few-shot class incremental learning (FSCIL) aims to continuously learn new classes from limited training samples while retaining previously acquired knowledge. Existing approaches are not fully capable of balancing stability and plasticity in dynamic scenarios. To overcome this limitation, we introduce a novel FSCIL framework that leverages graph neural networks (GNNs) to model interdependencies between different categories and enhance cross-modal alignment. Our framework incorporates three key components: 1) a Graph Isomorphism Network (GIN) to propagate contextual relationships among prompts; 2) a Hamiltonian Graph Network with Energy Conservation (HGN-EC) to stabilize training dynamics via energy conservation constraints; and 3) an Adversarially Constrained Graph Autoencoder (ACGA) to enforce latent space consistency. By integrating these components with a parameter-efficient CLIP backbone, our method dynamically adapts graph structures to model semantic correlations between textual and visual modalities. Additionally, contrastive learning with energy-based regularization is employed to mitigate catastrophic forgetting and improve generalization. Comprehensive experiments on benchmark datasets validate the framework’s incremental accuracy and stability compared to state-of-the-art baselines. This work advances FSCIL by unifying graph-based relational reasoning with physics-inspired optimization, offering a scalable and interpretable framework. Code is available at: https://github.com/aries-yqian/ACHG-CLIP

Abstract:
Enhancing the resolution of scene text images is a critical preprocessing step that can substantially improve the accuracy of downstream text recognition in low-quality images. Existing methods primarily rely on auxiliary text features to guide the super-resolution process. However, these features often lack rich low-level information, making them insufficient for faithfully reconstructing both the global structure and fine-grained details of text. Moreover, previous methods often learn suboptimal feature representations from the original low-quality landmark images, which cannot provide precise guidance for super-resolution. In this study, we propose a Fine-Grained Feedback Domain-Complementary Network (FDNet) for scene text image super-resolution. Specifically, we first employ a fine-grained feedback mechanism to selectively refine landmark images, thereby enhancing feature representations. Then, we introduce a novel domain-trace prior interaction generator, which integrates domain-specific traces with a text prior to comprehensively complement the clear edges and structural coverage of the text. Finally, motivated by the limitations of existing datasets, which often exhibit limited scene scales and insufficient challenging scenarios, we introduce a new dataset, MDRText. The proposed dataset, MDRText, features multi-scale and diverse characteristics and is designed to support challenging text image recognition and super-resolution tasks. Extensive experiments on the MDRText and TextZoom datasets demonstrate that our method achieves superior performance in scene text image super-resolution and further improves the accuracy of subsequent recognition tasks.

Abstract:
Efficient image super-resolution (SR) models are essential for achieving high-quality image reconstruction with reduced computational complexity, particularly in resource-constrained environments. In this paper, we introduce a novel self-attention mechanism, Broadcast-Gated Attention with Identity Adaptive Integration (BGAI). Then, based on this mechanism, we design a lightweight super-resolution network that achieves state-of-the-art performance with minimal computational cost. By observing the sparsity and convergence properties of self-attention, BGAI optimizes computational resource utilization through the effective broadcasting of meaningful features across attention heads and network layers. A key innovation in BGAI is the Broadcast-Gated Multi-head Self-Attention (BGMSA) mechanism, which employs a dedicated head to capture and integrate long-range dependencies, broadcasting this broader contextual information to local attention heads. This design enhances long-range interaction modeling while minimizing redundant computations. Additionally, the Identity Attention Adaptive Integration (IAAI) mechanism facilitates efficient feature propagation by leveraging the continuity in dependencies across layers, with a focus on dynamic variations to improve representational efficiency and accelerate convergence. Comprehensive experiments on standard benchmarks demonstrate that BGAI achieves high-fidelity super-resolution while reducing the number of parameters and FLOPs by up to 35% compared with existing lightweight methods. These results establish BGAI as a robust and scalable solution for resource-efficient SR, with significant potential for deployment in real-world, high-resolution image processing applications. The code and trained models are publicly available at https://github.com/bbbolt/BGAI

Abstract:
Image dehazing, a crucial task in low-level vision, supports numerous practical applications, such as autonomous driving, remote sensing, and surveillance. This paper proposes IHDCP, a novel Inverted Haze Density Correction Prior for efficient single image dehazing. It is observed that the medium transmission can be effectively modeled from the inverted haze density map using correction functions with various gamma coefficients. Based on this observation, a pixel-wise gamma correction coefficient is introduced to formulate the transmission as a function of the inverted haze density map. To estimate the transmission, IHDCP is first incorporated into the classic atmospheric scattering model (ASM), leading to a transcendental equation that is subsequently simplified to a quadratic form with a single unknown parameter using the Taylor expansion. Then, boundary constraints are designed to estimate this model parameter, and the gamma correction coefficient map is derived via the Vieta theorem. Finally, the haze-free result is recovered through ASM inversion. Experimental results on diverse synthetic and real-world datasets verify that our algorithm not only provides visually appealing dehazing performance with high computational efficiency, but also outperforms several state-of-the-art dehazing approaches in both subjective and objective evaluations. Moreover, our IHDCP generalizes well to various types of degraded scenes. Our code is available at https://github.com/TaoLi-TL/IHDCP.
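
The final recovery step can be illustrated with the standard atmospheric scattering model, I = J·t + A·(1 − t), inverted for J. The sketch below is a minimal illustration, not the released IHDCP code: it assumes the transmission is obtained by raising the inverted haze density map to a gamma power, and the function names and the lower bound `t_min` are hypothetical choices.

```python
import numpy as np

def transmission_from_inverted_density(inv_density, gamma):
    """Gamma-correct an inverted haze density map in [0, 1].

    `gamma` may be a scalar or a per-pixel map (assumed form of the prior).
    """
    return np.clip(inv_density, 0.0, 1.0) ** gamma

def invert_asm(hazy, transmission, airlight, t_min=0.1):
    """Recover J from I = J * t + A * (1 - t).

    hazy: (H, W, 3) image in [0, 1]; transmission: (H, W); airlight: (3,).
    t_min avoids division by very small transmission values.
    """
    t = np.maximum(transmission, t_min)[..., None]
    return np.clip((hazy - airlight) / t + airlight, 0.0, 1.0)
```

How the gamma map and the airlight are estimated is the substance of the paper and is not reproduced here.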

Abstract:
Recent advancements in industrial anomaly detection (AD) have demonstrated that incorporating a small number of anomalous samples during training can significantly enhance accuracy. However, this improvement often comes at the cost of extensive annotation efforts, which are impractical for many real-world applications. In this paper, we introduce a novel framework, “Weakly-supervised RESidual Transformer” (WeakREST), designed to achieve high anomaly detection accuracy while minimizing the reliance on manual annotations. First, we reformulate the pixel-wise anomaly localization task into a block-wise classification problem. Second, we introduce a residual-based feature representation called “Positional Fast Anomaly Residuals” (PosFAR) which captures anomalous patterns more effectively. To leverage this feature, we adapt the Swin Transformer for enhanced anomaly detection and localization. Additionally, we propose a weak annotation approach utilizing bounding boxes and image tags to define anomalous regions. This approach establishes a semi-supervised learning context that reduces the dependency on precise pixel-level labels. To further improve the learning process, we develop a novel ResMixMatch algorithm, capable of handling the interplay between weak labels and residual-based representations. On the benchmark dataset MVTec-AD, our method achieves an Average Precision (AP) of 83.0%, surpassing the previous best result of 82.7% in the unsupervised setting. In the supervised AD setting, WeakREST attains an AP of 87.6%, outperforming the previous best of 86.0%. Notably, even when using weaker annotations such as bounding boxes, WeakREST exceeds the performance of leading methods relying on pixel-wise supervision, achieving an AP of 87.1% compared to the prior best of 86.0% on MVTec-AD. This superior performance is consistently replicated across other well-established AD datasets, including MVTec 3D, KSDD2 and Real-IAD.
Code is available at: https://github.com/BeJane/Semi_REST
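
The reformulation of pixel-wise localization as block-wise classification can be pictured with a small label-conversion sketch. This is an assumption-laden illustration rather than the WeakREST code: the block size, the `min_fraction` threshold, and the function name `block_labels` are all hypothetical.

```python
import numpy as np

def block_labels(mask, block=8, min_fraction=0.0):
    """Turn a binary pixel mask (H, W) into per-block anomaly labels.

    A block is labeled anomalous when the fraction of anomalous pixels
    inside it exceeds min_fraction.
    """
    H, W = mask.shape
    assert H % block == 0 and W % block == 0
    blocks = mask.reshape(H // block, block, W // block, block).astype(float)
    fraction = blocks.mean(axis=(1, 3))   # anomalous-pixel ratio per block
    return fraction > min_fraction
```

A bounding-box annotation can be rasterized into such a mask first, which is one way weak labels could feed the same block-wise target.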

Abstract:
Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue, because most existing methods focus on extracting instance-level or pixel-to-pixel representations and ignore the characteristics shared by similar pixel groups within an image. Moreover, when generating contrastive pairs, most SOTA methods rely on manually set thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pair generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering that superpixel clustering aligns well with the concept of contrastive pair generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate that SuperCL outperforms 12 existing methods. Specifically, SuperCL yields more precise predictions in the visualization figures and achieves DSC scores 3.15%, 5.44%, and 7.89% higher than the previous best results on MMWHS, CHAOS, and Spleen, respectively, with 10% annotations. Our code is released at https://github.com/stevezs315/SuperCL

Abstract:
Traditional 3D scene understanding methods heavily depend on 3D annotation and training, which allow for the identification of seen classes but struggle to recognize unseen classes. In this paper, we leverage the open vocabulary inference capabilities of pre-trained models, enabling the encoding of open vocabulary concepts. However, unlike existing open vocabulary 3D scene understanding methods, we propose a framework based on semantic probability. This innovation significantly reduces computational cost and is compatible with state-of-the-art two-stage 2D pre-trained models. Specifically, we align the text features from the CLIP model with the pixel features from the 2D pre-trained models, inferring semantic probability of image pixels based on similarity and projecting it onto 3D points. Subsequently, we introduce a point cloud pairs semantic fusion method to merge the point clouds, reducing the semantic probability of erroneous 3D points. Based on probability scores, we achieve 3D semantic segmentation on open vocabularies without any supervision or training. In addition, the semantic probability of 3D points can serve as pseudo-labels for 3D distillation, and the geometric features of the 3D scene can be exploited to improve the segmentation performance. Experimental results demonstrate that the proposed method exhibits competitive performance on publicly available benchmark datasets, including ScanNet, Matterport3D, and nuScenes.
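
The step of inferring per-pixel semantic probability from feature similarity can be sketched as below. This is a minimal illustration under stated assumptions: cosine similarity followed by a temperature-scaled softmax is a common choice but is not confirmed by the abstract, and the function name and `temperature` value are hypothetical.

```python
import numpy as np

def semantic_probability(pixel_feats, text_feats, temperature=0.01):
    """Per-pixel class probabilities from feature similarity.

    pixel_feats: (N, D) pixel embeddings; text_feats: (K, D) class text
    embeddings. Returns an (N, K) array whose rows sum to 1.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    logits = p @ t.T / temperature               # scaled cosine similarity
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Each pixel's probability vector would then be carried to the 3D point it projects to; the projection and the pairwise fusion are not shown.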

Abstract:
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer’s powerful global context capability. However, despite considerable efforts in previous works, the pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios and real-time applications, such as mobile devices. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as those in the spatial and channel domains. Subsequently, we propose the Convolutional Additive Token Mixer (CATM), which employs underlying spatial and channel attention as novel interaction forms. This module eliminates troublesome complex operations such as matrix multiplication and Softmax. We introduce a Convolutional Additive Self-attention (CAS) block hybrid architecture and utilize CATM in each block. Furthermore, we build a family of lightweight networks, which can be easily extended to various downstream tasks. Finally, we evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our M and T models achieve 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K. Meanwhile, throughput evaluations on GPUs, ONNX, and iPhones also demonstrate superior results compared to other state-of-the-art backbones. Extensive experiments demonstrate that our approach achieves a better balance among performance, inference efficiency, and ease of deployment. Our code and model are available at: https://github.com/Tianfang-Zhang/CAS-ViT

Abstract:
Existing All-in-One image restoration methods often fail to perceive degradation types and severity levels simultaneously, overlooking the importance of fine-grained quality perception. Moreover, these methods often utilize highly customized backbones, which hinder their adaptability and integration into more advanced restoration networks. To address these limitations, we propose Perceive-IR, a novel backbone-agnostic All-in-One image restoration framework designed for fine-grained quality control across various degradation types and severity levels. Its modular structure allows core components to function independently of specific backbones, enabling seamless integration into advanced restoration models without significant modifications. Specifically, Perceive-IR operates in two key stages: 1) multi-level quality-driven prompt learning stage, where a fine-grained quality perceiver is meticulously trained to discern three-tier quality levels by optimizing the alignment between prompts and images within the CLIP perception space. This stage ensures a nuanced understanding of image quality, laying the groundwork for subsequent restoration; 2) restoration stage, where the quality perceiver is seamlessly integrated with a difficulty-adaptive perceptual loss, forming a quality-aware learning strategy. This strategy not only dynamically differentiates sample learning difficulty but also achieves fine-grained quality control by driving the restored image toward the ground truth while pulling it away from both low- and medium-quality samples. Furthermore, Perceive-IR incorporates a Semantic Guidance Module (SGM) and Compact Feature Extraction (CFE). The SGM leverages semantic information from pre-trained vision models to provide high-level contextual guidance, while the CFE focuses on extracting degradation-specific features, ensuring accurate handling of diverse image degradations. 
Extensive experiments demonstrate that Perceive-IR not only surpasses state-of-the-art methods but also generalizes reliably to zero-shot real-world and unknown degraded scenes, while adapting seamlessly to different backbone networks. This versatility underscores the framework’s robustness and backbone-agnostic design. Project page at https://house-yuyu.github.io/Perceive-IR/.

Abstract:
Reasoning segmentation (RS) interprets implicit textual instructions to accurately segment target regions. This reasoning capability transforms ambiguous non-expert queries into precise pixel-level masks, thereby enabling downstream tasks like area measurement and density analysis with a level of precision unattainable by detection methods. However, existing RS models are not tailored for agriculture and lack domain-specific knowledge, which poses challenges in handling similar pest appearances and small target scales. To bridge this gap, we introduce a fine-grained pest RS task with two subtasks: Pest Discriminative Referring Expression Segmentation (PDRES) and Pest Exclusion Reasoning Segmentation (PERS). Based on this, we propose PestScope, which integrates vision, language, and reasoning for fine-grained pest segmentation. To tackle the exclusion of small non-target pests, we introduce a dedicated [NON] token alongside the standard [SEG] token for target pests. This guides the model to prioritize small target pests and suppress non-target background regions. To further address pest similarity, we propose an Exclusivity Suppression Loss, applying differentiated supervision to [SEG] and [NON] tokens to better separate target and non-target pests. Additionally, we develop an automated dataset construction pipeline to address the scarcity of fine-grained, difficulty-controllable pest RS datasets. It produces 45k and 27.6k image-text-mask samples for the PDRES and PERS tasks, respectively, covering 18 pest categories. Experiments show that in small and similar pest scenarios, integrating PestScope into mainstream models improves average gIoU by 4.28% on PDRES and 6.49% on PERS. For unseen pest categories, gIoU increases by 21.72% and 8.66%, respectively, demonstrating strong generalization. Code and datasets will be available at: https://github.com/aluodaydayup/PestScope

Abstract:
Out-of-distribution (OOD) detection plays a crucial role as a mechanism for handling anomalies in computer vision systems. Among existing approaches, outlier exposure (OE), which trains the model with an additional auxiliary OOD dataset, has demonstrated strong effectiveness. However, acquiring clean and well-curated auxiliary OOD data is often infeasible, particularly within large and complex systems. Alternatively, wild outliers, i.e., unlabeled samples collected directly in deployment environments, are abundant and easy to obtain, and recent studies have shown that they can substantially benefit OOD detection learning. Nevertheless, wild outliers typically contain a mixture of in-distribution (ID) and OOD samples. Directly using them as auxiliary OOD data unavoidably exposes the model to adverse supervision signals arising from the contained ID samples. Yet existing methods still lack an effective strategy that can fully leverage wild outliers while suppressing the negative influence introduced by their ID subset. To this end, we propose a simple yet effective method named Clustering for Wild Outlier Exposure (C-WOE), which alleviates the adverse effect of the ID samples contained within wild outliers by reweighting them. Specifically, C-WOE assigns higher weights to real OOD samples and lower weights to ID samples and dynamically updates these weights during training. Theoretically, we establish solid guarantees for the proposed method. Empirically, extensive experiments conducted on various real-world benchmarks and simulated datasets demonstrate that C-WOE notably achieves superior performance compared with state-of-the-art methods, validating its reliability in image processing applications.
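
The reweighting idea can be illustrated with a weighted version of the usual outlier exposure term, which pushes predictions on auxiliary samples toward the uniform distribution. The sketch below is not the C-WOE implementation: the weights are taken as given inputs (the clustering procedure that produces them is not described in the abstract), and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax for an (N, K) array of logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def weighted_oe_loss(outlier_logits, weights):
    """Weighted outlier exposure loss on wild samples.

    Each sample contributes the cross-entropy between the uniform
    distribution and its predicted distribution, scaled by its weight.
    Samples believed to be ID should receive weights near zero.
    """
    per_sample = -log_softmax(outlier_logits).mean(axis=1)
    return float((weights * per_sample).sum() / weights.sum())
```

With a weight near zero, a confidently classified ID sample hidden among the wild outliers contributes almost nothing to this term, which is the effect the abstract describes.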

Abstract:
Deep unrolling networks have rapidly gained popularity in image reconstruction by integrating data-driven networks with iterative model-driven reconstruction algorithms. Technically, existing unrolling networks can easily break down and produce sub-optimal results due to inadequate iterations and limited receptive fields. Another challenge is that mainstream algorithms are established in isolation from the physical system and confined to the digital realm. This paper proposes a novel implicit unrolling Transformer architecture, dubbed TranIU-Net, that extracts local contents and non-local dependencies to assist iterative learning, and forms an indicative imaging mechanism to guide system design. Concretely, TranIU-Net unrolls the proximal gradient algorithm into a trainable network with structural interpretability. Using only constant memory cost, the implicit mapping is analytically built to guarantee convergence through the fixed point at unlimited depth. To consider the intrinsic correlation and sparsity in reconstructed images, an embedded Transformer module is developed to capture multi-scale information with hybrid receptive fields and assign self-aware granularity with a learnt significance estimator, making it an efficient backbone for the implicit unrolling network. Additionally, with its adaptive and flexible architecture, TranIU-Net explores a new imaging mechanism by indicating structure design and measurement conditions, bridging the gap between algorithm and imaging system to improve reconstruction quality. Extensive numerical simulations and practical experiments of electrical tomography reconstruction demonstrate that the proposed TranIU-Net outperforms state-of-the-art alternatives in different scenarios from both quantitative and qualitative perspectives.
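
For readers unfamiliar with the algorithm being unrolled, the sketch below shows the classical proximal gradient iteration for a sparse linear inverse problem (ISTA). It is background only, not TranIU-Net: in an unrolled network the hand-crafted soft-thresholding step would be replaced by a learned module, and the L1 prior, `lam`, and the function names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by ISTA.

    Each iteration is a gradient step on the data term followed by the
    proximal step on the prior; unrolling turns each iteration into a layer.
    """
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)       # gradient of the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

An implicit formulation solves for the fixed point of this update directly instead of stacking a fixed number of layers, which is what allows constant memory at unlimited depth.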

Abstract:
The limitations of seismic vertical resolution pose significant challenges for the identification of thin beds. Improving the vertical resolution of seismic data using deep learning methods often encounters challenges related to unrealistic outputs and limited generalization. To address these challenges, we propose a novel framework that improves the fidelity and generalization of seismic super-resolution. Our approach begins with the generation of realistic synthetic training data that aligns with the structural and amplitude characteristics of field surveys. We then introduce an enhanced 2D network with 3D awareness, which builds on the 2D Swin-Transformer and 3D convolution blocks to effectively capture 3D spatial features while maintaining computational efficiency. This network addresses the limitations of traditional 2D approaches by reducing stitching artifacts and improving spatial consistency. Finally, we develop a prior-informed fine-tuning strategy using field data without the need for labels, which incorporates a self-supervised data consistency loss and a spectral matching loss based on prior knowledge. This strategy ensures that the super-resolution results preserve the original low frequency information while yielding a spectral distribution as expected. Experiments on multiple field datasets demonstrate the robustness and generalization capability of our method, making it a practical solution for seismic resolution enhancement in diverse field datasets.

Abstract:
Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and incorporating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.

Abstract:
Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods have been proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.

Abstract:
Generating dynamic scenes from images has gained increasing attention. Existing methods have two major limitations: 1) they can hardly handle sparse images, which exhibit limited geometry constraints and insufficient motion; 2) they struggle to maintain spatial-temporal consistency when rendering multi-view videos. To address these limitations, we propose SCSV, a spatial-temporal consistent dynamic scene generation method from sparse views. Our method consists of two stages: scene reconstruction and scene expansion, both of which decouple background and foreground. In the scene reconstruction stage, we first interpolate a set of images between the input images based on a video generation model, followed by the optimization of the scene Gaussians from the interpolated and input images. To improve the spatial-temporal consistency of the reconstructed scene, we propose an uncertainty-aware Gaussian training approach, which introduces adaptive weights for images and pixels. In the scene expansion stage, for the background, we render novel views and refine them with a geometry-aware diffusion process. These refined images are then used to incrementally add Gaussians. For the foreground, we generate human motion according to previous motion, enabling temporally coherent motion generation. To further enhance physical plausibility, we integrate the expanded foreground into the background using a gravity-aware alignment. Experiments on the NeuMan, Bonn, and EMDB datasets demonstrate that our SCSV achieves superior performance compared to state-of-the-art methods. The code will be released upon acceptance.

Abstract:
Single-domain Generalized Object Detection (Single-DGOD) is recently proposed, aiming to transfer a detector to multiple unknown domains never seen during training. For this task, the challenge mainly lies in how to utilize the single data distribution from the source domain to generalize across multiple unknown domains with diverse data distributions. Accordingly, the challenge could be addressed by expanding the data distribution of the source domain. In this paper, we propose feature recombination from a frequency perspective to generate a series of recombined features that exhibit diversity in style and rich variation in content features. Specifically, we propose a new method, Fourier-KAN Feature Recombination, which utilizes the Fast Fourier Transform (FFT) to decompose features into amplitude and phase components. Then we apply the Kolmogorov-Arnold theorem to further decompose these components into linear combinations of multiple base distributions. Finally, through multi-level recombination, we generate a series of recombined features with diverse distributions, effectively emulating deep cross-domain variations in feature levels and strengthening the model’s generalization ability to unknown domains. Our method demonstrates strong adaptability to both two-stage and single-stage detection frameworks. Experimental results show that on the Diverse Weather and Real-to-Art benchmarks, our approach not only achieves outstanding detection accuracy but also significantly enhances the model’s generalization ability, all while maintaining excellent real-time performance. Our code is available at https://github.com/2490o/Fourier-KAN
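
The frequency-domain decomposition step can be sketched as follows. This is only the FFT part: the Kolmogorov-Arnold based decomposition into base distributions is not reproduced, and a simple linear interpolation of amplitudes is swapped in as a stand-in for recombination. The function names and the mixing weight `alpha` are hypothetical.

```python
import numpy as np

def decompose(feat):
    """Split a feature map into FFT amplitude and phase (last two axes)."""
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return np.abs(spec), np.angle(spec)

def recombine(amplitude, phase):
    """Rebuild a real-valued feature map from amplitude and phase."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spec, axes=(-2, -1)).real

def mix_amplitude(feat_a, feat_b, alpha=0.5):
    """Keep the phase of feat_a and blend its amplitude with feat_b's.

    Stand-in for the paper's recombination: amplitude is often associated
    with style and phase with content, so this perturbs style only.
    """
    amp_a, pha_a = decompose(feat_a)
    amp_b, _ = decompose(feat_b)
    return recombine((1 - alpha) * amp_a + alpha * amp_b, pha_a)
```

Setting `alpha` to 0 returns the original feature map, which gives a quick sanity check that the decomposition is lossless.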

Abstract:
Zero-shot anomaly detection (ZSAD) is a challenging task that aims to detect anomalies in images without any prior knowledge of the anomaly classes. This task is especially difficult because anomalies are rare, diverse, and often manifest differently across domains, making it hard for models to generalize when training data is scarce or unavailable. Recently, vision-language models (VLMs), such as CLIP, have shown great potential in ZSAD, but they often struggle to adapt to unseen domains due to the lack of domain-aware knowledge. To address these challenges, we propose the Domain Adaptation CLIP (DACLIP), a novel approach that adapts domain-aware knowledge to the VLM. Specifically, DACLIP leverages a Domain-Aware Knowledge Adaptation (DAKA) strategy to enhance CLIP for ZSAD across different domains. The DAKA strategy comprises multiple experts that specialize in target domains, enabling the model to dynamically select and combine specialized experts tailored to anomaly characteristics, thus improving its ability to generalize and detect a wide range of anomalies. Furthermore, we introduce learnable domain-aware prompts that are jointly learned by and injected into both the CLIP encoders (visual and text) and the DAKA modules. This dual-pathway learning enables the model to capture domain-specific features at multiple levels of the architecture, allowing for more effective adaptation to new domains and anomaly types. We evaluate our approach on several benchmark datasets spanning industrial and medical domains. Extensive experiments demonstrate that DACLIP consistently outperforms state-of-the-art methods in ZSAD, achieving significant improvements in both image-level and pixel-level anomaly detection tasks.

Abstract:
360° depth estimation is a challenging research problem due to the difficulty of finding a representation that both preserves global continuity and avoids distortion in spherical images. Existing methods attempt to leverage complementary information from multiple projections, but struggle with balancing global and local consistency. Their local patch features have limited global perception, and the combined global representation does not address discrepancies in feature extraction at the boundaries between patches. To address these issues, we propose Cross360, a novel cross-attention-based architecture integrating local and global information using less-distorted tangent patches along with equirectangular features. Our Cross Projection Feature Alignment module employs cross-attention to align local tangent projection features with the equirectangular projection’s 360° field of view, ensuring each tangent projection patch is aware of the global context. Additionally, our Progressive Feature Aggregation with Attention module refines multi-scaled features progressively, enhancing depth estimation accuracy. Cross360 significantly outperforms existing methods across most benchmark datasets, especially those in which the entire 360° image is available, demonstrating its effectiveness in accurate and globally consistent depth estimation. The code and model are available at https://github.com/huangkun101230/Cross360

Abstract:
The effective fusion of multi-modal remote sensing images, particularly hyperspectral imagery (HSI) and light detection and ranging (LiDAR) data, is pivotal for accurate land use and land cover (LULC) classification. However, this process is hindered by two inherent challenges: pervasive data redundancy and the underutilization of cross-modal complementarity, largely due to the lack of a unifying theoretical framework. To address these limitations, we propose the multi-modal complementary information bottleneck (MCIB) framework, which extends the IB principle to learn compact, sufficient, and complementary representations for multi-modal scenes. From a theoretical perspective, we formalize the MCIB objective and introduce structured priors to derive tractable information-theoretic bounds, providing a principled and computationally feasible approach to reduce redundancy and enhance complementarity simultaneously. Building on the obtained theoretical insights, we design an end-to-end variational optimization strategy with a novel supervised conditional InfoNCE (SCInfoNCE). Efficiently reusing existing model components, this new supervised contrastive method optimizes the conditional mutual information terms crucial for synergy. Extensive experiments on benchmark HSI-LiDAR datasets demonstrate superior classification performance of MCIB. This work not only fills a theoretical gap in multi-modal representation learning, but offers a robust and principled solution for LULC classification using complex heterogeneous remote sensing images.
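For context, the two standard results that this abstract builds on can be written as follows. This is the classical information bottleneck Lagrangian and the usual InfoNCE lower bound on mutual information, not the MCIB objective or the SCInfoNCE loss themselves, which the abstract does not spell out. The symbols $X$ (input), $Y$ (label), $Z$ (representation), $\beta$ (trade-off weight), and $N$ (number of samples in a contrastive batch) are notation chosen for this sketch.

```latex
% Classical information bottleneck: compress X into Z while keeping
% the information about Y.
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)

% InfoNCE lower bound on mutual information, where \mathcal{L}_{\mathrm{NCE}}
% is the contrastive loss over one positive and N - 1 negatives.
I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}}
```

The first expression shows why redundancy reduction (small $I(X;Z)$) and sufficiency (large $I(Z;Y)$) pull against each other; the second shows why a contrastive loss is a tractable surrogate for a mutual information term.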

Abstract:
To efficiently assist humans in various tasks, it is crucial to accurately decode and understand the rich information embedded in the brain’s visual cognition. Existing brain-driven research often fails to overcome the challenge of small target data domains, and the lack of explicit semantic, spatial, and other information constraints on feature extractors prevents brain decoding models from learning uniform cross-domain representations, which degrades their performance in unseen domains. To overcome these limitations, we propose DAMind, a multimodal EEG-based model for robust visual cross-domain alignment and decoding. Our approach integrates a VLM with brain-inspired cognitive mechanisms, leveraging its strong image-text representation abilities to learn both fine-grained primary visual features and high-level semantic concepts from neural signals, and provides effective visual fine-tuning through a visual guidance mechanism. DAMind introduces a stepwise EEG encoding process aligned with visual processing and employs an instruction-based learning strategy for effective cross-domain zero-shot transfer. Its robust architecture efficiently achieves good generalization performance, enabling the mapping of EEG signals from multiple domains to a unified learning domain. We construct a comprehensive EEG decoding benchmark, EBench. DAMind achieves state-of-the-art results on several visual tasks and outperforms the baseline in the zero-shot setting.

Abstract:
Existing semi-supervised semantic segmentation (SSS) methods fail to explore the potential of depth information in unlabeled data, as they suffer from 1) inter-class depth similarity, and 2) intra-class depth discrepancy. To address these challenges, this paper proposes DepMatch, a simple yet effective approach that leverages depth difference knowledge to guide consistency learning. Specifically, a Class-wise Depth Disparity Perception (CDDP) module is designed to exploit depth difference information, driven by class prediction priors, facilitating robust feature learning. A depth-feature discrepancy set is first constructed, and reliable pixel pairs are then selected for inter-class depth disparity knowledge distillation. Simultaneously, exponential normalization is applied to the intra-class depth disparity to suppress large outlier variations, and an entropy-based adaptive weight is derived to prioritize feature learning in high-entropy areas. Moreover, we propose the Uncertain Logit Disparity Regulation (ULDR) module, which leverages the depth variations at class boundaries to promote the mutual regulation of uncertain pixel logit information, enhancing the model’s spatial understanding. Experiments on five public benchmarks show that DepMatch can be seamlessly incorporated as a plug-and-play module into popular SSS frameworks, achieving significant performance improvements across various visual encoders. The source code and models are made available at https://github.com/NUST-Machine-Intelligence-Laboratory/DepMatch

Abstract:
Noisy Correspondence (NC), caused by mismatched pairs in multimedia datasets, poses major challenges for cross-modal retrieval, especially under high noise levels. Existing solutions often suffer from substantial performance degradation as noise levels increase. To address this issue, we propose Pseudo-Text guided Robust Learning (PTRL), a novel framework designed to identify noisy pairs and enhance model robustness. Specifically, PTRL leverages pseudo-text as explicit supervision signals and introduces a new data division criterion to accurately distinguish between clean and noisy pairs. Instead of discarding or directly using noisy data, PTRL proposes a pseudo-text replacement strategy to maintain semantic consistency of the training set, thereby facilitating more reliable learning. In addition, pseudo-text-image pairs serve as a form of data augmentation, enriching data diversity and improving model generalization. To further stabilize training and mitigate overfitting, PTRL incorporates a robust InfoNCE loss that is particularly effective in the presence of noise. Extensive experiments demonstrate that PTRL achieves state-of-the-art performance and robustness, with RSum improvements of +60.1% on Flickr30K and +22.6% on MS-COCO at an 80% noise level, significantly outperforming existing methods. The datasets and source code are available at https://github.com/shidan0122/PTRL.git.
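Since the headline result is reported in RSum, a short sketch of how that metric is commonly computed may help: RSum is the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions. The sketch below assumes a simplified one-to-one pairing (query i matches item i), whereas Flickr30K and MS-COCO pair each image with several captions; the function names are hypothetical and not taken from the PTRL code.

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@k in percent for a square similarity matrix.

    sim[i, j] is the similarity between query i and item j, and the
    ground-truth match of query i is assumed to be item i.
    """
    # Rank of the true item = number of items scoring strictly higher.
    ranks = (sim > np.diag(sim)[:, None]).sum(axis=1)
    return 100.0 * np.mean(ranks < k)

def rsum(sim, ks=(1, 5, 10)):
    # Rows as queries (e.g. image-to-text), then columns as queries.
    i2t = sum(recall_at_k(sim, k) for k in ks)
    t2i = sum(recall_at_k(sim.T, k) for k in ks)
    return i2t + t2i
```

With three cutoffs in two directions, the metric is bounded above by 600.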

Abstract:
Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most previous deep learning-based MAS methods struggle with long-distance modeling. Recently, the Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks the ability to perceive fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM), for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts, which are fed into SAM’s decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. We will make our code publicly available upon acceptance.

Abstract:
The visual quality of point clouds is critical for perception-centric immersive media. Point Cloud Quality Assessment (PCQA) is crucial for reducing costs associated with human evaluation, optimizing compression pipeline and enhancing human visual perception. However, real-valued PCQA methods often struggle to capture the coupled geometric and perceptual cues that govern quality. Com-PCQA, a novel no-reference PCQA framework leveraging complex-valued feature learning, is proposed. First, a Hilbert dual-stream module transforms multi-modal inputs of point clouds and images into analytic signals in the complex domain, enabling joint modeling of global structure and local texture with efficient tensor operations. Second, a complex amplitude–phase attention (CAPA) module explicitly decomposes and fuses amplitude features that describe geometric structure and phase features that capture fine-grained details, and it can be seamlessly integrated into other PCQA frameworks to enhance performance. Third, an adversarial joint scoring module integrates adversarial training with collaborative learning to calibrate multi-modal, multi-scale representations and enhance robustness. Extensive experiments on three public databases show that Com-PCQA achieves state-of-the-art correlations with subjective scores and consistently outperforms recent PCQA methods, demonstrating its effectiveness and robustness. The code will be available at https://openi.pcl.ac.cn/OpenPointCloud and https://github.com/LareinaSu/Com-PCQA

Abstract:
Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. Besides, TalkFormer employs implicit feature warping to align the reference image features with the target motion for preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.

Abstract:
Recent advances in propagation-based phase-contrast imaging, such as hierarchical imaging, have enabled the visualization of internal structures in large biological specimens and material samples. However, modulation-based techniques, which provide quantitative electron density information, face challenges when imaging larger objects due to stringent beam stability requirements and detector distortions. Extending the field of view of these methods is crucial for obtaining comparable quantitative results across beamlines and adapting to the smaller beam profiles of fourth-generation synchrotron sources. We introduce a novel image processing technique combining an eigenflat optimization with deformable image registration to address these challenges and enable quantitative high-resolution scans of centimeter-sized objects with multiple-micrometer resolution. We demonstrate the potential of the method by obtaining an electron density map of a rat brain sample 15 mm in diameter despite the limited horizontal field of view of 6 mm of the beamline. This showcases the technique’s ability to significantly widen the range of applications of modulation-based techniques in both biological and materials science research.

Abstract:
Conventional end-to-end learning-based point cloud compression requires training multiple models to adapt to different target bit rates. Moreover, the rate difference between the geometry and attribute components of point clouds is not well considered. In this paper, we propose an end-to-end Rate-Reconfigurable Deep Point Cloud Compression (RR-DPCC) with on/off-line Perceptual Bit Allocation Optimization (PBAO-ON/OFF), which achieves arbitrary bit rate control with one trained deep model and high-efficiency joint geometry and attribute coding. First, we propose the framework of the RR-DPCC using PBAO-ON/OFF, which includes Point Cloud Quality Assessment (PCQA) for perceptual quality measurement, PBAO-ON/OFF modules for bit allocation, and RR-DPCC for high-efficiency point cloud coding. Second, we propose a one-stream network of the RR-DPCC to encode the attribute and geometry of point clouds jointly. Moreover, in RR-DPCC, a bitrate reconfigurable module is proposed to encode multiple fine-grained bitrate points with one trained model, and a rate allocation module is proposed to allocate bits between geometry and attribute. Third, we propose on/off-line PBAO algorithms to maximize the perceptual quality of the reconstructed point cloud, where the bits are properly allocated based on the importance of geometry and attribute. Meanwhile, rate-distortion models ($R$-$\alpha/\beta$ and $D$-$\alpha/\beta$) are derived for high-accuracy rate control and bit allocation. Experimental results show that the proposed RR-DPCC achieves fine-grained bitrate control and allocation through a single trained model. When combined with PBAO-ON, the proposed RR-DPCC achieves average bit rate savings of 6.56% and 18.68% compared with the state-of-the-art V-PCC and Deep Joint Geometry and Attribute Compression (Deep-JGAC), respectively. When combined with PBAO-OFF, it achieves average bit rate savings of 4.90% and 15.34%, and reduces encoding/decoding time by 98.38%/22.05% and 53.75%/10.04% on average with respect to V-PCC and Deep-JGAC, respectively.

Abstract:
Facial expressions (FEs) and action units (AUs) are facial emotional representations at different levels of granularity. In the past, recognizing them has often been treated as two separate tasks. There are also some methods that use the knowledge of one to aid in recognizing the other, but currently, unified models capable of recognizing both FEs and AUs simultaneously remain rare. In this paper, we construct a unified model with strong generalization capability to jointly perform facial expression recognition (FER) and action unit detection (AUD). Considering the extremely limited training samples annotated with both FEs and AUs, we introduce a large amount of unlabeled facial data from the wild. We carefully design category-specific confidence margins and leverage the correspondences between FEs and AUs to assign credible pseudo-labels to the unlabeled facial data. Furthermore, we incorporate semantically richer textual descriptions as supervision and refine them through visual perception, leveraging the inherent correlations between AUs and between FEs and AUs to enhance their precision. Extensive experiments demonstrate the superiority of the proposed method from various perspectives, including a unified zero-shot benchmark for exploring the model’s comprehensive generalization capability to recognize facial emotional representations across multiple datasets, as well as within-domain and cross-domain evaluations after fine-tuning. The code for the proposed method is available at https://github.com/yuankaishen2001/MGFER

Abstract:
Prompt learning has emerged as an effective strategy for adapting vision-language models (VLMs): it injects learnable semantic prompts into VLMs to guide the alignment between visual and textual representations. Although existing methods have shown strong performance across various tasks, they usually focus on the representative class-level samples and overlook the atypical and hard samples in the visual feature space, which hinders the generalization of VLMs. To address this issue, we propose the concept of a dynamic boundary prototype, which highlights ambiguous samples that are far from the class centroid and is updated at each epoch. Accordingly, we propose a Distribution-Aware Prompt Learning (DAPL) framework to calibrate the distribution of the visual feature space via the definition, optimization, and updating of dynamic boundary prototypes. Firstly, we introduce Boundary-Centroid Pulling to optimize the intra-class distribution by progressively reducing the distance between boundary and centroid prototypes, thereby enhancing structural consistency within each class. Secondly, to further enhance inter-class separability, we design a distance-weighted contrastive loss that places greater emphasis on distinguishing adjacent classes, facilitating more effective fine-grained discrimination. Thirdly, we apply Low-Rank Adaptation Fine-Tuning to adapt the vision encoder through targeted modifications to its self-attention layers. Additionally, we adopt a progressive training strategy for stable optimization. DAPL is compatible with mainstream prompt learning methods such as CoOp, CoCoOp and PromptKD, and consistently improves their average performance across 11 benchmark datasets.
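The low-rank adaptation step mentioned above can be illustrated with a minimal sketch of a generic LoRA layer: the pretrained weight stays frozen and only a rank-r update, scaled by alpha / r, is learned. This shows the general technique rather than DAPL's configuration; the class name `LoRALinear` and the default `rank` and `alpha` values are assumptions for the example.

```python
import numpy as np

class LoRALinear:
    """Hypothetical linear layer with a low-rank additive update."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                      # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))        # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the sum.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` would receive gradients during fine-tuning.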

Abstract:
Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models (VLMs) in fine-grained video captioning, while mitigating several limitations inherent to Direct Preference Optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving a balance between cost and data quality. Then, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the training and explicitly preserves the model’s language capability to avoid deviation of the optimization objective. It thus obtains high-quality captions and improves training efficiency by eliminating the need for the reference model. We extensively evaluate our proposed data construction pipeline across three models: AuroraCap, LLaVA1.6-7B-Video and InternVL2-8B. Results demonstrate that our method improves performance in fine-grained video captioning significantly and consistently. Source code is available at https://github.com/longmalongma/SynPO
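For reference, the standard DPO objective that SynPO is contrasted with can be written as below. This is the well-known DPO loss, not the SynPO objective, which the abstract does not state. Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $y_w$ and $y_l$ the preferred and rejected responses for input $x$, $\beta$ a temperature, and $\sigma$ the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Every training step needs likelihoods from $\pi_{\mathrm{ref}}$ as well as $\pi_\theta$, which is the cost that a reference-free method avoids.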

Abstract:
Multi-view or stereo image compression is an essential technology in 3D-related applications. Due to the overlap between different views, exploring their correlations can help improve the compression rate. However, the computational complexity of joint encoding is a heavy burden for terminal encoders. To solve this problem, learned Distributed Image Coding (DIC), which uses the correlated view (namely the side image, SI) only at the decoder side, has gained much attention in recent years. In this work, we explore asymmetric DIC, where one view is selected as the SI and is losslessly compressed. The key problem in learned asymmetric DIC is alignment between the transmitted low-quality target image and the high-quality SI. Previous methods usually adopt patch-level alignment with the offset index obtained from the degraded (via re-encoding and decoding) SI and the decoded target image, which hinders the alignment accuracy. In this work, we propose a dual-domain alignment strategy, which includes degraded-domain and fused-domain pixel-wise offset estimation. For the degraded-domain alignment, we estimate the offset between the degraded SI feature and the degraded target image feature, which eliminates the difficulties in cross-domain matching. For the fused-domain alignment, we observe that the fusion result of the degraded target feature and the aligned side image feature implicitly contains fine-scale disparity information. Therefore, we estimate the fine-scale offset from the fusion result, which helps refine the degraded-domain offsets. We further propose a selective enhancement module to repair the mismatched regions in the aligned feature. Extensive experiments on three datasets demonstrate the superiority of our proposed method, which outperforms the second-best method by 16% in terms of average BD-rate reduction on the KITTI Stereo dataset. Our code is available at https://github.com/lixianghuitju/DIC-DDA

Abstract:
Synthetic Aperture Radar (SAR) images offer unique advantages in all-weather, all-day remote sensing, but the high acquisition costs and time-consuming annotation processes limit their widespread use. Semi-supervised domain adaptation leverages abundant annotated optical images and a small number of labeled SAR images to achieve strong performance on SAR images. However, existing semi-supervised domain adaptation object detection methods typically select labeled SAR-domain samples randomly, making it difficult to fully exploit the valuable information and distinctive features inherent in the target domain data. Moreover, there is a significant style and content gap between optical and SAR images, and previous methods have not adapted to it in a task-specific manner. To this end, this paper proposes an active style-content dual-branch domain adaptation method specifically designed for semi-supervised object detection in SAR images. The proposed approach employs a Task-aware Active Sampling (TAS) module to select the most valuable SAR samples, addressing the inefficiencies of random sampling. We also employ a dual-branch framework to address the style and content gaps between optical and SAR images. A Multi-layer Feature Alignment (MFA) module ensures style alignment by maintaining consistent feature representations across different visual styles, while a Gaussian-SAM Image Fusion (G-SIF) module integrates content from the source domain into the target domain, effectively bridging the gap between optical and SAR images. Extensive experiments on multiple ship and aircraft datasets demonstrate the exceptional generalization capabilities of our proposed model.

Abstract:
Photometric stereo is widely used to recover detailed surface normals. However, previous methods fail to balance accuracy and efficiency. Conventional photometric stereo achieves high accuracy but suffers from low efficiency due to time-multiplexing and inefficient algorithms. In contrast, multispectral photometric stereo captures images efficiently with spectral-multiplexing, but its accuracy is harmed by crosstalk. In this paper, we aim to resolve the crosstalk issue to achieve fast photometric stereo (FPS) at low cost. First, we analyze the formulation and impact of crosstalk, showing that it significantly affects normal estimation, with external factors being the primary contributors to crosstalk and internal factors the secondary ones. Subsequently, we propose the FPS framework, which pairs a fast data capture scheme that combines time- and spectral-multiplexing to introduce constraints on crosstalk from both internal and external factors with a lightweight network, FPS-Net, that removes the crosstalk caused by those factors based on the constraints available under this scheme. Finally, we build a real-world crosstalk-affected FPS dataset to evaluate the performance in handling crosstalk for normal estimation. Experimental results show the superior accuracy and efficiency of our method. The code and dataset are available at https://github.com/wxy-zju/FPS-Net

Abstract:
Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D cerebrovascular dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to efficiently encode full-resolution image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA’s superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code will be available at COMMA

Abstract:
Existing object detection methods struggle to generalize across increasingly diverse data domains while simultaneously adapting to the emergence of novel categories. To tackle this challenge, adaptive open-set object detection (AOOD) has been introduced, which employs supervised training on base categories within the source domain while enabling unsupervised adaptation to both base and novel categories in the target domain. However, existing AOOD approaches are still hindered by several limitations, including insufficient cross-domain feature representation, inter-category ambiguity in novel classes, and inherent feature bias toward the source domain. To overcome these issues, this paper proposes a category-level collaboration knowledge mining strategy designed to comprehensively exploit both inter-class and intra-class feature relationships across domains. Specifically, a clustering-based memory bank (CMB) is initially constructed to aggregate class prototype features, class auxiliary features, and intra-class disparity features, thereby embedding rich category-level knowledge into a unified memory structure. The CMB is iteratively updated through unsupervised clustering, which facilitates the modeling of intra-category relationships and enhances its capacity for cross-domain knowledge representation. Subsequently, a base-to-novel selection metric (BNSM) is designed to identify features corresponding to novel categories within the source domain by regulating the relationships between the novel categories and each base category. The selected features are then leveraged to initialize the object detector for the classification of novel categories. Finally, an adaptive feature assignment (AFA) strategy is introduced to transfer the learned category-level knowledge to the target domain, enabling the assignment of category labels to features. The memory bank is updated asynchronously with these assigned features to mitigate source domain bias. Extensive experiments conducted on diverse domain datasets demonstrate that the proposed method consistently outperforms state-of-the-art AOOD approaches, achieving performance gains of 1.1 to 5.5 mAP. Code is available at https://github.com/Jandsome/CCKM

Abstract:
Light field (LF) imaging benefits various applications due to its rich spatial and angular information. To address the technical limitation in imaging resolution, LF view reconstruction has become a research hotspot. However, relevant methods mainly focus on pixel representation modeling on the image plane and ignore the importance of scene geometry modeling. Inspired by the powerful geometry description ability of 3D Gaussian Splatting, we construct a network called LFGaussian to perform generalizable LF view reconstruction. Specifically, owing to the unique composition of cross-view Gaussian attribute deviation under the 4D LF imaging setting, we propose disparity-guided feed-forward 2D Gaussian propagation with a novel Gaussian primitive definition, implementing the Gaussian unprojection-projection operation without camera parameters. On this basis, we introduce a dual-branch workflow comprising Gaussian representation rendering and pixel representation upsampling to create features of target views at two different levels, which complement each other to jointly achieve geometric structure consistency as well as texture detail consistency across all target views. In addition, for efficient and high-quality Gaussian representation rendering, we design sub-sampling Gaussian decoding to alleviate Gaussian redundancy and leverage Gaussian splitting to allocate additional Gaussians to complex geometry regions identified by the disparity gradient. Experimental results show that the proposed LFGaussian achieves superior performance compared with state-of-the-art methods on both real-world and synthetic LF datasets, demonstrating the effectiveness of introducing Gaussian representation for LF view reconstruction. Furthermore, LFGaussian supports arbitrary-scale reconstruction, showing high flexibility in the upsampling scale factor.

Abstract:
Skeleton-based human action recognition has attracted increasing attention in recent years. However, most existing methods focus on single-person scenarios and struggle with complex behaviors in multi-person groups. In particular, they lack the capability to automatically identify and model the core person. To address these challenges, this paper proposes a star-shaped group interaction model for skeleton-based action recognition. Firstly, a character importance scoring system analyzes both individual and group aspects: it evaluates each person’s individual importance based on motion intensity and motion complexity, and assesses their significance within the group using centrality and interactivity. This process enables accurate identification of the core person in the video. Secondly, a core-star interaction graph is constructed with the core person as the center node and other individuals as peripheral nodes. The relationships among individuals are categorized into self-connections, centripetal connections, and centrifugal connections. For each type of connection, we design differentiated data augmentation strategies to fully exploit diverse action and interaction features. Finally, the structured skeleton data is fed into the star-shaped spatio-temporal graph convolutional network for efficient feature extraction and action classification. Experiments on several public benchmark datasets demonstrate that our method achieves state-of-the-art performance, with accuracies of 79.1%, 96.1%, and 93.1% on the NBA, Volleyball, and Volleyball-weak datasets, respectively.
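The core-star interaction graph can be sketched at the person level as three adjacency matrices, one per connection type. This is only an illustration of the structure described in the abstract: the function name `core_star_adjacency` and the edge convention `A[source, target]` are assumptions, and the actual model presumably operates on skeleton joints rather than on one node per person.

```python
import numpy as np

def core_star_adjacency(num_people, core):
    """Build person-level adjacency matrices for a star-shaped graph.

    Convention (an assumption): A[src, dst] = 1 means an edge src -> dst.
    """
    self_conn = np.eye(num_people, dtype=int)           # each person to itself
    centripetal = np.zeros((num_people, num_people), dtype=int)
    centrifugal = np.zeros((num_people, num_people), dtype=int)
    for p in range(num_people):
        if p != core:
            centripetal[p, core] = 1                    # peripheral -> core
            centrifugal[core, p] = 1                    # core -> peripheral
    return self_conn, centripetal, centrifugal
```

Peripheral people are never linked to each other directly, so all group interaction is routed through the core person.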

Abstract:
Corrosion semantic segmentation (CSS) is essential for early and accurate detection and positioning of corrosion in complex real-life scenarios. However, the unique characteristics of corrosion patterns, including diverse forms, blurred boundaries, and intra-class heterogeneity, pose significant challenges in CSS. To address these challenges, we propose a Prototype-based Multi-dimension Sample-Adaptive Intensity Mapping with Density Sampling network (PMSAD) for CSS. PMSAD leverages nonparametric nearest-prototype retrieval to enhance intra-class cohesion and inter-class separation, thereby handling the challenge of diverse forms. In PMSAD, prototypes are equally assigned to each class during training to mitigate class imbalance and capture intra-class variations. In addition, we design and implement three core components in PMSAD: Multi-Scale Dual Attention (MSDA), Multi-dimension Sample-adaptive Intensity Mapping (MSAIM), and Density Sampling (DS). The MSDA enhances feature discrimination, facilitating robust representation learning. The end-to-end MSAIM adaptively adjusts RGB channel intensity contrasts of the input corrosion image to enhance feature robustness, counteracting the effects of uneven natural illumination. The DS is proposed for training refinement to tackle fuzzy boundaries and internal interference between corrosion classes. It focuses on high-density, high-error regions, offering refined guidance to correct intra-cluster centers and reduce inter-cluster similarity. Extensive evaluations on real-world datasets, including coarse and relabeled fine-grained datasets, validate the superior performance and generalization ability of PMSAD, which achieves new state-of-the-art performance in precise boundary delineation and accurate corrosion classification. The code is available at: https://github.com/c1oTTpD/PMSAD
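Nonparametric nearest-prototype retrieval can be sketched as follows: each class holds the same number of prototype vectors, and a pixel is assigned to the class of its most similar prototype. This is a generic illustration under assumptions (cosine similarity, the function name `nearest_prototype_labels`, and the array shapes are not taken from the PMSAD code).

```python
import numpy as np

def nearest_prototype_labels(pixel_feats, prototypes):
    """Assign each pixel feature to the class of its closest prototype.

    pixel_feats: (n, d) features, one row per pixel
    prototypes:  (C, K, d) K prototypes for each of C classes
    """
    # L2-normalize so that dot products are cosine similarities.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=2, keepdims=True)
    sim = np.einsum("nd,ckd->nck", f, p)    # similarity to every prototype
    class_score = sim.max(axis=2)           # best-matching prototype per class
    return class_score.argmax(axis=1)
```

Keeping several prototypes per class lets one class cover visually different corrosion forms without a learned classifier head.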

Abstract:
The goal of snapshot spectral compressive imaging reconstruction is to recover the 3D hyperspectral image from a 2D measurement. However, current reconstruction methods still face significant challenges in fully leveraging degradation and image prior. Many methods estimate degradation solely from a single measurement rather than learning from the real imaging process, resulting in inaccurate prior modeling. Moreover, the high compression of the CASSI measurement leads to the loss of spectral-spatial context, and the existing priors fail to fully capture it - for instance, in complex scenarios (such as S5, S9 in Table I), the performance gap can be as high as 3 dB. To address these issues, this paper introduces a novel reconstruction method with Degradation Cue Learning and Spectral Latent Diffusion (DCL-SLD), which comprises two key components: the Degradation Cue Learning (DCL) module and the Spectral Latent Diffusion (SLD) module. In the spatial domain, the DCL module employs a pre-trained image encoder and a feature distribution transmission strategy to extract degraded information and integrate it into the feature, enabling reconstruction through learned visual context. In the spectral domain, the SLD module leverages a latent diffusion model based on spectral correlations to generate a low-rank vector representation, effectively preserving contextual relationships within the high-dimensional structure. By enhancing priors in both dimensions, the model significantly improves its ability to exploit contextual information for more accurate recovery. Extensive experimental results on both simulation and real datasets demonstrate the superior performance of DCL-SLD over state-of-the-art methods.

Abstract:
With the wide application of knowledge distillation between an ImageNet pre-trained teacher model and a learnable student model, unsupervised anomaly detection has made significant progress in the past few years. The success of this framework mainly relies on maintaining the feature discrepancy between the teacher and student models, which rests on two underlying sub-assumptions: (1) the teacher model can represent two separable distributions for the normal and abnormal patterns, while (2) the student model can only reconstruct the normal distribution. However, maintaining these ideal assumptions in practice remains challenging. In this paper, we propose a simple yet effective two-stage industrial anomaly detection framework, termed AAND, which sequentially performs Anomaly Amplification and Normality Distillation to enhance the two assumptions. In the first anomaly amplification stage, we propose a novel Residual Anomaly Amplification (RAA) module to advance the pre-trained teacher encoder with synthetic anomalies. It generates adaptive residuals to amplify anomalies while maintaining the feature integrity of the pre-trained model. It mainly comprises a Matching-guided Residual Gate and an Attribute-scaling Residual Generator, which determine the residuals’ proportion and characteristics, respectively. In the second normality distillation stage, we further employ a reverse distillation paradigm to train a student decoder, in which a novel Hard Knowledge Distillation (HKD) loss is built to better facilitate the reconstruction of normal patterns. Comprehensive experiments on the MvTecAD, VisA, and MvTec3D-RGB datasets show that our method achieves state-of-the-art performance. Our code is available at https://github.com/Hui-design/AAND
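
To make the teacher-student feature discrepancy concrete, here is a minimal sketch that scores each spatial location by one minus the cosine similarity between the teacher and student feature vectors. Cosine distance is a common choice in distillation-based anomaly detection, but it is an assumption here: the abstract does not state which discrepancy measure AAND uses, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_discrepancy(teacher: Sequence[float], student: Sequence[float]) -> float:
    """Return 1 - cosine similarity; larger values suggest an anomaly."""
    dot = sum(t * s for t, s in zip(teacher, student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    if norm_t == 0.0 or norm_s == 0.0:
        return 1.0  # treat a degenerate feature as maximally discrepant
    return 1.0 - dot / (norm_t * norm_s)


def anomaly_scores(teacher_feats: List[Sequence[float]],
                   student_feats: List[Sequence[float]]) -> List[float]:
    """Score every location; inputs are lists of per-location feature vectors."""
    return [cosine_discrepancy(t, s) for t, s in zip(teacher_feats, student_feats)]


scores = anomaly_scores([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Under the two sub-assumptions above, a well-reconstructed normal location scores near zero, while a location the student cannot reconstruct scores higher.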

Abstract:
Remote sensing image captioning is a multimodal foundation task for fine-grained understanding of remote sensing images. However, because remote sensing images contain complex scenes and rich objects, it is very challenging to accurately describe the objects in the scene with their attributes and dependencies. To address these issues, this article proposes a novel scale-aware prompting with optimal transport (SPOT) method to learn effective multiscale features under diverse scenes, and to build fine-grained cross-modal alignment between semantic features and linguistic words during caption generation. Specifically, a scale-aware prompt extractor is constructed to explore feature integrations in complex scenes through learning prompts that query multi-scale features, and to enhance the representation of attributes and dependencies for objects by embedding positional relations. In addition, a fine-grained cross-modal alignment is designed to dynamically match image feature representations and textual semantics through optimal transport. In this way, the model can learn effective language-aligned feature representations for caption generation. Finally, a caption Transformer with causal self-attention is introduced to generate accurate captions for remote sensing scenes. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on three public datasets, and ablation studies further confirm the contribution of each component.
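
The optimal-transport matching step can be illustrated with entropy-regularized Sinkhorn iterations, a widely used approximate solver. This is a sketch under assumptions: the abstract does not say which solver SPOT uses, and the cost matrix, the uniform marginals, and the parameters `eps` and `n_iters` are hypothetical.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             eps: float = 0.1, n_iters: int = 200) -> List[List[float]]:
    """Approximate the transport plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs receive large kernel values.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two image features and two words; the cost is low where they agree semantically.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

In this toy example most of the mass lands on the diagonal, that is, each image feature is matched to the word with which it has the lowest cost.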

Abstract:
3D Gaussian Splatting (3D-GS) has emerged as an alternative 3D representation for novel view synthesis, benefiting from its high-quality rendering results and real-time rendering speed. However, the 3D Gaussians learned by 3D-GS have ambiguous structures without any geometry constraints. This inherent issue in 3D-GS leads to rough boundaries when segmenting individual objects. To remedy this problem, we propose SAGD, a conceptually simple yet effective boundary-enhanced segmentation pipeline for 3D-GS that improves segmentation accuracy while preserving segmentation speed. Specifically, we introduce a Gaussian Decomposition scheme, which exploits the special structure of 3D Gaussians to identify and then decompose the boundary Gaussians. Moreover, to achieve fast interactive 3D segmentation, we introduce a novel training-free pipeline by lifting a 2D foundation model to 3D-GS. Extensive experiments demonstrate that our approach achieves high-quality 3D segmentation without rough boundary issues and can be easily applied to other scene editing tasks. Our code is publicly available at https://github.com/XuHu0529/SAGS

Abstract:
Video dehazing aims to restore clean scenarios from a sequence of hazy frames, where frame alignment is a critical stage for leveraging temporal information. However, haze degrades contrast and obscures details, making alignment challenging. Existing methods ignore the impairment of haze on alignment and thus struggle to align frames accurately. To address this challenge, we propose an alignment network with the temporal lookup table (temporal-LUT), which effectively enhances the haze-degraded frames and provides vivid cues for precise alignment. Specifically, to tackle the color degradation of haze, we employ a learnable lookup table (LUT) to enhance hazy color. The color mapping nature of LUT favorably preserves the naturalness of enhanced outcomes. Besides, we introduce a temporal weight prediction strategy to strengthen inter-frame interaction, which ensures temporal consistency across enhanced results and thereby benefits alignment. Extensive experimental results on two widely used benchmarks and real-world scenes demonstrate the superiority of our method.
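
The color-mapping behavior of a lookup table can be sketched with a one-dimensional LUT and linear interpolation. This is a simplified stand-in: the abstract describes a learnable LUT with temporal weight prediction, whereas the table values below are fixed and hypothetical, and the function name `apply_lut_1d` is not from the paper.

```python
from typing import Sequence


def apply_lut_1d(value: float, lut: Sequence[float]) -> float:
    """Map an intensity in [0, 1] through a 1D LUT using linear interpolation."""
    if len(lut) < 2:
        raise ValueError("the LUT needs at least two entries")
    # Clamp the input, then locate the two neighbouring LUT entries.
    x = min(max(value, 0.0), 1.0) * (len(lut) - 1)
    lo = int(x)
    hi = min(lo + 1, len(lut) - 1)
    t = x - lo
    return (1.0 - t) * lut[lo] + t * lut[hi]


# Hypothetical brightening curve that lifts the dark and mid tones.
brighten = [0.0, 0.8, 1.0]
enhanced = [apply_lut_1d(v, brighten) for v in (0.0, 0.25, 0.5, 1.0)]
```

Because a monotonic LUT is a smooth, order-preserving mapping from input colors to output colors, it cannot reorder intensities, which is one reason LUT-based enhancement tends to look natural.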

Abstract:
Pruning is a highly effective method for reducing the size of neural networks with negligible impact on their average performance. However, recent studies have revealed that pruning actually amplifies the bias in the models, leading to decreased performance for underrepresented groups. To address this issue, we first analyze the impact of pruning on the confidence of each sample and introduce Accumulated Confidence (AC). AC is a proxy that facilitates the identification of bias-conflicting and bias-aligned samples without relying on group annotations. We then propose a debiasing algorithm, which is called DEbiasing Network through Pruning (DENP). DENP utilizes AC to mitigate bias within the network. Even without bias information, DENP exhibits remarkable debiasing performance on varying levels of sparsity, effectively mitigating the bias-exacerbating property of pruning and resulting in both sparse and debiased neural networks. Moreover, even when compared with state-of-the-art debiasing baselines under identical conditions, the DENP still achieves the best performance on multiple benchmark datasets, demonstrating its superior debiasing capabilities.

Abstract:
Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model’s understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo

Abstract:
Current methods for 3D semantic segmentation propose training models with limited annotations to address the difficulty of annotating large, irregular, and unordered 3D point cloud data. They usually focus on the 3D domain only, without leveraging the complementary nature of 2D and 3D data. Besides, some methods extend original labels or generate pseudo labels to guide the training, but they often fail to fully use these labels or address the noise within them. Meanwhile, the emergence of comprehensive and adaptable foundation models has offered effective solutions for segmenting 2D data. Leveraging this advancement, we present a novel approach that maximizes the utility of sparsely available 3D annotations by incorporating segmentation masks generated by 2D foundation models. We further propagate the 2D segmentation masks into the 3D space by establishing geometric correspondences between 3D scenes and 2D views. We extend the highly sparse annotations to encompass the areas delineated by 3D masks, thereby substantially augmenting the pool of available labels. Furthermore, we apply confidence- and uncertainty-based consistency regularization on augmentations of the 3D point cloud and select the reliable pseudo labels, which are further spread on the 3D masks to generate more labels. This innovative strategy bridges the gap between limited 3D annotations and the powerful capabilities of 2D foundation models, ultimately improving the performance of 3D weakly supervised segmentation.

Abstract:
As an alternative to acquiring high-resolution hyperspectral images (HR-HSI), Hyperspectral Image Fusion (HIF) aims to recover clean HR-HSIs by fusing degraded low spatial resolution hyperspectral images and high spatial resolution multispectral images. Among existing HIF approaches, model-guided HIF methods stand out by integrating physical degradation constraints with the learning capabilities of data-driven networks. However, most of them learn deep priors only from degraded-clean pairs without degradation-free knowledge, making them struggle with severe or unseen degradations. To address these issues, we propose a Vector-Quantized Prior-Guided Network (VPG-Net), an unfolding-based HIF framework enhanced by sparse representation and novel uncertainty-driven generative priors. Specifically, VPG-Net unfolds the Maximum A Posteriori (MAP) estimation with a sparse representation model into an uncertainty-aware VQ prior-guided network implementation. Within this framework, the sparse representation prior is integrated into the MAP formulation to improve noise resistance. As the core of our method, we leverage a high-quality vector-quantized (VQ) prior, which serves as a powerful degradation-free generative prior for the HIF process. We pre-train a discrete codebook and encoder on clean HR-HSIs to generate a VQ-prior representation (VQPR), which preserves complete spatial-spectral information. To effectively bridge the gap between degraded inputs and the learned degradation-free codebook, we further incorporate a novel uncertainty-driven probabilistic matching strategy that improves feature alignment and suppresses artifacts. The learned VQPR is then incorporated into the deep prior module as dynamic modulation parameters to enhance the fidelity and realism of the reconstructed results, particularly for severely degraded inputs. 
Extensive experiments on clean and degraded synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art HIF methods in both quantitative metrics and visual quality.

Abstract:
In recent years, there has been notable progress in single-image rain removal, but existing approaches largely assume static data distributions. When the data distribution changes continually, catastrophic forgetting arises, which is common and critical in real-world scenarios. To address this, we propose Evolving COmpact Dual Prompt Learning (EcoDPL), an efficient rehearsal-free continual learning deraining framework designed specifically for low-level vision tasks. Specifically, we design two prompt pools at the image and feature levels and insert these prompts into images and embedding tokens for better knowledge transfer across tasks. Our adaptive weight generation module, P-Fuser, attaches an attention map to each prompt so that it attends adaptively to different inputs and produces input-dependent weights for fusing prompts, making the inserted prompts more flexible across inputs. We also introduce Grad-Tuner, a dictionary learning strategy, to compress knowledge into fewer prompts. This makes the knowledge more compact and leaves more space for new prompts to learn new tasks. Our method stands out by leveraging small, learnable prompts for efficient knowledge retention across tasks without increasing training time or parameters. Furthermore, we present an augmented method that upgrades the distance function γ from simple cosine distance to a more advanced weight generation network. We also employ a fine-tuned dictionary learning technique that compresses knowledge into a more compact form and enhances the ability of prompts to learn new tasks. With these new designs, the model becomes more flexible across inputs and compresses knowledge into fewer prompts, freeing up space to learn new tasks. Through extensive experiments on various rain removal datasets, our EcoDPL method consistently outperforms previous continual learning techniques.
Notably, although EcoDPL is designed for continual learning with changing data, it also performs well with stationary data, proving its robustness and versatility. Our website is available at: https://starymoon.github.io/Prompting-Rain-Off.

Abstract:
Light field cameras and multi-camera arrays have emerged as promising solutions for accurately estimating depth by passively capturing light information. This is possible because the 3D information of a scene is embedded in the 4-D light field geometry. Commonly, depth estimation methods extract this information relying on gradient information, heuristic-based optimisation models, or learning-based approaches. This paper focuses mainly on explicitly understanding and exploiting 4-D geometrical cues for light field depth estimation. Thus, a novel method is proposed, based on a non-learning-based optimisation approach for depth estimation that explicitly considers surface normal accuracy and occlusion regions by utilising a fully explainable 4-D geometric model of the light field. The 4-D model performs depth/disparity estimation by determining the orientations and analysing the intersections of key 2D planes in 4-D space, which are the images of 3D-space points in the 4-D light field. Experimental results show that the proposed method outperforms both learning-based and non-learning-based state-of-the-art methods in terms of surface normal angle accuracy, achieving a Median Angle Error on planar surfaces, on average, 26.3% lower than the state-of-the-art, and still being competitive with state-of-the-art methods in terms of MSE × 100 and Badpix 0.07.

Abstract:
Multi-source remote sensing data classification refers to the process of categorizing ground objects by integrating complementary strengths of multiple remote sensing data, such as hyperspectral image (HSI), light detection and ranging (LiDAR) and synthetic aperture radar (SAR) data. However, current Mamba-based multisource remote sensing data classification approaches rely on fixed scanning patterns that are inadequate in characterizing spectral-spatial information. Additionally, current fusion techniques adopt concatenation or attention-based fusion rules without considering the complementary characteristics between different modalities. To address these limitations, we propose a spectral-spatial dynamic scan Mamba (SDSM) for multi-source remote sensing data classification. Specifically, a dynamic scan Mamba network is proposed to extract the spectral-spatial features of multi-source remote sensing data, in which a dynamic scan module is designed to adaptively capture the important spatial and spectral information. Furthermore, a bidirectional cross-modal fusion rule is proposed to merge the extracted features, in which a global-local frequency feature extraction module is designed to extract the salient structural features of multi-source remote sensing data as clues to guide heterogeneous feature fusion. Comprehensive experiments on four multi-source remote sensing datasets, i.e., MUUFL, Augsburg, Italy and Yellow River, demonstrate that the proposed method outperforms other state-of-the-art methods with respect to quantitative and qualitative results. The code of this article is available at https://github.com/PuhongDuan/SDSM

Abstract:
Existing algorithms for human body part segmentation have shown promising results on challenging datasets, primarily relying on end-to-end supervision. However, these algorithms exhibit severe performance drops in the face of domain shifts, leading to inaccurate segmentation masks. To tackle this issue, we introduce POSTURE: Pose Guided Unsupervised Domain Adaptation for Human Body Part Segmentation - an innovative pseudo-labelling approach designed to improve segmentation performance on the unlabeled target data. Distinct from conventional domain adaptive methods for general semantic segmentation, POSTURE considers the underlying structure of the human body and uses anatomical guidance from pose keypoints to drive the adaptation process. This strong inductive prior translates to impressive performance improvements, averaging 8% over existing state-of-the-art domain adaptive semantic segmentation methods across three benchmark datasets. Furthermore, the inherent flexibility of our proposed approach facilitates seamless extension to source-free settings (SF-POSTURE), effectively mitigating potential privacy and computational concerns, with a negligible drop in performance.

Abstract:
Semantic correspondence establishes keypoint correspondences between different instances of the same category. Fusing texture and semantic features from vision foundation models like stable diffusion (SD) and DINO significantly improves matching performance. However, we found an unnoticed yet essential problem: current feature fusion enhances the edge and semantic information in SD features with fine textures and DINOv2 features with fine semantics, but it destroys the semantic and structural information in SD features with weak and coarse semantics. We propose guard features (GuFT), a simple yet efficient method, to prevent feature degradation. Moreover, matching methods designed for traditional deep neural networks can be simplified based on two key insights: 1) vision foundation models provide rich visual knowledge; and 2) GuFT yields high-quality feature descriptors. We propose a bottleneck-style non-shared aggregation and backward interaction (NABI) module to efficiently capture intra- and inter-feature relationships, instead of common self- and cross-attention. The resulting framework, SimBetter, embodies a “simpler is better” design philosophy. It achieves state-of-the-art results with lower computation on SPair-71k, AP-10K, and PF-PASCAL, excelling in geometry-aware, cross-species, cross-family, and cross-dataset tasks. SimBetter also shows excellent potential in the applications of image-video semantic correspondence and sticker editing. Code is available at https://github.com/wzhlearning/SimBetter

Abstract:
Single-image reflection removal (SIRR) aims to restore the latent background layer from a reflection-contaminated image. Despite the promising progress achieved by deep learning-based methods, the roles of negative training samples and descriptive prompts for the reflection severity are underexplored in most existing deep SIRR approaches, limiting their reflection removal performance and generalization capability. In this work, we introduce a novel training framework that synergistically leverages learnable prompts and image data to optimize the restoration network. To this end, we define reflection levels corresponding to varying degrees of reflection interference on the background content and learn reflection-level prompts to supervise the SIRR process. We propose an Iterative Reflection Level Reduction (IRLR) framework composed of a Restoration Network Training Module (RNTM) and a Reflection Level Learning Module (RLLM). Specifically, RNTM predicts the background layer under the guidance of prompts learned by RLLM, while RLLM in turn refines these prompts using outputs from RNTM. The two modules are trained iteratively to progressively reduce the reflection levels of estimated background layers. To initialize the prompts, we construct a dedicated reflection-level dataset for pretraining. For adaptively supervising RNTM, we design a new reflection-level-aware strategy to address the challenge of directly aligning the output background with the minimal reflection level. Comprehensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art methods on average performance across several released datasets, improving PSNR by 0.82 dB and SSIM by 0.0120, respectively. The source code and dataset are available at https://github.com/NamecantbeNULL/IRLR_SIRR
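
For readers less familiar with the reported metric, the following minimal sketch computes PSNR from the mean squared error between a reference and an estimate. It assumes flattened pixel lists and a peak value of 1.0; the evaluation protocol actually used in the paper (color space, cropping, averaging over images) is not given in the abstract.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Because PSNR is logarithmic in the MSE, a gain of 0.82 dB on a single image corresponds to roughly a 17% reduction in mean squared error (a factor of 10^(-0.082) ≈ 0.83).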

Abstract:
The fusion of hyperspectral images (HSIs) and multispectral images (MSIs) is a hot topic in the remote sensing community. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) and a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large number of HR-HSIs for supervised training, which are rarely available in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which HR-HSIs are no longer required in the training process. Because the LR-HSI contains the spectral information and the HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling method is proposed to introduce the two priors into diffusion posterior sampling, which leverages the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, this method utilizes smaller networks that are simpler and easier to train without other data.

Abstract:
Industrial few-shot anomaly detection (FSAD) requires identifying various abnormal states by leveraging as few normal samples as possible (abnormal samples are unavailable during training). However, current methods often require training a separate model for each category, leading to increased computation and storage overhead. Thus, designing a unified anomaly detection model that supports multiple categories remains a challenging task, as such a model must recognize anomalous patterns across diverse objects and domains. To tackle these challenges, this paper introduces FocusPatch AD, a unified anomaly detection framework based on vision-language models that performs anomaly detection under few-shot multi-class settings. FocusPatch AD links anomaly state keywords to highly relevant discrete local regions within the image, guiding the model to focus on cross-category anomalies while filtering out background interference. This approach mitigates the false detection issues caused by global semantic alignment in vision-language models. We evaluate the proposed method on the MVTec, VisA, and Real-IAD datasets, comparing it against several prevailing anomaly detection methods. In both image-level and pixel-level anomaly detection tasks, FocusPatch AD achieves significant gains in classification and localization performance, demonstrating excellent generalization and adaptability.

Abstract:
Lightweight smoke image segmentation is essential for fire warning systems, particularly on mobile devices. In recent years, although numerous high-precision, large-scale smoke segmentation models have been developed, there are few lightweight solutions specifically designed for mobile applications. Therefore, we propose a Multi-stage Group Interaction and Cross-domain Fusion Network (MGICFN) with low computational complexity for real-time smoke segmentation. To improve the model’s ability to effectively analyze smoke features, we incorporate a Cross-domain Interaction Attention Module (CIAM) to merge spatial and frequency domain features for creating a lightweight smoke encoder. To alleviate the loss of critical information from small smoke objects during downsampling, we design a Multi-stage Group Interaction Module (MGIM). The MGIM calibrates the information discrepancies between high and low-dimensional features. To enhance the boundary information of smoke targets, we introduce an Edge Enhancement Module (EEM), which utilizes predicted target boundaries as advanced guidance to refine lower-level smoke features. Furthermore, we implement a Group Convolutional Block Attention Module (GCBAM) and a Group Fusion Module (GFM) to connect the encoder and decoder efficiently. Experimental results demonstrate that MGICFN achieves an 88.70% Dice coefficient (Dice), an 81.16% mean Intersection over Union (mIoU), and a 91.93% accuracy (Acc) on the SFS3K dataset. It also achieves an 87.30% Dice, a 78.68% mIoU, and a 92.95% Acc on the SYN70K test dataset. Our MGICFN model has 0.73M parameters and requires 0.3G FLOPs.
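
The Dice coefficient and Intersection over Union reported above can be computed for a single binary mask as in the sketch below. It assumes flattened 0/1 masks and treats two empty masks as a perfect match, which is one common convention; the paper's exact averaging over classes and images is not described in the abstract.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2.0 * intersection / (pred_area + target_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask the two metrics are tied by Dice = 2·IoU / (1 + IoU); this identity does not carry over to values that have been averaged over classes or images, such as a mean IoU.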

Abstract:
Microscopic 3D shape reconstruction using depth from focus (DFF) is crucial in precision manufacturing for 3D modeling and quality control. However, the absence of high-precision microscopic DFF datasets and the significant differences between existing DFF datasets and microscopic DFF data in optical design, imaging principles and scene characteristics hinder the performance of current DFF models in microscopic tasks. To address this, we introduce M3D, a novel microscopic DFF dataset, constructed using a self-developed microscopic device. It includes multi-focus image sequences of 1,952 scenes across five categories, with depth labels obtained through the 3D TFT algorithm applied to dense image sequences for initial depth estimation and calibration. All labels are then compared and analyzed against the design values, and those with large errors are eliminated. We also propose M3DNet, a frequency-aware end-to-end network, to tackle challenges like shallow depth-of-field (DoF) and weak textures. Results show that M3D compensates for the limitations of macroscopic DFF datasets and extends DFF applications to microscopic scenarios. M3DNet effectively captures rapid focus decay and improves performance on public DFF datasets by leveraging superior global feature extraction. Additionally, it exhibits strong robustness even in extreme conditions. Dataset and code are available at https://github.com/jiangfeng-Z/M3D

Abstract:
Adversarial distillation (AD) aims to mitigate deep neural networks’ inherent vulnerability to adversarial attacks, thereby providing robust protection for compact models through teacher-student interactions. Despite advancements, existing AD studies still suffer from insufficient robustness due to the limitations of fixed attack strength and attention region shifts. To address these challenges, we propose a strength-adaptive Info-maximizing Adversarial Robustness Distillation paradigm, namely “InfoARD”, which strategically incorporates the Attack-Strength Adaptation (ASA) and Mutual-Information Maximization (MIM) to enhance adversarial robustness against adversarial attacks and perturbations. Unlike previous adversarial training (AT) methods that utilize fixed attack strength, the ASA mechanism is designed to capture smoother and generalized classification boundaries by dynamically tailoring the attack strength based on the characteristics of individual instances. Benefiting from mutual information constraints, our MIM strategy ensures the student model effectively learns from various levels of feature representations and attention patterns, thereby deepening the student model’s understanding of the teacher model’s decision-making processes. Furthermore, a comprehensive multi-granularity distillation is conducted to capture knowledge across multiple dimensions, enabling a more effective transfer of knowledge from the teacher model to the student model. Note that our InfoARD can be seamlessly integrated into existing AD frameworks, further boosting the adversarial robustness of deep learning models. Extensive experiments on various challenging datasets consistently demonstrate the effectiveness and robustness of our InfoARD, surpassing previous state-of-the-art methods.

Abstract:
Continual test-time domain adaptation (CTTA) aims to adapt a pre-trained source model to a stream of continually evolving unlabeled target domains, facilitating model deployment in dynamic and non-stationary environments. Contemporary works usually encode domain-specific (DS) style information in a domain-agnostic manner, synchronizing with the learning of domain-invariant (DI) semantic information. This scheme forces DS information to be optimized using the weights of the previous domain, which are corrupted by cross-domain discrepancies, and hence leads to error accumulation and catastrophic forgetting. Inspired by the Attribute Memory Model (AMM) in brain neuroscience, we propose a dual domain-attribute learning framework based on independent asynchronous updates, aiming to imitate how the brain learns new knowledge without forgetting. Concretely, we explicitly decompose the continual adaptation process into two complementary systems: an event-based learning system (ELS) that captures DS style representations and a knowledge-based learning system (KLS) that concentrates on the DI structural characteristics. The ELS first detects differences in the distribution of data streams and actively builds an adapter pool for new latent domains. The KLS adopts a cross-domain shared adapter emphasizing general knowledge, and cooperates with the adapter from the ELS to jointly guide adaptation. To make DS and DI knowledge work collaboratively, we exploit a gradient conflict solver to ease the conflict between the past and current DI knowledge, realizing a win-win game (i.e., interference-free adaptation) across evolving domains. Our framework has been extensively evaluated on four benchmarks and outperforms state-of-the-art approaches on both segmentation and classification CTTA tasks.

Abstract:
Existing Blind Image Quality Assessment (BIQA) approaches typically employ subjective scores as optimization targets to train the model, aiming for results consistent with human judgments. Such judgments are derived from a comprehensive analysis of complex distortions and diverse semantics from images, whereas subjective scores represent only the overall quality. This poses a significant challenge for a single model to learn diverse perceptual cues under weak supervision. To address this, we propose a Decoupled Feature Learning (DFL) framework that learns compact global content-aware and local distortion-aware features in a disentangled manner for BIQA. Our key insight is to leverage global-local input pairs to decompose content-aware and distortion-aware cues entangled in distorted images, and aggregate decoupled perceptual features into a single network. We design a perceptual knowledge distillation strategy that progressively guides the student from fragmented representations to build local-to-global correspondences by distilling self-supervised semantic knowledge, while incorporating the Just-Noticeable-Difference (JND) model to highlight the transfer of perceptually sensitive content features. Finally, we introduce a local distortion-guided attention module to model synergistic effects of different perceptual features from the student for quality evaluation. Extensive experiments on eight benchmark datasets demonstrate the superior performance of the proposed model over state-of-the-art methods. In addition, the DFL framework can be flexibly used to improve the perception ability of other Transformer variants. The code is released at https://github.com/JianjunXiang/DFT

Abstract:
Recent studies in remote sensing object detection have made excellent progress and shown promising performance. However, most current detectors only explore rotation-invariant feature extraction but disregard the valuable spatial and semantic prior knowledge in remote sensing images (RSIs), which limits the detection performance when encountering blurred or heavily occluded objects. To address this issue, we propose a mask-reconstruction relation learning (MRRL) framework to learn such prior knowledge among objects and a consistency-reasoning transformer over relation proposals (CTRP) to recognize objects with limited visual features via consistency reasoning. Specifically, the MRRL framework applies random masks to some objects in the training dataset and performs masked object reconstruction to guide the network to learn the distribution consistency of objects. CTRP is the core component of the MRRL framework; it models the interaction between spatial and semantic priors and uses easily detected objects to reason about hard-to-detect objects. The trained CTRP can be integrated into an existing detector to improve its ability to detect objects with limited visual features in RSIs. Extensive experiments on widely used datasets for two distinct tasks, namely remote sensing object detection and occluded object detection, demonstrate the effectiveness of the proposed method. Source code is available at https://github.com/sunpeng96/CTRP_mmrotate.

Abstract:
Ultra-Fine-Grained Visual Categorization (Ultra-FGVC) aims to classify objects into sub-granular categories, presenting the challenge of distinguishing visually similar objects with limited data. Existing methods primarily address sample scarcity but often overlook the importance of leveraging intrinsic object features to construct highly discriminative representations. This limitation significantly constrains their effectiveness in Ultra-FGVC tasks. To address this limitation, we propose SV-Transformer, which progressively encodes object features while incorporating background perturbation modeling to generate robust and discriminative representations. At the core of our approach is a progressive feature encoder, which hierarchically extracts global semantic structures and local discriminative details from backbone-generated representations. This design enhances inter-class separability while ensuring resilience to intra-class variations. Furthermore, our background perturbation learning mechanism introduces controlled variations in the feature space, effectively mitigating the impact of sample limitations and improving the model’s capacity to capture fine-grained distinctions. Comprehensive experiments demonstrate that SV-Transformer achieves state-of-the-art performance on benchmark Ultra-FGVC datasets, showcasing its efficacy in addressing the challenges of the Ultra-FGVC task.

Abstract:
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, requiring annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique “many-to-one” relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a “one-to-one” relationship, where each image is independently associated with its GT. Such a limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated samples (5‰). The code is available at https://github.com/pipixiapipi/ICAF

Affiliations: School of Computer Science, Guangdong University of Technology, Guangzhou, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; Hangzhou Institute of Technology, Xidian University, Hangzhou, China; Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.

Abstract:
Recently, incorporating Retinex theory with unfolding networks has attracted increasing attention in the low-light image enhancement (LIE) field. However, existing methods have two limitations, i.e., ignoring the modeling of the physical prior of Retinex theory and relying on a large amount of paired data. To advance this field, we propose a novel self-supervised unfolding network, named S2UNet, for the LIE task. Specifically, we formulate a novel optimization model based on the principle that content-consistent images under different illumination should share the same reflectance. The model simultaneously decomposes two illumination-different images into a shared reflectance component and two independent illumination components. Due to the absence of the normal-light image, we process the low-light image with gamma correction to create the illumination-different image pair. Then, we translate this model into a multi-stage unfolding network, in which each stage alternately optimizes the shared reflectance component and the respective illumination components of the two images. During progressive multi-stage optimization, the network inherently encodes the reflectance consistency prior by jointly estimating an optimal reflectance across varying illumination conditions. Finally, to suppress the amplification of noise present in low-light images, we propose a self-supervised denoising mechanism. Extensive experiments on nine benchmark datasets demonstrate that our proposed S2UNet outperforms state-of-the-art unsupervised methods in terms of both quantitative metrics and visual quality, while achieving competitive performance compared to supervised methods. The source code will be available at https://github.com/J-Liu-DL/S2UNet
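
The gamma-correction step mentioned above can be illustrated with a minimal sketch. It assumes images normalized to [0, 1] and a hypothetical exponent `gamma=0.5` (values below 1 brighten the image); the abstract does not state the exponent or any further preprocessing, so this is not the authors' exact pipeline.

```python
import numpy as np

def make_illumination_pair(low_light, gamma=0.5):
    """Return (original, gamma-corrected) images with the same content.

    Assumes pixel values in [0, 1]; gamma is a hypothetical choice.
    """
    img = np.clip(low_light.astype(np.float64), 0.0, 1.0)
    return img, np.power(img, gamma)

# Toy 2x2 "image": dark pixels are lifted, white stays white.
low = np.array([[0.04, 0.16], [0.25, 1.0]])
orig, brighter = make_illumination_pair(low)
```

Because the two images differ only in illumination, a Retinex-style decomposition can reasonably ask them to share a single reflectance component.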

Abstract:
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this paradigm LLMs for Vision because it employs LLMs for visual understanding and reasoning. However, we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing

Abstract:
Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperforms existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), under the total bitrate setting, UGAE achieves an average BD-PSNR gain of 9.98 dB and -90.54% BD-bitrate for geometry under the D1 metric, as well as a 3.34 dB BD-PSNR improvement with -55.53% BD-bitrate for attributes. Additionally, it improves perceptual quality significantly. Our source code will be released on GitHub at: https://github.com/yuanhui0325/UGAE

Abstract:
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
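
One common way to encourage orthogonality among concept vectors is to penalize the off-diagonal entries of their cosine-similarity (Gram) matrix. The sketch below shows that generic penalty; it is an assumption for illustration and may differ from the paper's geometry-aware concept discrimination loss. The function name and toy inputs are hypothetical.

```python
import numpy as np

def concept_orthogonality_loss(concepts):
    """Sum of squared off-diagonal cosine similarities between concepts.

    concepts: array of shape (num_concepts, dim). A generic penalty,
    not necessarily the loss used in the paper.
    """
    norms = np.linalg.norm(concepts, axis=1, keepdims=True) + 1e-12
    unit = concepts / norms
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

orthogonal = np.eye(3)                             # mutually orthogonal concepts
overlapping = np.array([[1.0, 0.0], [1.0, 1.0]])   # concepts 45 degrees apart
loss_orthogonal = concept_orthogonality_loss(orthogonal)
loss_overlapping = concept_orthogonality_loss(overlapping)
```

The penalty vanishes for mutually orthogonal concepts and grows as concepts overlap, so minimizing it pushes concepts toward distinct directions.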

Abstract:
Recent advances in “track-anything” models have significantly improved fine-grained video understanding by simultaneously handling multiple video segmentation and tracking tasks. However, existing models often struggle with robust and efficient temporal propagation. To address these challenges, we propose the Sparse Spatio-Temporal Propagation (SSTP) method, which achieves robust and efficient unified video segmentation by selectively leveraging key spatio-temporal features in videos. Specifically, we design a dynamic 3D spatio-temporal convolution to aggregate global multi-frame spatio-temporal information into memory frames during memory construction. Additionally, we introduce a spatio-temporal aggregation reading strategy to efficiently aggregate the relevant spatio-temporal features from multiple memory frames during memory retrieval. By combining SSTP with an image segmentation foundation model, such as the segment anything model, our method effectively addresses multiple data-scarce video segmentation tasks. Our experimental results demonstrate state-of-the-art performance on five video segmentation tasks across eleven datasets, outperforming both task-specific and unified methods. Notably, SSTP exhibits strong robustness in handling sparse, low-frame-rate videos, making it well-suited for real-world applications.

Abstract:
The Segment Anything Model 2 (SAM 2) has demonstrated exceptional performance in object segmentation tasks but encounters challenges in visual object tracking, particularly in handling crowded scenes with fast-moving or self-occluding objects. Additionally, its fixed-window memory mechanism indiscriminately retains past frames, leading to error accumulation. This issue results in incorrect memory retention during occlusions, causing the model to condition future predictions on unreliable features and leading to identity switches or drift in crowded scenes. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 that integrates temporal motion cues with a novel motion-aware memory selection strategy. SAMURAI effectively predicts object motion and refines mask selection, achieving robust and precise tracking without requiring retraining or fine-tuning. It demonstrates strong training-free performance across multiple VOT benchmark datasets, underscoring its generalization capability. SAMURAI achieves state-of-the-art performance on LaSOT_ext, GOT-10k, and TrackingNet, while also delivering competitive results on LaSOT, VOT2020-ST, VOT2022-ST, and VOS benchmarks such as SA-V. These results highlight SAMURAI’s robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments with an optimized memory selection mechanism. Code and results are available at https://github.com/yangchris11/samurai

Abstract:
The similar textures, diverse shapes, and blurred boundaries of thyroid lesions in ultrasound images pose a significant challenge to accurate segmentation. Although several methods have been proposed to alleviate these issues, their generalization is hindered by limited annotated data and an insufficient ability to distinguish lesions from their surrounding tissues, especially in the presence of noise and outliers. Additionally, most existing methods lack uncertainty estimation, which is essential for providing trustworthy results and identifying potential mispredictions. To this end, we propose knowledge-prompted trustworthy disentangled learning (KPTD) for thyroid ultrasound segmentation with limited annotations. The proposed method consists of three key components: 1) knowledge-aware prompt learning (KAPL) encodes TI-RADS reports into text features and introduces learnable prompts to extract contextual embeddings, which assist in generating region activation maps (serving as pseudo-labels for unlabeled images); 2) foreground-background disentangled learning (FBDL) leverages region activation maps to disentangle foreground and background representations, refining their prototype distributions through a contrastive learning strategy to enhance the model’s discrimination and robustness; and 3) foreground-background trustworthy fusion (FBTF) integrates the foreground and background representations and estimates their uncertainty based on evidence theory, providing trustworthy segmentation results. Experimental results show that KPTD achieves superior segmentation performance under limited annotations, significantly outperforming state-of-the-art methods.

Abstract:
Recent studies have shown that Deep Neural Networks (DNNs) are susceptible to adversarial attacks, with frequency-domain analysis underscoring the significance of high-frequency components in influencing model predictions. Conversely, targeting low-frequency components has been effective in enhancing attack transferability on black-box models. In this study, we introduce a frequency decomposition-based feature mixing method to exploit these frequency characteristics in both clean and adversarial samples. Our findings suggest that incorporating features of clean samples into adversarial features extracted from adversarial examples is more effective in attacking normally-trained models, while combining clean features with the adversarial features extracted from low-frequency parts decomposed from the adversarial samples yields better results in attacking defense models. However, a conflict issue arises when these two mixing approaches are employed simultaneously. To tackle the issue, we propose a cross-frequency meta-optimization approach comprising the meta-train step, meta-test step, and final update. In the meta-train step, we leverage the low-frequency components of adversarial samples to boost the transferability of attacks against defense models. Meanwhile, in the meta-test step, we utilize adversarial samples to stabilize gradients, thereby enhancing the attack’s transferability against normally trained models. For the final update, we update the adversarial sample based on the gradients obtained from both meta-train and meta-test steps. Our proposed method is evaluated through extensive experiments on the ImageNet-Compatible dataset, affirming its effectiveness in improving the transferability of attacks on both normally-trained CNNs and defense models. The source code is available at https://github.com/WJJLL/MetaSSA
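
The abstract does not specify which frequency decomposition is used. As one possible illustration, the sketch below extracts the low-frequency part of a single-channel image with a circular low-pass mask in the Fourier domain; the function name `low_frequency_part` and the `radius` cutoff are hypothetical choices, not details from the paper.

```python
import numpy as np

def low_frequency_part(image, radius):
    """Keep only Fourier coefficients within `radius` of the DC term.

    An assumed FFT-based decomposition for a 2D (H, W) image.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # DC moved to (h//2, w//2)
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
low = low_frequency_part(x, radius=3)
high = x - low   # the residual holds the high-frequency content
```

In a feature-mixing attack of the kind described, `low` would stand in for the low-frequency component decomposed from an adversarial sample before features are extracted from it.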

Abstract:
Light Field (LF) images provide rich visual representations of 3D scenes by capturing both spatial and angular information of light rays. However, their high dimensions present substantial challenges for conventional 2D image watermarking techniques in effectively ensuring copyright protection. In this work, we propose a deep learning-based Spatial-Angular Consistency waterMarking (SACMark) network, designed to address the unique challenges of watermark embedding and extraction in LF images. SACMark employs a spatial-angular feature extraction module to capture the multidimensional information of LF images and introduces consistency matching and fusion strategies to enhance feature utilization. The network adopts an encoder-noise-decoder architecture, optimized through adversarial training to improve the imperceptibility and robustness of the watermark. Experimental results demonstrate that SACMark maintains high visual quality across various embedding capacities and has minimal impact on depth estimation. Compared to traditional LF watermarking approaches and existing deep learning-based methods for 2D images, SACMark demonstrates improved resilience to noise while preserving essential LF characteristics. These findings suggest that SACMark holds promise for practical applications and may contribute to future developments in secure and adaptive LF image protection.

Abstract:
Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: 1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, 2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and 3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. The source code can be found at https://github.com/wwlCape/UGTSR-main
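
The idea of fusing multiple candidate codebook entries, rather than taking only the nearest one, can be sketched as follows. The softmax-over-negative-distance weighting is an assumption made for illustration; the abstract does not state how UGTSR fuses its Top-k candidates, and all names here are hypothetical.

```python
import numpy as np

def topk_codebook_match(feature, codebook, k=2):
    """Fuse the k nearest codebook entries with distance-based weights.

    feature: (dim,), codebook: (num_codes, dim). The weighting scheme
    is an illustrative assumption.
    """
    dists = np.sum((codebook - feature) ** 2, axis=1)
    idx = np.argsort(dists, kind="stable")[:k]
    # Shift by the minimum distance for numerical stability.
    weights = np.exp(-(dists[idx] - dists[idx].min()))
    weights /= weights.sum()
    return weights @ codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
query = np.array([0.5, 0.0])   # equidistant from the first two codes
fused, idx = topk_codebook_match(query, codebook, k=2)
```

With `k=1` this reduces to ordinary nearest-neighbor lookup, which makes the benefit of blending several candidates easy to compare against.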

Abstract:
Unrestrained palmprint recognition refers to a comprehensive identity authentication technology that performs personal authentication based on palmprint images captured in uncontrolled environments, e.g., smartphone cameras, surveillance footage, or near-infrared scenarios. However, unrestrained palmprint recognition faces significant challenges due to the variability in image quality, lighting conditions, and hand poses present in such settings. We observe that many existing methods utilize the subspace structure as a prior, for which the block diagonal property of the data has been proved. In this paper, we consider a unified learning model, named high-confident block diagonal analysis for multi-view palmprint recognition (HCBDA_MPR), to guarantee the consensus block diagonal property for all views. In particular, we propose a multi-view block diagonal regularizer that guides all views to learn a consensus block diagonal structure. In this manner, the main discriminant features from each view can be preserved while a strict block diagonal structure is learned across all views. Experimental results on a number of real-world unrestrained palmprint databases demonstrate the superiority of the proposed method, which obtains the highest recognition accuracies in comparison with other state-of-the-art related methods.

Abstract:
Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although existing methods have made significant progress, their optimization relies solely on the supervision from voxel labels and faces the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
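
A simplified reading of "semantic differences of a voxel against its neighbors within a cubic area" is to count, for each voxel, how many voxels in its 3x3x3 neighborhood carry a different label. The sketch below implements only that simplified count; the actual Cubic Semantic Anisotropy definition may differ, and the function name, cube size, and edge padding are assumptions.

```python
import numpy as np

def cubic_anisotropy(labels):
    """Count differing labels in each voxel's 3x3x3 neighborhood.

    labels: integer array of shape (D, H, W). Borders are handled by
    replicating edge voxels (an illustrative choice).
    """
    padded = np.pad(labels, 1, mode="edge")
    score = np.zeros(labels.shape, dtype=np.int64)
    d, h, w = labels.shape
    for dz in range(3):
        for dy in range(3):
            for dx in range(3):
                neighbor = padded[dz:dz + d, dy:dy + h, dx:dx + w]
                score += (neighbor != labels)
    return score

# A single voxel of class 1 surrounded by class 0.
labels = np.zeros((3, 3, 3), dtype=np.int64)
labels[1, 1, 1] = 1
score = cubic_anisotropy(labels)
```

Voxels with high scores sit on semantic boundaries, which is the kind of location one would plausibly pick as a critical anchor; voxels deep inside a uniform region score zero.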

Abstract:
Tracking-by-Detection paradigms shine in generic multi-object tracking (MOT), while their compact construction hinders real-time applications. In this work, we attribute the substantial computational burden to two expensive components, i.e., detection and re-identification. Building upon the principle of adaptively maintaining acceptable inference efficiency, we present Adaptively Sparse Detection with attention-guided refinement (ASDTracker) for efficient tracking. Specifically, our ASDTracker rapidly assesses short-term and long-term occlusion, dynamically determining when to use the expensive detector. For non-key frames, we efficiently refine small-size crops taken from Kalman Filter predictions and introduce noisy shadow labels to robustly train this refinement network. Additionally, we replace the heavy ReID network with a lightweight appearance representation, which efficiently extracts sufficient appearance cues in coarsely quantized color spaces. Extensive experiments on four benchmarks demonstrate that ASDTracker achieves competitive performance in generalization and robustness at a favorable inference speed. Moreover, the efficient tracker is further deployed on an unmanned surface vehicle, achieving high accuracy and low latency in real-world scenarios.
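
An appearance cue in a coarsely quantized color space can be as simple as a joint RGB histogram with a few bins per channel. The sketch below is one such descriptor with a histogram-intersection similarity; the bin count, the RGB color space, and the function names are assumptions, since the abstract does not describe the exact representation.

```python
import numpy as np

def quantized_color_descriptor(crop, bins_per_channel=4):
    """Normalized joint color histogram of an (H, W, 3) uint8 crop.

    Assumes bins_per_channel divides 256; all choices are illustrative.
    """
    step = 256 // bins_per_channel
    q = crop.astype(np.int64) // step          # per-channel bin index
    codes = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins_per_channel ** 3)
    return hist.astype(np.float64) / hist.sum()

def histogram_similarity(a, b):
    """Histogram intersection in [0, 1]; 1 means identical histograms."""
    return float(np.minimum(a, b).sum())

red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 250
dark_red = np.zeros((4, 4, 3), dtype=np.uint8)
dark_red[..., 0] = 200                         # falls in the same coarse bin
blue = np.zeros((4, 4, 3), dtype=np.uint8)
blue[..., 2] = 250

sim_same = histogram_similarity(quantized_color_descriptor(red),
                                quantized_color_descriptor(dark_red))
sim_diff = histogram_similarity(quantized_color_descriptor(red),
                                quantized_color_descriptor(blue))
```

Coarse bins make the descriptor tolerant of small lighting changes while still separating clearly different colors, and it costs far less than running a ReID network on every crop.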

Abstract:
Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervised loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image’s power. The performance of our foundation model is validated on nine typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.

Abstract:
Unsupervised domain adaptive object detection methods enhance model robustness in the target domain without requiring target-domain annotations. Despite notable progress, existing methods face two major challenges: 1) insufficient and inefficient learning of holistic feature consistency due to cumbersome pixel-level style matching and semantic discrepancy elimination between domains as well as the overlooking of their collaborative effect; and 2) unreliable learning of category feature compactness caused by poor-quality target-domain samples, inaccurate pseudo-labels and noisy cross-domain contrast paradigms. To address these challenges, we propose a novel Semantic Consistency and Compactness Learning (SCCL) network. For consistency learning, we introduce a Visual Adaptation-guided Semantic Alignment (VSA) module that achieves style matching through simple feature adaptation and incorporates a novel adversarial-free self-supervised method for feature disentanglement. The collaboration between these two aspects enables sufficient and efficient consistency learning. For reliable compactness learning, we develop a plug-and-play Instance Center-Contrastive (ICC) head that, for the first time, comprehensively addresses all three potential causes of unreliable learning through three integrated innovations, concerning sample pseudo-label quality enhancement, reliable sample storage and updating, and a robust sample contrast paradigm. Besides, the mutual reinforcement effect of VSA and ICC simultaneously enhances feature transferability and discriminability. Extensive experiments across four UDA object detection benchmarks with two baselines show that SCCL achieves superior adaptability and robustness. Code will be available at https://github.com/TooZE23/SCCL.

Abstract:
The accurate measurement of perceptual color differences (CDs) between two images plays an important role in modern smartphone photography. Although traditional CD metrics provide numerical scores to quantify color variations, they often lack the ability to offer intuitive insights or explanations that reflect the factors behind these differences in a way that aligns with human perception and reasoning. Here, we present CD-Reasoning, a method designed not merely to compute numerical CD scores but also to provide a detailed rationale for the observed CDs between images. This method goes beyond simple numerical quantification, delivering an explanatory analysis that bridges quantitative assessments with the qualitative reasoning characteristic of human perception. The development of the CD-Reasoning model begins with the compilation of a multi-modal CD dataset dubbed M-SPCD based on the existing SPCD, where we collect textual descriptions that detail the quantification of CDs across seven pivotal attributes: white balance, brightness contrast, color contrast, overall brightness, overall color, shadow detail, and highlight detail. Utilizing the newly curated M-SPCD dataset, we enhance the capabilities of cutting-edge Multimodal Large Language Models (MLLMs) to not only accurately assess numerical CD scores but also to provide in-depth reasoning that explains the CDs between two images. Extensive experiments demonstrate that the proposed CD-Reasoning not only achieves superior accuracy compared to state-of-the-art CD metrics but also significantly exceeds leading MLLMs in CD interpretation. Source code will be available at https://github.com/LongYu-LY/CD-Reasoning.

Abstract:
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from sparse views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction problem as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. Nevertheless, it is challenging to preserve 3D view consistency when directly generating video frames from pre-trained models. To address this issue, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of ReconX over state-of-the-art methods in terms of quality and generalizability.

Abstract:
The dependence of neural radiance fields (NeRF) on accurate camera poses has emerged as a critical obstacle to their widespread real-world applications. While recent advances have demonstrated the potential for simultaneously addressing camera registration and scene reconstruction, these methods inherently rely on reasonable initialization derived from pose or scene priors and struggle with complex scenes involving large camera motions, particularly in unordered 360-degree scenes. In this work, we propose Zero-Pose-Prior NeRF to recover radiance fields from unposed and unordered image collections without any prior knowledge. Our key insight is to decompose this complex problem into smaller sub-problems, wherein the sub-problems’ camera poses are initially estimated to provide self-bootstrapping priors for the global pose estimation, followed by a recursive registration and reconstruction. To achieve this, we first perform scene partitioning to establish a hierarchical structure that describes registration order from local to global. Thereafter, we devise a conditionally-decoupled positional encoding for NeRFs, which serves as the basic model for camera pose estimation and scene representation. Following this, we develop a recursive registration scheme that estimates the poses of local scenes and registers them into a unified global pose space, ultimately enabling the reconstruction of the entire scene. Experiments on real-world scenes show that our approach outperforms state-of-the-art pose-free methods in camera pose accuracy and radiance field reconstruction robustness, resulting in high-fidelity view synthesis.

Abstract:
Human motion prediction is a key task in computer vision and human-robot interaction, which has received much attention in recent years. However, existing approaches suffer from two issues: 1) They typically rely only on complete data and overlook real-world challenges such as missing observations. 2) Recent works fail to capture the diverse relations among body parts in different action categories, which limits their prediction performance. To address the above problems, we propose a novel Incomplete human Motion Prediction method through motion Recovery and Structure-Semantic fusion (IMPRESS). Specifically, for motion recovery, we introduce a wavelet-based self-attention module. It captures motion details from high-frequency features and extracts global trends from low-frequency components. To enhance the relations among different body parts, we design a structure-semantic fusion graph convolutional network. Moreover, we employ a dual-channel sliding window attention mechanism to capture motion periodicity, enabling smoother predictions. Extensive experiments on two benchmark datasets (Human3.6M, CMU-MoCap) demonstrate that IMPRESS achieves state-of-the-art average prediction performance under both complete and incomplete observations.
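
To make the low-frequency/high-frequency split concrete, the following minimal sketch applies a single-level Haar wavelet transform to a one-dimensional joint trajectory. The Haar wavelet and the function names `haar_dwt` and `haar_idwt` are assumptions chosen for illustration; the abstract does not say which wavelet IMPRESS uses or how the coefficients feed into self-attention.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT of an even-length sequence.

    Returns (approx, detail): the low-frequency trend and the
    high-frequency detail coefficients.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild each sample pair and interleave them."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([s * (a + d), s * (a - d)])
    return out


# Hypothetical coordinate of one joint over four frames.
trajectory = [1.0, 3.0, 2.0, 2.0]
trend, details = haar_dwt(trajectory)
reconstructed = haar_idwt(trend, details)
```

The detail coefficient is zero wherever two neighbouring frames are identical, so motion detail concentrates in the high-frequency band while the trend carries the overall movement.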

Abstract:
Grounded question answering in egocentric videos (Ego-GQA) aims to identify the relevant temporal window and generate corresponding responses in natural language given a textual question. Compared with third-person videos, egocentric video understanding requires more advanced human-centric thinking capability. However, existing Ego-GQA approaches often fail to distinguish the inherent limitations of dynamic egocentric context understanding, treating both first-person and third-person perspectives equally. This oversight leads to hallucinations and a lack of proper egocentric reasoning in first-person video understanding. To address this issue, we propose a novel Collaborated with Hallucination (CoHa) framework for Ego-GQA, which quantifies the hallucinations generated by an Ego-GQA model and further leverages them as error demonstrations to constrain the model’s reasoning process, encouraging it to ground predictions in egocentric visual cues instead of relying on biased pretraining priors. Specifically, we first employ Subjective Logic to quantify the degree of uncertainty in unreliable answers. We then generate diffusion-based noisy visual inputs to amplify the hallucinations as error demonstrations, which are used to append appropriate constraints to the model according to the uncertainty. These constraints effectively steer predictions away from the unreliable semantics induced by inherent drawbacks in egocentric thinking. Additionally, we incorporate an interactive refinement module to help the model explore more fine-grained cues observed from the first-person view. Extensive experiments on two widely used benchmarks demonstrate that our CoHa method outperforms recent state-of-the-art methods. Our code is available at https://github.com/Mrshenshen/CoHa

Abstract:
In the past few years, group-based sparse representation (GSR) has emerged as a powerful paradigm for image inverse problems by synergizing model-driven interpretability with nonlocal self-similarity priors. Nevertheless, its practical utility is hindered by computationally expensive iterative processes. Deep learning (DL) methods can avoid this deficiency, but they often lack model interpretability. To bridge this gap, we propose a novel deep group-based sparse representation framework, termed DeepGSR, which brings the GSR method and the DL approach together. DeepGSR not only circumvents the iterative bottlenecks of conventional GSR but also preserves its model interpretability through a learnable parameterization. Specifically, the network is built upon a GSR model that leverages nonlocal self-similarity, and it integrates adaptive patch matching and aggregation mechanisms to model complex intra-group relationships in the latent space. To reduce the computational complexity associated with traditional SVD-based rank shrinkage, we introduce a learnable low-rank shrinkage module that incorporates low-rank constraints while enhancing the interpretability and adaptability of the model. To better exploit frequency-specific structures, the network incorporates a shifting wavelet-domain patch partitioning strategy, which separately models high- and low-frequency components to further enhance the representation ability of the network. Extensive experiments demonstrate that DeepGSR, when applied as a drop-in replacement module to various image inverse problems such as image denoising, single-image deraining, metal artifact reduction, sparse-view computed tomography reconstruction, phase retrieval, and all-in-one image restoration, consistently delivers effective performance, validating the effectiveness of the proposed framework. The source code and datasets have been made publicly available at https://github.com/shibaoshun/DeepGSR

Affiliations: School of Control Science and Engineering, Shandong University, Jinan, China; School of Information Science and Engineering, Shandong University, Qingdao, China; Key Laboratory of Knowledge Engineering With Big Data, Ministry of Education of China, and the Innovation School of Artificial Intelligence, Hefei University of Technology, Hefei, China; School of Computer Science and Technology, Hainan University, Haikou, China; Institute of Information Science, Beijing Jiaotong University, Beijing, China; School of Data Science, Lingnan University, Hong Kong, SAR, China; School of Control Science and Engineering and the Key Laboratory of Machine Intelligence and System Control, Ministry of Education, Shandong University, Jinan, China

Abstract:
Facial image acquisition under constrained illumination and with limited-resolution imaging devices often results in coupled photometric and geometric degradations, manifesting as low-light and low-resolution (LLR) conditions. Prevailing research predominantly follows fragmented optimization paradigms that address low-light image enhancement (LLIE) and face super-resolution (FSR) as isolated tasks. This approach overlooks the compound nature of the degradations, thereby significantly limiting its applicability in practical scenarios. To bridge this gap, we present DiffLLFace, a unified framework that harnesses diffusive generative capabilities with illumination-aware trajectories to achieve robust FSR from LLR observations. The core of our method lies in its alternate illumination-diffusion adaptation, which operates throughout the generation process. This mechanism not only captures degradation patterns in both brightness and structure to harmonize latent representations but also dynamically calibrates the illumination prior with the generative knowledge inherent to diffusion models. As such, DiffLLFace attains precise control over conditional adaptation and illumination rectification. We further devise a simple yet effective non-parametric Fourier enhancement strategy, which provides structural appearance clues that work in concert with the alternate adaptation to ensure texture and color consistency. Extensive experiments demonstrate the superiority of DiffLLFace over existing methods and remarkable generalizability on complex natural scenes. Code is available at https://github.com/KaishengPang/DiffLLFace

Abstract:
Open-vocabulary object detection (OVD) aims to detect novel object concepts by mining region-word correspondences from image-text pairs, yet current methods often produce false correspondences. While some strategies (e.g., one-to-one matching) were proposed to mitigate this issue, they often sacrifice numerous valuable region-word pairs during the matching process. To overcome these challenges, we propose a novel comprehensive alignment framework, named Region-word Alignment with Partial Optimal Transport (ROOT), which reframes the region-word matching task as a problem of partial distribution alignment. Unlike traditional optimal transport, which shifts the full mass of the distribution, partial optimal transport enables selective matching, making it more robust to noise in region and word alignment. Specifically, ROOT first employs partial optimal transport to obtain an optimal transport plan for region and word feature alignment. This transport plan is then used to compute a matching reliability score for each region-word pair, which reweights the contrastive alignment loss to enhance accuracy. By enabling more flexible and reliable region-text matches, ROOT significantly reduces misalignment errors while preserving valuable region-word correspondences. Extensive experiments on standard benchmarks OV-COCO and OV-LVIS show that our ROOT outperforms the previous state-of-the-art works, demonstrating the effectiveness of our approach.
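
For reference, the standard partial optimal transport problem can be written as below. This is the textbook formulation and not necessarily the exact objective optimized in ROOT; the symbols $C$ (region-word cost matrix), $a$ and $b$ (masses assigned to regions and words), and $s$ (the amount of mass that must be transported) are notation assumed here rather than taken from the paper.

```latex
\min_{T \ge 0} \; \langle T, C \rangle
\quad \text{s.t.} \quad
T \mathbf{1} \le a, \qquad
T^{\top} \mathbf{1} \le b, \qquad
\mathbf{1}^{\top} T \mathbf{1} = s, \qquad
0 \le s \le \min\bigl(\lVert a \rVert_1, \lVert b \rVert_1\bigr).
```

When $s = \lVert a \rVert_1 = \lVert b \rVert_1$, the inequalities become equalities and the problem reduces to classical full-mass optimal transport. A smaller $s$ lets poorly matched regions or words remain unassigned, which is the selective matching the abstract refers to.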

Abstract:
Recovering ghost-free High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit saturation and significant motion. Diffusion Models (DMs) have recently been introduced to the HDR imaging field, showing promising performance, particularly in achieving perceptually better results than previous DNN-based methods. However, DMs require extensive iterations with large models to estimate entire images, resulting in inefficiency that hinders their practical application. To address this challenge, we propose the Low-Frequency aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of LF-Diff is to implement the DM in a highly compact latent space and integrate it into a regression-based model to enhance the details of reconstructed images. Specifically, as low-frequency information is closely related to human visual perception, we propose to utilize DMs to create compact low-frequency priors for the reconstruction process. These priors are integrated into a carefully designed Dynamic HDR Reconstruction Network (DHRNet), which employs a regression-based approach to produce high-quality HDR images. Furthermore, we introduce the Attention-guided Deformable Alignment Module (ADAM) that utilizes correlation-driven feature matching to learn deformable receptive fields for self-attention, enabling efficient pre-alignment of LDR images by focusing on salient regions. Extensive experiments on synthetic and real-world benchmark datasets demonstrate that our LF-Diff performs favorably against several state-of-the-art methods and is 10× faster than previous DM-based methods.

Abstract:
Video temporal grounding, including moment retrieval and highlight detection, is an emerging topic aiming to identify specific clips within videos. In addition to pre-trained video models, contemporary methods utilize pre-trained vision-language models (VLMs) to capture detailed characteristics of diverse scenes and objects from video frames. However, because these VLMs are pre-trained on images, directly using their pre-extracted features neglects the domain gap between the pre-training and temporal grounding datasets, inducing domain shifts due to the data-level distribution disparity. As a result, VLMs may struggle to distinguish action-sensitive patterns from static objects, making it necessary to adapt them to specific data domains for effective feature representation over temporal grounding. In this work, we address two primary challenges to achieve this goal. Specifically, to mitigate high adaptation costs, we propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation before standard downstream training, where downstream-adaptive features are learned through several well-designed pretext tasks that ensure improved performance. Furthermore, to integrate action-sensitive information into VLMs, we introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of VLMs to better discover action-sensitive visual patterns. This is followed by context-aware temporal prompt learning, which considers both action cues and temporal context to enhance the ability to recognize patterns associated with actions for downstream tasks. Extensive experiments demonstrate that ActPrompt is an off-the-shelf training framework that can be applied effectively to various SOTA methods, resulting in notable improvements.

Abstract:
Multimodal Image Fusion (MMIF) aims to synthesize complementary information from different modalities to generate comprehensive fused images, thereby facilitating downstream applications. Existing methods typically employ deep neural networks to directly construct high-dimensional image-to-image mappings, which is highly challenging and makes it hard to extract generalizable patterns for various fusion scenarios. Inspired by meta-learning, we propose a learning-to-optimize fusion framework, named LTOFusion, which formulates image fusion as a trajectory optimization problem, decoupling the complicated fusion problem into multistage subproblems. Subsequently, a restricted state transition function based on flow matching is designed to compress the prediction space and lead the network to build an image-to-flow mapping and fine-tune the current fusion state. To facilitate model training, we collect intermediate fusion states and utilize a memory-replay strategy, further enhancing the sample diversity and model robustness. In addition, a hybrid loss with respect to intensity, gradient, structure, and local normalized cross-correlation is designed to improve image details and reduce potential artifacts for fusion results. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across multiple fusion tasks and downstream applications without requiring fine-tuning. The code is available at https://github.com/HeDan-11/LTOFusion

Abstract:
Bird’s-Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, although pivotal for real-world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this article proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. The proposed framework is evaluated on BEV semantic segmentation using data generated by multiple world models, with comprehensive testing conducted on the public nuScenes dataset under unsupervised and semi-supervised settings. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg
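
The Dirichlet branch can be illustrated with the convention commonly used in evidential deep learning, where non-negative per-class evidence is turned into Dirichlet parameters and a scalar uncertainty. This is a minimal sketch under that assumption: the function name `dirichlet_opinion` is hypothetical, and the abstract does not specify how NRSeg parameterizes or trains its Dirichlet head.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to (expected_probs, uncertainty).

    Assumes the common evidential convention alpha_k = evidence_k + 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                    # Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # u = K / S, equals 1 with no evidence
    return probs, uncertainty


# A pixel with no evidence versus one with strong evidence for class 0.
vague_probs, vague_u = dirichlet_opinion([0.0, 0.0, 0.0])
sure_probs, sure_u = dirichlet_opinion([9.0, 0.0, 0.0])
```

With no evidence the uncertainty is maximal and the expected probabilities are uniform; as evidence accumulates for one class, the uncertainty shrinks, which is the kind of signal that can flag noisy synthetic samples.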

Affiliations: Anhui Province Key Laboratory of Industry Safety and Emergency Technology and the Intelligent Interconnected Systems Laboratory of Anhui Province, Hefei University of Technology, Hefei, China; School of Computer Science and Information Engineering and the Key Laboratory of Knowledge Engineering With Big Data (Ministry of Education), Hefei, China; Department of Computer and Information Science, State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China; Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai’an, China; Zhejiang Laboratory, Zhejiang University, Hangzhou, China; Yunnan Key Laboratory of Software Engineering, Kunming, Yunnan, China

Abstract:
Existing image restoration and enhancement (IRE) methods suffer from three fundamental limitations: 1) they present a high technical barrier, requiring expert knowledge and lacking intuitive natural language control; 2) they are inflexible and poorly adaptable, as models are typically designed for single, specific degradations and fail on complex or mixed real-world scenarios; and 3) they lack interactivity and ignore subjectivity, operating as “closed-box” tools that cannot incorporate human feedback or understand nuanced user intentions. To overcome these challenges, we pioneer a novel paradigm: a Multi-Agent System (MAS) for interactive and adaptive image restoration. We design and implement a prototype system, Interactive and Adaptive Multi-Agent System (IAMAgent), which orchestrates a team of specialized agents to collaboratively solve complex IRE tasks. At its core, a Manager Agent, driven by a Large Language Model, interprets user commands, devises strategies, and allocates sub-tasks. It directs a Perception Agent for degradation diagnosis, a suite of specialized Execution Agents that encapsulate various low-level vision models, and a Critique Agent for automated quality assessment. This collaborative framework enables an innovative, language-driven, and human-in-the-loop optimization process. Our work is the first to introduce the MAS paradigm to the IRE domain, transforming it from a collection of static tools into a dynamic, user-centric, and intelligent system. We demonstrate that IAMAgent not only significantly enhances restoration performance and adaptability but also bridges the critical gap between high-level human intention and low-level vision tasks.

Abstract:
No-reference image quality assessment (NR-IQA) models are critically vulnerable to adversarial attacks, posing significant risks to downstream vision systems. However, existing attack methods suffer from high computational costs, reliance on Mean Opinion Score (MOS) annotations, and poor cross-model transferability. To overcome these limitations, we propose Degrade-to-OverReconstruct (DOR), a novel prior knowledge-driven black-box attack framework operating in a “completely blind” manner, requiring neither MOS labels nor surrogate models, inducing significant prediction bias solely based on distortion statistics. Specifically, DOR generates universal adversarial examples by first applying mild degradation to preserve global structure and then employing aggressive over-reconstruction using a Residual Denoising Diffusion Model (RDDM) to adaptively disrupt intrinsic Natural Scene Statistics (NSS)—a shared foundation across NR-IQA models. Extensive experiments on synthetic (LIVE, TID2013) and authentic (CLIVE) datasets demonstrate DOR’s strong attack performance and superior transferability against leading NR-IQA models that cover diverse deep neural network architectures. Our work pioneers a diffusion model-based “completely blind” attack paradigm, offering a practical, MOS-free solution for adversarial robustness assessment of NR-IQA models in real-world deployments.

Abstract:
With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model’s limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model’s ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches. The code and dataset are released at https://github.com/wchchw/UCIS4K

Abstract:
Diffusion-based text-to-image generation has achieved remarkable progress and realistic content generation performance, greatly promoting the development of text-to-video generation. Although equipped with powerful image diffusion models, video generation modeling still requires massive labeled data and a high training resource cost. Recent work has focused on cost-effective video generation in a one-shot or few-shot manner based on the image diffusion model with minimum demand for video data and computing resources. However, these video generation models only support the generation of one single motion pattern/concept. This raises an important question: Can we improve generation freedom with a light training burden? In this paper, we explore a cost-effective video generation scheme for adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank for motion concepts and propose the Dual-Semantic-guided Motion Attention module to locate the corresponding motion elements from the bank with the guidance of textual and visual semantics. The extracted motion elements are inserted into video latents via a lightweight motion injection layer, which integrates motion semantics effectively with far fewer parameters than the conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependency and improve video smoothness. Extensive experiments validate that the proposed method can learn motion priors adaptively from a small set of training videos to generate smooth videos that involve either single or multiple motion concepts. The results demonstrate that the proposed scheme achieves superior performance compared to existing few-shot video generation methods and even some large-scale video generation models. More information and results are available at https://youncy-hu.github.io/motionprior/

Abstract:
Existing point cloud color upsampling methods typically treat color upsampling as an interpolation problem within a local color or implicit feature domain. This largely overlooks the ability of the frequency domain to capture color correlations in local point sets. To address this limitation, we propose a spectrum collaborative strategy that uses frequency decomposition on voxel blocks (VBs) to enhance point cloud color reconstruction. We first voxelize the low-resolution (LR) color point cloud to generate multiple VBs and introduce a virtual filling strategy that adaptively assigns colors to empty voxels in each VB, ensuring that the irregularly distributed color information fully occupies the VB. We then apply the discrete cosine transform, known for its strong frequency-domain representation of locally smooth signals, to each color-filled VB to obtain frequency coefficients. These frequency coefficients are separated into high-frequency (HF) and low-frequency (LF) components. The LF coefficients, together with the LR color point cloud, are fed into a multi-scale cross-domain feature extraction module to capture deep features. Next, a Gaussian perturbation-based feature expansion generates upsampled color features, which are used to regress a coarse upsampled color point cloud. Finally, a high-frequency-guided residual refinement module uses the HF coefficients to refine the coarse upsampled result and produce a high-fidelity color point cloud. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art methods. Our code will be publicly available at https://github.com/wangwenchaoxx/FD-SCU
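
The frequency decomposition step can be sketched in one dimension: apply an orthonormal DCT-II to a row of voxel color values and split the coefficients at a cutoff index. This is a simplified illustration, not the authors' implementation. The function names `dct2_ortho` and `split_lf_hf` and the index-based cutoff are assumptions, and a three-dimensional voxel block would need the transform applied along each axis.

```python
import math


def dct2_ortho(x):
    """Orthonormal 1D DCT-II of a sequence of samples."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        coeffs.append(scale * total)
    return coeffs


def split_lf_hf(coeffs, cutoff):
    """Split coefficients into low-frequency (index < cutoff) and high-frequency parts."""
    return coeffs[:cutoff], coeffs[cutoff:]


# Hypothetical red-channel values along one row of a filled voxel block.
flat_row = [4.0, 4.0, 4.0, 4.0]
ramp_row = [1.0, 2.0, 3.0, 4.0]
flat_lf, flat_hf = split_lf_hf(dct2_ortho(flat_row), cutoff=1)
ramp_coeffs = dct2_ortho(ramp_row)
```

A perfectly smooth row puts all of its energy into the first (DC) coefficient, which is why the DCT suits locally smooth color signals and why the virtual filling of empty voxels matters before the transform.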

Abstract:
Unsupervised Camouflaged Object Detection (UCOD) presents a significant challenge due to the inherent similarity between camouflaged objects and their backgrounds, compounded by the absence of manual annotations. Although pixel-level pseudo-labeling has proven effective for unsupervised salient object detection (USOD), it is far less reliable for COD, where the concealed and ambiguous nature of camouflaged objects frequently produces noisy pseudo-labels, causing misjudgments, missed detections, and imprecise boundaries. To overcome this, we propose SAPNet, a novel self-anchored progressive framework for UCOD. Rather than depending on noisy pixel-level supervision, we leverage semantically reliable foreground and background regions as high-confidence anchors. This effectively transforms the unsupervised problem into a more robust weakly supervised paradigm, reducing learning difficulty and mitigating overfitting to noise. SAPNet learns camouflaged objects progressively by first emphasizing these confident regions and then exploiting DINO’s contextual awareness to recover complete structures. Central to our framework is the semantic-driven region detector (SDRD), which employs cascaded convolutions and a residual attention projection mechanism to suppress background noise, filter erroneous information, and enhance spatial context, ensuring reliable supervision signals. Furthermore, a region-based context inference module (RCIM) is introduced to iteratively refine object boundaries by integrating multi-level semantic features under the guidance of these refined region-level anchors. Extensive experiments on four benchmark COD datasets demonstrate that SAPNet significantly outperforms state-of-the-art unsupervised methods. The source code of our SAPNet is available at https://github.com/ArloJie/SAPNet

Abstract:
Detecting small air targets is an important task in civil aviation. However, the weak characteristics of these targets make detection challenging. Hyperspectral imagery (HSI) provides a new approach to small air target detection because it captures spatial and spectral information simultaneously. In this article, we propose a spectral-spatial enhanced local contrast strategy for hyperspectral small air target detection. An unsupervised band selection method based on local contrast (LC-UBSM) is designed to choose the bands that best discriminate between target and background in HSI. Then, we develop an improved RX detection algorithm with combined spatial and spectral variance (CSSV-RX) to detect the target while suppressing both background and noise. Experimental results on a real GAOFEN-5 dataset and a simulated dataset based on the EO-1 (Earth Observing-1) satellite validate the effectiveness and robustness of the proposed method.
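
For context, the classical global RX detector that CSSV-RX builds on scores each pixel by its squared Mahalanobis distance from the scene mean. The sketch below implements only that baseline for two-band pixels, using the closed-form 2x2 matrix inverse to stay dependency-free. It does not include the combined spatial and spectral variance terms of CSSV-RX, which the abstract does not spell out, and the function name `rx_scores` is hypothetical.

```python
def rx_scores(pixels):
    """Global RX anomaly scores for a list of 2-band pixels (b1, b2).

    Each score is the squared Mahalanobis distance to the scene mean.
    """
    n = len(pixels)
    m1 = sum(p[0] for p in pixels) / n
    m2 = sum(p[1] for p in pixels) / n
    # Entries of the 2x2 sample covariance matrix.
    c11 = sum((p[0] - m1) ** 2 for p in pixels) / (n - 1)
    c22 = sum((p[1] - m2) ** 2 for p in pixels) / (n - 1)
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in pixels) / (n - 1)
    det = c11 * c22 - c12 * c12
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    scores = []
    for p in pixels:
        d1, d2 = p[0] - m1, p[1] - m2
        # d^T C^{-1} d, expanded with the closed-form 2x2 inverse.
        scores.append((c22 * d1 * d1 - 2.0 * c12 * d1 * d2 + c11 * d2 * d2) / det)
    return scores


# Six background pixels plus one spectrally distinct target (last entry).
scene = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2),
         (1.2, 1.0), (0.8, 0.9), (3.0, 0.2)]
scores = rx_scores(scene)
target_index = scores.index(max(scores))
```

Because the statistics are estimated from the whole scene, a target that is too bright or too large contaminates the covariance, which is one motivation for improved RX variants.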

Abstract:
3D Semantic Scene Completion (SSC) aims to infer voxel-level occupancy and semantics from partial 2D observations. However, existing methods often rely on global attention or uniform voxel modeling, which may cause semantic interference across unrelated regions and degrade performance under occlusion. To address this, we propose HGroupScene, a unified framework that integrates spatial priors and region-constrained reasoning for robust SSC. HGroupScene introduces: 1) a Hierarchical Grouping Module that partitions the voxel space into subregions and performs semantic aggregation via differentiable Gumbel-Softmax attention; 2) a dual-branch architecture composed of an Explicit Constraint Branch for extracting region-level structural features and an Implicit Diffusion Branch for fine-grained semantic reasoning; and 3) a Region-Constrained Feature Diffusion Mechanism that enables controllable feature propagation under the guidance of structural region priors. Extensive experiments on SemanticKITTI and SSCBench-KITTI360 under both single-frame and multi-frame settings demonstrate that HGroupScene achieves competitive or superior performance compared to state-of-the-art methods, validating the effectiveness of spatially structured semantic modeling.
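
The Gumbel-Softmax trick named in the Hierarchical Grouping Module turns a hard, non-differentiable group assignment into a soft one by adding Gumbel noise to the logits and applying a temperature-scaled softmax. The following is a minimal pure-Python sketch of one sample. The function name, the clamping constant, and the example logits are assumptions, and the sketch does not show how HGroupScene computes its assignment logits.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed categorical sample with the Gumbel-Softmax trick."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Soft assignment of one voxel to three hypothetical subregions.
weights = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
```

Lower temperatures push the weights toward a one-hot assignment, while higher temperatures keep them closer to uniform.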

Abstract:
Diffusion-based methods have achieved remarkable success in photorealistic image generation, leveraging iterative denoising steps to improve image quality. However, multi-step denoising often suffers from error accumulation (similar to exposure bias in autoregressive models) due to suboptimal noise estimation, which can lead to degraded semantic alignment and image fidelity. To tackle the challenge of suboptimal inner latent representations in generation, this paper introduces NoisePO, an efficient semantic noise preference optimization framework. NoisePO employs a semantic noise preference optimization generative adversarial network (NPO-GAN) and noise ranking methods to search for semantically relevant noises based on textual conditions, thus eliminating undesired semantic features while emphasizing the necessary ones. Specifically, NoisePO utilizes a light NPO-GAN to generate semantic noises that encourage the latent at the previous step to incorporate more semantic information from the caption. Then, light ranking models are employed to filter out low-quality noises and select the best noise. Experimental results demonstrate that NoisePO consistently outperforms the baselines across widely used frameworks, achieving notable improvements in image quality, semantic consistency, and user-specific alignment as measured by IS, FID, CLIP, and other metrics. These results indicate that NoisePO effectively enhances synthesis quality and strengthens text-image alignment.

Abstract:
While Vision-Language Models (VLMs) have achieved remarkable success in tasks involving natural RGB images, their capability to understand non-RGB sensor data, including thermal, depth, hyperspectral, and X-ray imagery, remains severely limited. This limitation stems from an entrenched RGB-centric bias, leading current VLMs to treat these distinct modalities as ordinary photographs, thus failing to account for their unique physical properties. To systematically evaluate and address this pervasive issue, we present CausalSense, a novel benchmark suite designed to expose RGB-centric bias within large-scale VLMs using non-RGB sensor data. Concurrently, we devise a causal learning framework specifically engineered to alleviate this RGB-bounded bias. Our approach effectively employs confounder dictionaries and backdoor adjustments from causal inference to integrate essential sensor-specific knowledge into VLMs, circumventing the need for extensive retraining on massive datasets. Our comprehensive evaluations using CausalSense underscore a significant performance deficiency in state-of-the-art VLMs concerning non-RGB vision sensor comprehension. Crucially, we demonstrate that our proposed causal deconfounded cross-modal encoder substantially improves VLMs’ ability to reason about the physical attributes captured by these modalities, thereby achieving a measurable reduction in the observed performance gap. This combined benchmark and framework pave the way for developing more resilient and sensor-aware vision–language models, capable of robustly interpreting diverse real-world phenomena beyond the visible spectrum.

Abstract:
User-generated visual content (UGC) now occupies a significant fraction of internet traffic, and billions of UGC videos and pictures are uploaded daily. Among these, short-form video content now accounts for most of the videos consumed by online users. Given the popularity of short-form UGC content, being able to control the perceptual quality of UGC videos has emerged as an important problem. Visual UGC is subject to myriad types, severities, and combinations of distortions. While UGC video quality has been closely studied, the quality and legibility of text that is overlaid or embedded in short-form UGC videos has received relatively little attention. However, being able to accurately predict text quality in images is important, since it impacts both the overall perception of the content it is embedded in and the messages being conveyed. It is also beneficial for applications involving image or video text recognition, which can affect visual search and content identification. Analyzing the quality of text embedded in pictures or videos is a hard problem, since perception of it is commingled with the surrounding visual content. Our work, which greatly extends our early report on text legibility prediction, contributes both to the psychophysics of embedded text quality and to computational models of its perception. We have created two subjective datasets: the LIVE-COCO Text Legibility (LIVE-COCO-TL) Database (a modification of COCO-Text), and the LIVE-YouTube Text-in-Video Quality (LIVE-YT-TVQ) Database. LIVE-COCO-TL contains 74,440 text patches with legibility annotations, while LIVE-YT-TVQ contains ~19K subjective quality ratings on 405 videos and 641 text patches extracted from them. We build models that predict embedded or overlaid text legibility and text quality, as well as a multi-task model that simultaneously predicts the overall quality of videos with embedded or overlaid text and the local text quality.
We are making the databases and all models freely available at https://live.ece.utexas.edu/research/LIVE_YouTube_Text_Quality_Assessment/index.html

Abstract:
Rendering realistic human-object interactions (HOIs) from sparse-view inputs is a challenging yet crucial task for various real-world applications. Existing methods often struggle to simultaneously achieve high rendering quality, physical plausibility, and computational efficiency. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient HOI rendering with physically plausible geometric constraints from sparse views. HOGS represents both humans and objects as dynamic 3D Gaussians. Central to HOGS is a novel optimization process that operates directly on these Gaussians to enforce geometric consistency (i.e., preventing inter-penetration or floating contacts) to achieve physical plausibility. To support this core optimization under sparse-view ambiguity, our framework incorporates two pre-trained modules: an optimization-guided Human Pose Refiner for robust estimation under sparse-view occlusions, and a Human-Object Contact Predictor that efficiently identifies interaction regions to guide our novel contact and separation losses. Extensive experiments on both human-object and hand-object interaction datasets demonstrate that HOGS achieves state-of-the-art rendering quality and maintains high computational efficiency.

Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China; College of Software Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China; School of Computer Science and Technology, Hainan University, Haikou, China; School of Information and Communication Engineering, Hainan University, Haikou, China; School of Data Science, Lingnan University, Hong Kong, SAR, China; School of Control Science and Engineering, Shandong University, Jinan, China

Abstract:
Vision Transformer (ViT) and its variants have achieved significant success in computer vision. However, their performance may degrade in underwater dense prediction tasks due to challenges such as complex underwater environments, quality degradation, and light scattering in underwater images. To solve this problem, we propose the Vision Transformer Underwater-Adapter (ViT-UWA), the first detail-focused and adapted ViT backbone for underwater dense prediction tasks, without requiring task-specific pretraining. In ViT-UWA, we first introduce a High-frequency Components Prior (HFCP) to add high-frequency information of underwater images to the plain ViT, which helps recover and capture lost high-frequency information. Then, we propose a Detail Aware Module (DAM) to obtain a detail-focused multi-scale convolutional feature pyramid, which can be used in various dense prediction tasks. Through the ViT-DAM Cross Fusion (VDCF), we achieve bidirectional feature cross fusion between ViT and DAM. We evaluate ViT-UWA on multiple underwater dense prediction tasks, including semantic segmentation, instance segmentation, and object detection. With only ImageNet-22K pretraining, our ViT-UWA-B yields state-of-the-art 46.4 box AP and 44.2 mask AP on the USIS10K dataset, which demonstrates the superiority of our method. Our code is available at https://github.com/Linqirui/ViT-UWA

Abstract:
3D reasoning is crucial in areas like robotics and autonomous driving. Due to the high cost of 3D data acquisition, some recent methods attempt to enable LLMs to perform 3D reasoning through multi-view images, thereby transferring the powerful 2D reasoning capabilities of LLMs to 3D environments. However, these methods face challenges: either they use redundant views that contain many perspectives irrelevant to the question, or they rely on globally aggregated multi-view representations, losing the fine-grained vision-language correlations. To tackle these challenges, we propose 3DMulti-LLM, which mainly consists of three components: a COT selector, a question-guided fusion block, and pre-trained LLMs. Specifically, the COT selector first leverages the powerful chain-of-thought reasoning capabilities of LLMs to identify question-related multi-view images. In this way, 3DMulti-LLM can eliminate a substantial amount of interference from unnecessary viewpoints. Then, we propose a question-guided fusion block for integrating multi-view features via question-guided interaction among various viewpoints. Finally, the pre-trained LLMs are utilized to reason in 3D scenes directly through multi-view features. Notably, our approach understands the 3D scene solely through multi-view images, without requiring the input of point cloud information or additional 3D feature extraction. In our experiments, 3DMulti-LLM achieves impressive performance and surpasses existing 3D-input-free methods by +12.2% and +7.1% on the ScanQA and 3DMV-VQA datasets, respectively.

Abstract:
Restoring high-quality images from blurred videos is a highly challenging task, especially in severely blurred scenes. In recent years, event-based methods have achieved significant progress in video deblurring. However, the modal differences between events and images increase the difficulty of feature fusion. Additionally, the sparsity of events makes it difficult to restore some local details. To address these issues, we propose a new video deblurring method. Firstly, we design a cross-modal collaborative attention mechanism to effectively fuse features from blurred frames and event frames, thereby deeply extracting motion information from event frames. Secondly, we utilize a diffusion model to generate spatial guiding prior features, enhancing local details and textures. Furthermore, we propose an event-guided dynamic feature fusion module that adaptively integrates spatio-temporal information from neighboring frames. Experimental results on both synthetic and real datasets demonstrate that our method outperforms the current state-of-the-art approaches. The code is available at: https://github.com/Frank-Zhou-01/EDVD-main

Abstract:
Audio-Visual Speech Recognition (AVSR) has been studied for a long time in the literature. By leveraging the complementary information from both acoustic and visual modalities, this approach offers a promising solution for robust speech transcription. While recent AVSR models have achieved impressive performance on large-scale, uniformly distributed datasets, they often overlook the challenges posed by real-world scenarios—where data is collected across multiple sessions and environments, leading to significant domain shifts and heterogeneous distributions. Such heterogeneity can result in catastrophic forgetting and hinder the generalization ability of the conventional models. To bridge this gap, we introduce the Continual Audio-Visual Speech Recognition (CL-AVSR) problem, which formulates AVSR as a continual learning task. We establish a dedicated benchmark for CL-AVSR by designing three experimental scenarios that reflect real-world challenges: introducing varying background noise for the audio stream, degrading video quality for the visual stream, and dividing tasks by speaker characteristics to jointly affect both modalities. These scenarios systematically evaluate the model’s ability to adapt and retain knowledge across dynamic and non-stationary data streams. To address the unique challenges of CL-AVSR, we propose the Interaction-enhanced Multimodal Prompt learning (IMP) framework. IMP builds upon a pre-trained AV-HuBERT backbone and integrates task-relevant soft prompts with cross-modal and cross-task interactions, enabling efficient knowledge transfer from high-quality source domains to typical low-quality target domains with minimal parameter overhead. The interactive prompts facilitate fine-grained alignment and adaptation between modalities and tasks, while contrastive regularization further mitigates catastrophic forgetting. 
Furthermore, we devise a multi-modal prompt selection strategy that leverages clustering-based feature analysis, empowering the model to dynamically select optimal prompts for unseen data distributions during inference. Extensive experiments on the LRS2 dataset demonstrate that IMP achieves substantial improvements over strong baselines, setting new state-of-the-art performance in all CL-AVSR scenarios. Our results highlight the effectiveness of IMP in enhancing continual learning capabilities for AVSR, paving the way for more robust and adaptable multi-modal speech recognition systems in real-world applications.

Abstract:
While change detection (CD) is crucial for tracking dynamic changes on the Earth’s surface, it faces substantial challenges in real-world settings caused by seasonal variations and sensor-related interference. Current CD models often suffer performance degradation under such conditions, mainly due to two key challenges. First, most existing CD datasets lack sufficient temporal and environmental diversity, as they are typically collected over constrained time spans. This limits the models’ ability to generalize across varying conditions. Second, many CD methods are heavily data-driven and rely on simplified assumptions, leading to models that are not adequately designed to handle the complex, heterogeneous nature of real-world scenarios. Together, these challenges restrict the robustness and practical applicability of current CD approaches. To overcome these challenges, we make the following contributions in this paper: 1) Regarding data diversity, we construct a comprehensive benchmark by introducing five typical perturbations (fog, snow, motion blur, Gaussian noise, and impulse noise) into three classical CD datasets and supplementing them with a real-world seasonal dataset, resulting in 75 interference-rich scenarios. This enables a systematic evaluation under diverse real-world conditions, revealing that such perturbations induce severe distribution shifts across both temporal phases and hierarchical network layers, leading to substantial performance degradation in existing models. 2) Algorithmically, we propose Real-CD, a novel method specifically designed to address distribution shifts in real-world CD. The core of Real-CD is to leverage bi-temporal correlations to perform adaptive distribution alignment across hierarchical layers and temporal phases. Specifically, we propose the Distribution Shifts Alleviation Module (DSAM) to correct distribution shifts. 
The DSAM captures bi-temporal differences and similarities to formulate temporal-specific adjustment strategies for each LayerNorm (LN) layer. To stabilize the optimization of DSAM, we propose the Distribution Consistency Optimization Strategy (DCOS), which introduces a flip-based auxiliary task that encourages the model to maintain distributional consistency under complex bi-temporal disturbances. Consequently, our method outperforms other state-of-the-art approaches and achieves the best performance on the proposed dataset. Our datasets and code implementation will be available at https://github.com/fangyee-ISALAB/Real-CD

Abstract:
Effectively establishing correspondence between two images is at the centre of image registration methods. Spatially omnipresent representations, including dense displacement fields (DDFs) and spatial (non-)rigid transformations, have been used to parameterise such correspondence. Alternatively, region-based representation uses paired regions of interest (ROIs) to represent region-level correspondence, while retaining its local and dense representation capability at pixel/voxel level if required. Thus, registration can be re-envisioned as a problem of segmenting corresponding paired ROIs in the to-be-registered images. In this work, we utilize models such as SAM, which are pre-trained on substantive datasets, to segment ROIs of the same class from two images, for a new training-free, non-iterative registration algorithm. First, a “corresponding prompt problem” is posed to find a corresponding Prompt Y on Image Y, given any vision Prompt X on Image X, such that the two respectively prompt-conditioned segmentations are a pair of corresponding ROIs from the two images. Second, we propose an “inverse prompt” solution to the corresponding prompt problem, by inverting Prompt X to the Image Y prompt space, where the Jacobian of prototypical features is used. Third, we propose a new registration algorithm that identifies multiple paired corresponding ROIs, by marginalizing the inverted Prompt X over both prompt and spatial spaces, random sampling Prompt X and spatial warping Image X. Comprehensive experiments were conducted on five applications of registering 3D prostate MR, 3D abdomen CT, 3D lung CT, 2D histopathology and, as a non-medical example, 2D aerial images. 
Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and learning-based networks, even yielding competitive performance with weakly-supervised registration which requires fully-segmented training data.
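
For readers unfamiliar with the Dice metric mentioned above, the following minimal sketch shows how it is computed for two binary ROI masks. The function name `dice_coefficient`, the flat-list mask representation, and the toy masks are assumptions made for illustration; the evaluation code used in the paper is not described in the abstract.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two flat binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * overlap / size


# Toy example: a warped ROI from Image X against the corresponding ROI in Image Y.
warped_roi = [0, 1, 1, 1, 0, 0]
fixed_roi = [0, 0, 1, 1, 1, 0]
score = dice_coefficient(warped_roi, fixed_roi)  # 2 * 2 / (3 + 3)
```

A score of 1 means the two ROIs coincide exactly after registration, and 0 means they do not overlap at all.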

Abstract:
Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strongly augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even performs on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

Abstract:
Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Albedo post-processing. Our approach features two key innovations: 1) we opt to model shaded and albedo appearance channels, where the shaded channels enable the integration of intrinsic decomposition modules for material properties; and 2) leveraging multimodal large language models, we emulate artists’ techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.

Abstract:
Temporal Action Detection aims to localize and classify action instances within untrimmed videos, yet it remains challenging due to background clutter, high intra-class similarity, and varied temporal scales in real-world scenarios. To address these issues, we propose the Local-Pattern Separation and Global-Aware Network (LSGNet) tailored for temporal action localization. Specifically, the core of LSGNet is the Local Pattern Separation Module (LPSM), which explicitly models both consistency and variation patterns of action segments within local temporal windows. Additionally, to capture comprehensive contextual information, we introduce the Global Context-Aware Representation Module (GCRM), which decouples temporal features across multiple granularities and enables robust modeling of long-range dependencies. Finally, we design the Multi-scale Feature Refinement Module (MFRM) to mitigate the degradation of fine-grained information by performing iterative reconstruction across temporal scales, thereby enriching semantic representations and preserving temporal details. Extensive experiments on THUMOS14, ActivityNet1.3, HACS, and EPIC-Kitchens-100 demonstrate the effectiveness of the proposed LSGNet method. Additional ablation studies on the QVHighlights dataset further confirm the generalization capability of the LPSM module in video moment retrieval and highlight detection, achieving consistent improvements in retrieval accuracy and localization precision.

Abstract:
Weakly supervised 3D instance segmentation aims to reduce the high cost of point-wise annotations while maintaining competitive accuracy compared with fully supervised methods. Among various weak annotations, box annotation offers an ideal trade-off between labeling efficiency and supervision strength. However, most box-supervised methods rely on a two-stage training pipeline: 1) generating pseudo-labels, and 2) training the segmentation model with the pseudo-labels, which is iterative and sensitive to pseudo-label quality. To address this issue, we propose an end-to-end framework that directly learns instance masks from box annotations without explicit pseudo-label generation or iterative relabeling and retraining stages. Specifically, we introduce a boundary-aware refinement module that adaptively learns instance boundaries from boxes through level set evolution. Furthermore, we propose a multi-scale geometric augmentation module to alleviate semantic ambiguity in overlapping regions by applying cross-view consistency constraints on predictions. Finally, we construct a multi-objective optimization framework, which improves both the training stability of level-set-evolution-based boundary refinement and the overall segmentation performance. Extensive experimental results on both indoor and outdoor datasets demonstrate that our method achieves SOTA performance across multiple benchmark datasets with different backbones and closely approaches fully supervised counterparts.

Abstract:
Transformer-based approaches have achieved superior performance in image restoration, since they can model long-term dependencies well. However, the limitation in capturing local information restricts their capacity to remove degradations. While existing approaches attempt to mitigate this issue by incorporating convolutional operations, the core component in Transformer, i.e., self-attention, which serves as a low-pass filter, could unintentionally dilute or even eliminate the acquired local patterns. In this paper, we propose HIT, a simple yet effective High-frequency Injected Transformer for image restoration. Specifically, we design a window-wise injection module (WIM), which incorporates abundant high-frequency details into the feature map, to provide reliable references for restoring high-quality images. We also develop a bidirectional interaction module (BIM) to aggregate features at different scales using a mutually reinforced paradigm, resulting in spatially and contextually improved representations. In addition, we introduce a spatial enhancement unit (SEU) to preserve essential spatial relationships that may be lost due to the computations carried out across channel dimensions in the BIM. Extensive experiments on 6 tasks (real noise, rain streak, blur, flare, underwater conditions, and low-light conditions) demonstrate that HIT with linear computational complexity performs favorably against the state-of-the-art methods. The source code is available at https://github.com/joshyZhou/HIT_.

Abstract:
Understanding the evolution of 3D scenes is crucial for autonomous driving. While conventional methods describe scene development through individual instance motions, world models provide a generative framework for modeling overall scene dynamics. However, most existing approaches rely on autoregressive next-token prediction, which suffers from error accumulation and limited global spatiotemporal reasoning, leading to degraded long-term consistency. To address these issues, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate 3D world evolution for autonomous driving. A 4D scene tokenizer is introduced to obtain compact spatiotemporal representations and enable high-quality reconstruction of long occupancy sequences. We then train a diffusion transformer on these representations to generate 4D occupancy conditioned on trajectory prompts. Experiments on the nuScenes dataset with Occ3D annotations show that OccSora can generate 16s videos with authentic 3D layout and strong temporal consistency. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for autonomous driving decision-making. Project page: https://wzzheng.net/OccSora

Abstract:
Underwater light absorption and scattering lead to severe color distortion, reduced visibility, contrast loss, and a significant degradation in image quality, thereby impeding both human visual analysis and machine vision tasks. Although considerable progress has been achieved in improving image quality, existing deep learning-based methods for underwater image enhancement (UIE) remain constrained by high computational complexity and insufficient modeling of global dependencies, which restricts their practical deployment in resource-limited underwater environments. To tackle these issues, we propose a novel hybrid framework integrating Retinex theory and state-space models (SSMs) for underwater image enhancement, named HRMamba. Different from existing Transformer-based approaches constrained by quadratic complexity, HRMamba attains computational efficiency through linear-complexity state-space operations while maintaining global dependency modeling capabilities. Moreover, to achieve comprehensive feature fusion, an Illumination Feature Fusion Module (IFFM) is proposed, which synergizes the global dependency modeling of SSMs with the local adaption capability of convolutional neural networks (CNNs). For context-sensitive noise suppression with illumination awareness, we propose an Illumination-Guided Denoising Module (IGDM) that employs directional-scanning Vision State Space Module (VSSM) blocks. Experiments demonstrate that HRMamba achieves state-of-the-art enhancement quality via an efficient architecture, significantly improving color fidelity and visibility restoration while substantially reducing computational demands. The code is available at https://github.com/YeFan-web/HRMamba/

Affiliations: Department of Precision Instrument, Center for Brain Inspired Computing Research (CBICR), Tsinghua University, Beijing, China; College of Computer Science and Technology, Taiyuan University of Technology, Shanxi, China; Engineering Laboratory of Power Equipment Reliability in Complicated Coastal Environments, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; School of Automation, Beijing Institute of Technology, Beijing, China; Department of Data Science and Artificial Intelligence and the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, SAR, China

Abstract:
RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design an ISTA adapter for bidirectional feature interaction between the two branches. The ISTA adapter is derived from the sparse representation theory by unfolding the iterative shrinkage-thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency. This work highlights the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git
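
The iterative shrinkage-thresholding algorithm that the ISTA adapter is said to unfold can be sketched in its classical form as follows. This is only the textbook sparse-coding iteration for the objective 0.5 * ||A x - y||^2 + lam * ||x||_1, not the learned adapter from ISTASTrack; the function names, the small matrix `A`, and the value of `lam` are assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero by threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(A, y, x, lam):
    """Evaluate 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    residual = [sum(a * v for a, v in zip(row, x)) - t for row, t in zip(A, y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(v) for v in x)


def ista(A, y, lam, num_iters=200):
    """Minimise the lasso objective with plain (non-learned) ISTA."""
    rows, cols = len(A), len(A[0])
    # Step size 1/L, with L the squared Frobenius norm of A: a simple upper
    # bound on the largest eigenvalue of A^T A, so the step is always safe.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(num_iters):
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - y[i] for i in range(rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        x = [soft_threshold(x[j] - grad[j] / L, lam / L) for j in range(cols)]
    return x


# Toy dictionary with three atoms and a two-dimensional observation.
A = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
y = [1.0, 0.0]
x_hat = ista(A, y, lam=0.1)
```

Unfolding, in general, treats a fixed number of these iterations as network layers whose step sizes and thresholds are learned rather than fixed.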

Abstract:
Recent advances in surface reconstruction for 3D Gaussian Splatting (3DGS) have enabled remarkable geometric accuracy. However, their performance degrades in photometrically ambiguous regions such as reflective and textureless surfaces, where unreliable cues disrupt photometric consistency and hinder accurate geometry estimation. Reflected light is often partially polarized in a manner that reveals surface orientation, making polarization an optical complement to photometric cues in resolving such ambiguities. Therefore, we propose PolarGS, an optics-aware extension of RGB-based 3DGS that leverages polarization as an optical prior to resolve photometric ambiguities and enhance reconstruction accuracy. Specifically, we introduce two complementary modules: a polarization-guided photometric correction strategy, which ensures photometric consistency by identifying reflective regions via the Degree of Linear Polarization (DoLP) and refining reflective Gaussians with Color Refinement Maps; and a polarization-enhanced Gaussian densification mechanism for textureless area geometry recovery, which integrates both Angle and Degree of Linear Polarization (A&DoLP) into a PatchMatch-based depth completion process. This enables the back-projection and fusion of new Gaussians, leading to a more complete reconstruction. PolarGS is framework-agnostic and achieves superior geometric accuracy compared to state-of-the-art methods.
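
The two polarization cues named above, the Degree and Angle of Linear Polarization, are conventionally derived from the linear Stokes parameters. The sketch below assumes intensities measured behind polarizers at 0, 45, 90, and 135 degrees, which is a common sensor layout but is not stated in the abstract; the function name and the sample values are likewise illustrative assumptions, not PolarGS code.

```python
import math


def linear_polarization(i0, i45, i90, i135):
    """Return (DoLP, AoLP in radians) from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal minus vertical
    s2 = i45 - i135                     # diagonal minus anti-diagonal
    if s0 <= 0.0:
        return 0.0, 0.0  # no light: polarization is undefined, report zeros
    dolp = math.hypot(s1, s2) / s0      # fraction of light that is linearly polarized
    aolp = 0.5 * math.atan2(s2, s1)     # orientation of the polarization axis
    return dolp, aolp


# Fully horizontally polarized light of unit intensity (Malus's law values).
dolp, aolp = linear_polarization(1.0, 0.5, 0.0, 0.5)
```

A DoLP close to 1 indicates strongly polarized light, which is the kind of cue the abstract describes using to flag reflective regions.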

Abstract:
All-in-focus (AIF) images, which contain comprehensive scene information with global sharpness, play a crucial role in high-precision light field (LF) measurement and computational imaging. However, generating AIF images from LF data typically requires accurate depth priors, which are often unavailable or unreliable in practice. To overcome this limitation, directly fusing a series of LF refocused images provides an effective alternative that eliminates the dependency on explicit depth estimation. Nevertheless, existing multi-focus image fusion (MFIF) methods are primarily designed for fusing image pairs with complementary focus, performing poorly when applied to stacks due to the error accumulation that occurs during iterative fusion. To this end, we propose a Frequency-Decoupled Stack Fusion Network (FDSNet) for high-precision depth-free LF AIF image generation. FDSNet incorporates a spatial-frequency joint feature extraction module that captures multi-scale spatial details while decoupling high- and low-frequency components to model textures and contextual information separately, thereby alleviating edge blurring caused by subtle focal variations and weak textures in transition regions. Moreover, a dual-stage cross-attention fusion module, following a coarse-to-fine strategy, suppresses artifacts, enhances edge fidelity, and enables simultaneous fusion of arbitrary numbers of refocused images, thereby avoiding error accumulation and computational redundancy. Extensive experiments on both synthetic and real LF datasets demonstrate that FDSNet achieves superior visual quality and quantitative performance. Additional experiments further demonstrate that FDSNet performs robustly under varying low-light and noisy conditions. These results validate that FDSNet delivers excellent fusion capability in terms of image clarity, detail preservation, noise resistance, and generalization, outperforming existing state-of-the-art methods.

Abstract:
In resource-constrained vehicle systems, establishing consistency between multi-view scenes and driver gaze remains challenging. Prior methods mainly focus on cross-source data fusion, estimating gaze or attention maps through unidirectional implicit links between scene and facial features. Although bidirectional projection can correct misalignment between predictions and ground truth, the high resolution of scene images and complex semantic extraction incur heavy computational loads. To address these issues, we propose a lightweight driver-attention estimation framework that leverages geometric consistency between scene and gaze to guide feature extraction bidirectionally, thereby strengthening representation. Specifically, we first introduce a lightweight feature extraction module that captures global and local information in parallel through dual asymmetric branches to efficiently extract facial and scene features. An information cross fusion module is then designed to promote interaction between the scene and gaze streams. The multi-branch architecture extracts gaze and geometric cues at multiple scales, reducing the computational redundancy caused by mixed features when modeling geometric consistency across both views. Experiments on a large public dataset show that incorporating scene information introduces no significant computational overhead and yields a better trade-off between accuracy and efficiency. Moreover, leveraging bidirectional projection and the temporal continuity of gaze, we preliminarily explore the framework’s potential for predicting attention trends.

Abstract:
Hyperspectral image classification (HSIC) is a valuable method for identifying coastal wetland vegetation, but challenges like environmental complexity and difficulty in distinguishing land cover types make large-scale labeling difficult. Cross-domain few-shot learning (CDFSL) offers a potential solution to limited labeling. Existing CDFSL HSIC methods have made significant progress, but they still face challenges such as prototype deviation and covariate shifts, and they rely on complex domain alignment (DA) methods. To address these issues, a feature reconstruction-based CDFSL (FRFSL) algorithm is proposed. Within FRFSL, a Prototype Calibration Module (PCM) is designed to address prototype deviation; it employs a Bayesian inference-enhanced Gaussian Mixture Model to select reliable query features for prototype reconstruction, aligning the prototypes more closely with the actual distribution. Additionally, a ridge regression closed-form solution is incorporated into the Distance Metric Module (DMM), employing a projection matrix for prototype reconstruction to mitigate covariate shifts between the support and query sets. Features from both source and target domains are reconstructed into dynamic graphs, transforming DA into a graph matching problem guided by optimal transport theory. A novel shared transport matrix implementation algorithm is developed to achieve lightweight and interpretable alignment. Extensive experiments on three self-constructed coastal wetland datasets and one public dataset show that FRFSL outperforms eleven state-of-the-art algorithms. The code will be available at https://github.com/Yqx-ACE/TIP_2025_FRFSL
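
The ridge regression closed-form solution mentioned above has a standard form, W = (XᵀX + λI)⁻¹XᵀY. The sketch below shows only that generic formula; how FRFSL builds X and Y from support and query features is not specified in the abstract, so the data and the names `X`, `Y`, and `lam` are illustrative assumptions.

```python
import numpy as np

def ridge_closed_form(X, Y, lam=1e-2):
    """Solve min_W ||X W - Y||_F^2 + lam * ||W||_F^2 in closed form."""
    d = X.shape[1]
    # Solving the normal equations is more stable than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical features: 50 samples, 5 input dimensions, 3 output dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
W_true = rng.standard_normal((5, 3))
Y = X @ W_true
W_hat = ridge_closed_form(X, Y, lam=1e-8)
```

Because the solution is a single linear solve, it is differentiable and cheap to recompute per episode, which is presumably why such a step suits a few-shot metric module.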

Abstract:
Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by the sliding window do not account for the actual proportion of micro-expressions in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features, thereby neglecting subtle features. In this paper, we propose a micro-expression analysis method called Spot-Then-Recognize Method (STRM), which integrates spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address the second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: Multi-Scale Shared Network (MSSN), Spotting Network, and Recognition Network. The MSSN initially extracts micro-expression features by performing multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), which are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long video datasets (CAS(ME)², SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall performance of 58.24%, a 19.62% improvement, and a 1.51× gain over the baseline in terms of micro-expression analysis.

Affiliations: Marshall Laboratory of Biomedical Engineering, School of Biomedical Engineering, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, China; College of Management, Shenzhen University, Shenzhen, China; School of Science and Engineering, University of Dundee, Dundee, U.K.; School of Information Science and Engineering, Ningbo University, Ningbo, China; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China

Abstract:
A low-light colonoscopy video enhancement method is needed as poor illumination in colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between them, which often causes a flickering problem. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated feature of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.

Abstract:
Hyperspectral imaging (HSI) captures abundant spectral information of land covers while light detection and ranging (LiDAR) provides elevation and structural characteristics. Joint classification of HSI and LiDAR data can effectively merge spectral and elevation information to enhance the outcome of land cover classification. Current HSI and LiDAR joint classification approaches mainly employ a three-layer deep network to extract high-order features, followed by a concatenation or weighted fusion scheme, which cannot fully exploit the unique properties of different data modalities. Meanwhile, these methods usually require high computational resources. To alleviate these issues, this paper proposes a masked self-attention fusion network (MSAF) for joint HSI and LiDAR classification, where a cascaded cross-attention fusion framework is designed to fully merge different stages of features. First, a mobile convolution block is developed to extract multi-modal data features. Then, a multi-view sequence embedding method is proposed to effectively integrate elevation information and spectral-spatial information so as to obtain token sequences. Finally, an effective masked self-attention mechanism is designed to fuse token sequences. Experimental results on multiple datasets indicate that the proposed framework significantly outperforms other advanced multi-modal fusion methods in terms of classification performance and computing efficiency. The code of this manuscript is available at https://github.com/lulushh/MSAF

Affiliations: Institute of Biomedical Manufacturing and Life Quality Engineering, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Oral Surgery, Shanghai Ninth People’s Hospital, College of Stomatology, Shanghai Key Laboratory of Stomatology and Shanghai Research Institute of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China; School of Mechanical Engineering and the Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Orthognathic surgery demands precise preoperative planning to achieve optimal functional and aesthetic results, yet current practices remain labor-intensive and highly dependent on surgical expertise. To address these challenges, we propose OrthoPlanner, a novel two-stage framework for automated orthognathic surgical planning. In the first stage, we develop JawFormer, a shape-sensitive transformer network that predicts postoperative bone morphology directly from preoperative 3D point cloud data. Built upon a point cloud encoder-decoder architecture, the network integrates anatomical priors through a region-based feature alignment module. This enables precise modeling of structural changes while preserving critical anatomical features. In the second stage, we introduce a symmetry-constrained rigid alignment algorithm that automatically outputs the precise translation and rotation of each osteotomized bone segment required to match the predicted morphology. This ensures bilateral anatomical consistency and facilitates interpretable surgical plans. Compared with existing approaches, our method achieves superior quantitative performance and enhanced visualization results, as demonstrated by 65 experiments on real clinical datasets. Moreover, OrthoPlanner significantly reduces planning time and manual workload, while ensuring reproducible and clinically acceptable outcomes.
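
To illustrate what recovering a translation and rotation for a bone segment involves, here is a minimal sketch of unconstrained rigid alignment between paired 3D points using the Kabsch (SVD-based) method. This is a swapped-in textbook technique, not OrthoPlanner's algorithm: it assumes known point correspondences and omits the symmetry constraint described in the abstract.

```python
import numpy as np

def kabsch(P, Q):
    """Find R, t minimising sum ||R p_i + t - q_i||^2 for paired (N, 3) points."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Hypothetical segment: rotate 30 degrees about z, then translate.
rng = np.random.default_rng(0)
P = rng.standard_normal((20, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R_est, t_est = kabsch(P, Q)
```

The returned `R_est` and `t_est` are directly interpretable as a rotation and a translation, which is the kind of output a surgical plan needs per segment.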

Abstract:
Volumetric images often encapsulate critical information, making it essential to employ lossless compression to preserve data integrity. Although various learned methods have demonstrated effective lossless compression for volumetric images, balancing high compression ratios with rapid coding speeds and lightweight architectures remains challenging. In this paper, we propose a 3D-scanning lightweight autoregressive model (3D-SLARM) for practical lossless volumetric image compression. 3D-SLARM integrates a novel 3D plane scanning module, a lightweight feature extraction (FE) module, and a lightweight distribution parameter and adaptive range predictor (DPARP) module. Initially, 3D-SLARM leverages a 3D plane scanning module to determine the scanning order of each voxel, allowing parallel coding of voxels within the same plane. Next, the lightweight FE module captures both intra-slice and inter-slice dependencies in the receptive field defined by the 3D plane scanning module. By incorporating our proposed serial re-parameterization (SerRep) technology alongside non-centric masked convolution (NCMC), the FE module attains a lightweight design while effectively capturing complex dependencies. Finally, 3D-SLARM employs a lightweight DPARP module to compute distribution parameters for both 8-bit and high bit-depth volumetric images. For high bit-depth images, the module further generates an adaptive probability range for each voxel, resulting in compact, voxel-specific PMF tables that facilitate efficient compression. Extensive experiments demonstrate that our 3D-SLARM achieves state-of-the-art lossless compression performance on the majority of volumetric image datasets and maintains fast coding speed with a lightweight design, underscoring its practical applicability.

Abstract:
Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.

Affiliations: Division of Biomedical Sciences, Bioengineering Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China; Computer Science Department, University of Exeter, Exeter, U.K.; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.; College of Optoelectronic Engineering, Chongqing University, Chongqing, China; Chinese Academy of Sciences, Ningbo Cixi Institute of Biomedical Engineering, Cixi, China; School of Bioengineering, Imperial College London, London, U.K.; Liverpool Centre for Cardiovascular Science, University of Liverpool, Liverpool, U.K.

Abstract:
Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, well-trained medical segmentation models on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models, which are crucial to medical image analysis, remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under closed-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model’s performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g., LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets (CMR dataset from UK Biobank, the SEG fundus dataset, the EchoNet echocardiography dataset, and the PraNet colonoscopy dataset) and five mainstream segmentation models. The results demonstrate the effectiveness, stealthiness, and harmlessness of our method on the original model’s segmentation performance. For example, when applied to the SAM model, StealthMark consistently achieved attack success rates (ASR) above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is made available at https://github.com/Qinkaiyu/StealthMark

Abstract:
Colored point clouds, comprising geometry and attribute components, are one of the mainstream representations enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method, which learns to model both geometry and attribute patterns and leverages the spatial attribute correlation. Firstly, we establish and release a large-scale dataset for colored point cloud up-sampling, named SYSU-PCUD, which has 121 large-scale colored point clouds with diverse geometry and attribute complexities in six categories and four sampling rates. Secondly, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework to up-sample the geometry and attribute jointly. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Thirdly, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarsely up-sampled attributes for each point. Then, we propose an attribute enhancement module to refine the up-sampled attributes and generate high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) values achieved by the proposed JGAU are 33.90 dB, 32.10 dB, 31.10 dB, and 30.39 dB when the up-sampling rates are 4×, 8×, 12×, and 16×, respectively. Compared to the state-of-the-art schemes, the JGAU achieves significant average PSNR gains of 2.32 dB, 2.47 dB, 2.28 dB, and 2.11 dB at the four up-sampling rates, respectively. The code is released at https://github.com/SYSU-Video/JGAU.
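
One plausible reading of a geometric-distance-weighted attribute interpolation is inverse-distance weighting over the k nearest source points. The sketch below illustrates that idea only; the abstract does not give the actual GDWAI weighting, so the inverse-distance kernel, the brute-force neighbor search, and the parameter `k` are all assumptions.

```python
import numpy as np

def idw_attribute_interpolation(src_xyz, src_attr, dst_xyz, k=3, eps=1e-8):
    """Assign each new point a distance-weighted mean of its k nearest colors."""
    # Brute-force pairwise distances, shape (M, N); fine for small clouds.
    dist = np.linalg.norm(dst_xyz[:, None, :] - src_xyz[None, :, :], axis=-1)
    idx = np.argsort(dist, axis=1)[:, :k]               # k nearest sources
    nn_dist = np.take_along_axis(dist, idx, axis=1)
    w = 1.0 / (nn_dist + eps)                           # closer points weigh more
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("mk,mkc->mc", w, src_attr[idx])    # (M, C) attributes

# Hypothetical sparse cloud: four colored corners of a unit square.
src_xyz = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
src_attr = np.array([[255.0, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]])
dst_xyz = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
dst_attr = idw_attribute_interpolation(src_xyz, src_attr, dst_xyz, k=4)
```

A coarse estimate of this kind would then be refined by a learned enhancement stage, as the abstract describes.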

Abstract:
Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical process of the low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, their nonlinear structures may violate the assumption of low-rankness and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA), is first established in this paper. To efficiently tackle the TRKPCA problem, two novel nonconvex regularizers, the kernelized tensor Schatten-p norm (KTSPN) and a generalized nonconvex regularization, are designed, where the former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness) and the latter ensures sparser structural coding, guaranteeing more robust separation results. Then, by integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show that our method is more competitive than other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code

Abstract:
Due to the lack of prior knowledge about unknown classes during training, existing methods for cross-domain open-set image recognition typically rely on threshold-based solutions. However, such approaches often struggle to capture the complex boundary relationships between known and unknown classes, which can lead to negative transfer effects caused by feature confusion between the two. To address this issue, this paper proposes a graph isomorphic distillation diffusion model (GIDDM) that aims to learn the boundary relationships between known and unknown classes from a closed-set classifier that models predictive uncertainty. First, a diffusion classifier is designed to quantify model predictive uncertainty through a Monte Carlo sampling strategy performed on the noise distribution during the reverse denoising process. The uncertainty distribution is modeled, and the cumulative distribution function is used to compute the probability of a sample belonging to an unknown class. Second, an open-set recognition framework is constructed, treating the closed-set diffusion classifier as a teacher classifier, and guiding the student classifier to learn the complex boundary relationships between known and unknown classes through knowledge distillation. Third, the knowledge distillation process is further formalized as a graph isomorphic optimization problem, where the predictive manifolds of the student and teacher classifiers are constrained to be consistent, thereby enhancing knowledge transfer between the classifiers. Finally, the entire process is integrated into a unified open-set adversarial domain adaptation framework, reconstructing the traditional optimization objectives of closed-set adversarial domain adaptation to ensure sufficient separation between known and unknown classes while aligning the distributions of known classes in both the source and target domains. Experiments conducted on multiple hyperspectral image (HSI) datasets demonstrate that the proposed method achieves state-of-the-art performance on cross-domain open-set image recognition tasks. The code demo can be accessed on the following website: https://github.com/wzr78998/GIDDM

Affiliations: Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, and the Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Hubei University, Wuhan, China; College of Sciences, China Jiliang University, Hangzhou, China; Department of Geography and Spatial Information Techniques, Ningbo Key Laboratory of Remote Sensing and Ecological Security of Coastal Zone and Zhejiang-Germany Joint Laboratory on Remote Sensing of Coastal Ecosystem, Ningbo University, Ningbo, China

Abstract:
Change detection (CD) in hyperspectral images (HSIs) has become an increasingly vital research field in remote sensing. Over the past few years, the adoption of deep learning approaches, particularly convolutional neural network (CNN) and transformer-based architectures, has significantly advanced performance in this field. While these models effectively capture spectral-spatial features, they may also introduce redundant or irrelevant spatial information, potentially degrading the accuracy of HSI CD. To address this challenge, a center-pixel and gated mechanism-based attention network (CGMNet) is proposed for HSI CD, leveraging the central pixel’s significance to enhance accuracy and robustness. First, a gated-based center spatial attention (GCSA) module is designed to emphasize spatial relationships surrounding the central pixel. By incorporating gating mechanisms, GCSA selectively enhances relevant spatial features while suppressing irrelevant information. Second, a gated-based spectral attention (GSA) module is proposed to dynamically highlight the most significant spectral features, ensuring an effective spectral representation. Finally, a global transform fusion (GTF) module is proposed to capture global contextual information and to fuse it with the extracted spatial and spectral features. Moreover, we introduce a novel benchmark dataset, named Hangzhou Bay (HZB), specifically designed to advance coastal remote sensing research. Experimental evaluations conducted on three publicly available datasets, as well as the HZB dataset, show that our CGMNet consistently outperforms some state-of-the-art methods in the HSI CD task. The source code of the proposed CGMNet, along with the HZB dataset, will be made publicly available at https://github.com/creativeXin/CGMNet

Affiliations: School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; MSU-BIT-SMBU Joint Research Center of Applied Mathematics, Shenzhen MSU-BIT University, Shenzhen, China; Institute of Forensic Science, Ministry of Public Security, Beijing, China; School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China; College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China

Abstract:
Fingerprint biometrics plays a crucial role in biometric identification, especially in applications such as criminal investigations. Although recent progress in recognition methodology has significantly enhanced automated fingerprint recognition, these systems still rely heavily on the quality of the input fingerprints. In criminal investigations, fingerprints are often of low quality due to their incidental deposition from natural oils and sweat, rather than being deliberately captured under controlled conditions. This degradation can significantly impact usability and identification accuracy, underscoring the need for effective Fingerprint Quality Assessment (FQA) methods. In this paper, we establish the Crime Scene Fingerprints quality assessment Dataset (CSFD-10k), the largest dataset of its kind, containing 11,500 fingerprint images from real criminal investigations. Of these, 10,000 samples are assigned Mean Opinion Scores (MOSs) for correlation testing, while the remaining 1,500 are labeled based on matching performance for generalizability testing. All labels are provided by frontline criminal police officers. Using this dataset, we propose a deep neural network-based Dual-Branch FQA (DB-FQA) framework that integrates image-level and edge-level features. The DB-FQA enhances ridge details by transforming raw grayscale fingerprints into edge maps using the Logical/Linear operator. A dual-branch network processes both the raw fingerprint and the edge map, and the Multi-scale Adaptive Cross feature Fusion (MACF) module fuses these features, guided by the edge map to highlight quality-related regions of interest. Extensive experiments demonstrate the robustness and superiority of our proposed method, offering substantial support for forensic fingerprint biometrics. The code and dataset are available at https://github.com/wzhsysu/FIQA.

Abstract:
Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network that mines and unifies the potential information inherent in coupled degradations within a single framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.

Abstract:
Completing multidimensional color images is a fundamental challenge in image processing and computer vision. However, some tensor-based methods often treat RGB channels as independent modes, thereby neglecting their intrinsic correlations. To address this limitation, we represent RGB values as pure quaternions and organize them into a quaternion tensor for holistic modeling that preserves chromatic relationships. To better capture the nonlinear characteristics inherent in visual data and to improve the compactness of low-rank representations, we propose a nonlinear transformation within the quaternion domain. This design enables more expressive modeling compared to conventional linear approaches. In addition, we introduce two novel regularization terms that jointly encode global low-rankness and local smoothness, with the nonlinear transformation further enhancing the exploitation of structural priors. The overall model is optimized via a nonlinear alternating direction method of multipliers (ADMM), with theoretical guarantees of convergence. Extensive experiments on several datasets demonstrate that the proposed method significantly outperforms state-of-the-art low-rank tensor and quaternion tensor recovery techniques in multidimensional color image completion tasks.
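
The pure-quaternion encoding of RGB values mentioned above can be sketched as follows: each pixel becomes 0 + R·i + G·j + B·k, so the three channels are handled as one algebraic object. The array layout (real part first) and the helper names are illustrative assumptions; the Hamilton product is the standard one.

```python
import numpy as np

def rgb_to_pure_quaternion(img):
    """Map an (H, W, 3) RGB image to (H, W, 4) quaternions with zero real part."""
    h, w, _ = img.shape
    q = np.zeros((h, w, 4), dtype=np.float64)
    q[..., 1:] = img            # i, j, k parts hold R, G, B
    return q

def quaternion_multiply(a, b):
    """Hamilton product over the last axis (real, i, j, k)."""
    a0, a1, a2, a3 = np.moveaxis(a, -1, 0)
    b0, b1, b2, b3 = np.moveaxis(b, -1, 0)
    return np.stack([
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    ], axis=-1)

# Hypothetical 2x2 image with values in [0, 1].
img = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                [[0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]])
q_img = rgb_to_pure_quaternion(img)
q_sq = quaternion_multiply(q_img, q_img)
```

Squaring a pure quaternion gives a real number equal to minus its squared norm, which shows how the product couples all three channels rather than treating them independently.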

Abstract:
Camera-based contactless monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver monitoring, perfusion assessment, affective computing, and more. iPPG involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the pulse rate or a full pulse waveform. Some previous iPPG methods impose model-based sparse priors on the pulse signals and use iterative optimization for pulse wave recovery, while others use end-to-end black-box deep learning methods. In contrast, we introduce methods that combine signal processing and deep learning in an inverse problem framework. Our methods estimate the underlying pulse signal, pulse rate, and pulse rate variability from facial video by learning deep-network-based denoising operators that leverage deep algorithm unfolding and deep equilibrium models. Experiments show that our methods can denoise an acquired signal from the face and infer the correct underlying pulse rate and pulse rate variability, achieving pulse rate estimation performance consistent with the state-of-the-art on well-known benchmarks, all with fewer than one-fifth of the learnable parameters of the closest competing method.

Abstract:
Continual video instance segmentation (CVIS) requires the plasticity to absorb new categories while maintaining the stability to retain previously learned knowledge. Crucially, the model must also preserve temporal consistency of instances across video frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), a framework tailored to address instance-wise, category-wise, and task-wise confusion in CVIS. For instance-wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated from category text and uses an adjustive query-prompt matching mechanism to build a mapping between the queries of the current task and the semantic residual prompts. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/LyuQi127/CRISP.

Abstract:
Continual person search has seen significant progress in recent years owing to its practical real-world applications. However, continual learning for person search presents significant challenges because it combines person detection and re-identification (Re-ID), giving rise to both domain-incremental and class-incremental learning problems. To address these challenges, we propose a novel framework that uses an adapter-based Swin Transformer backbone and incorporates two key components: Domain Aware Adapter (DAA) blocks and Virtual Prototype Replay-Online Instance Matching (VPR-OIM). Specifically, to solve the domain-incremental problem in object detection, we introduce parallel DAA blocks to handle multiple domains, while a Domain Prototype Router (DPR) mechanism dynamically routes each feature to the domain-specific adapter. Additionally, for class-incremental Re-ID, we extend the OIM loss with virtual prototype replay, which generates Gaussian distribution-based virtual features derived from historical prototypes, effectively enabling the model to preserve knowledge of previous identities while accommodating new identity categories. Overall, our proposed DAA and VPR-OIM simultaneously address the dual incremental challenges of continual person search. Experimental results demonstrate that our method significantly improves both person detection and Re-ID performance in continual learning settings, achieving state-of-the-art (SOTA) performance.
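
The virtual prototype replay idea (Gaussian virtual features derived from historical prototypes) can be illustrated with the rough sketch below. It assumes that each previous identity is summarized by a stored mean vector and a per-dimension standard deviation; the function name, the array shapes, and the choice of a diagonal Gaussian are hypothetical, since the abstract does not specify them.

```python
import numpy as np

def sample_virtual_features(prototypes, stds, n_per_id, rng):
    """Draw n_per_id Gaussian samples around each stored identity prototype.

    prototypes, stds: (num_ids, dim) arrays (assumed storage format).
    Returns (num_ids * n_per_id, dim) features and matching identity labels.
    """
    num_ids, dim = prototypes.shape
    noise = rng.standard_normal((num_ids, n_per_id, dim))
    feats = prototypes[:, None, :] + stds[:, None, :] * noise
    labels = np.repeat(np.arange(num_ids), n_per_id)
    return feats.reshape(-1, dim), labels

rng = np.random.default_rng(0)
old_prototypes = rng.standard_normal((3, 16))  # three past identities
old_stds = np.full((3, 16), 0.05)
virtual_feats, virtual_labels = sample_virtual_features(
    old_prototypes, old_stds, n_per_id=4, rng=rng
)
```

In a replay setting, such virtual features would be mixed with features of new identities when computing the matching loss, so that old identities remain represented without storing their images.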

Abstract:
General Text-to-3D (GT23D) generation is crucial for creating diverse 3D content across objects and scenes, yet it faces two key challenges: 1) ensuring semantic consistency between input text and generated 3D models, and 2) maintaining multi-view consistency across different perspectives within 3D. Existing approaches typically address only one of these challenges, often leading to suboptimal results in semantic fidelity and structural coherence. To overcome these limitations, we propose SeMv-3D, a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors by capturing spatial correspondences across three orthogonal planes using a dedicated Orthogonal Attention mechanism, thereby ensuring geometric consistency across viewpoints. Additionally, we present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis by leveraging attention-based feature alignment to reinforce the correspondence between textual semantics and triplane representations. Extensive experiments demonstrate that our method sets a new state-of-the-art in multi-view consistency, while maintaining competitive performance in semantic consistency compared to methods focused solely on semantic alignment. These results emphasize the remarkable ability of our approach to effectively balance and excel in both dimensions, establishing a new benchmark in the field.

Affiliations: Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong, SAR, China; State Key Laboratory of Multimedia Information Processing and the National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; Chinese Academy of Sciences, Institute of Automation, Beijing, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology, Jinan, Shandong, China

Abstract:
In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faces challenges because, in low light, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. A color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.

Abstract:
Semi-supervised learning (SSL) provides an effective means of reducing reliance on large-scale annotated datasets by leveraging unlabeled data. However, existing SSL methods often struggle with semantic ambiguity, especially under limited supervision. Recent studies have incorporated textual information to provide contextual guidance, yet most focus on feature fusion rather than emphasizing target semantics critical for segmentation. In this paper, we propose a novel Text-anchored Visual Decoupling (TeViD) framework for semi-supervised medical image segmentation. TeViD is built upon a teacher-student architecture with a dual-decoder design that explicitly disentangles target and background representations using both labeled and unlabeled data. For unlabeled data, a reversed cross-supervision mechanism is introduced to enhance decoder diversity and semantic separation. Furthermore, two contrastive learning objectives are proposed: a teacher-guided visual contrastive loss and a text-anchored contrastive loss, both designed to reinforce semantic disentanglement from visual and textual perspectives. Extensive experiments on five public datasets (covering X-ray, pathology, ultrasound, MRI, and CT) demonstrate that TeViD consistently outperforms both standard SSL and text-enhanced SSL methods, achieving average improvements of 5.72% in Dice and 8.15% in mIoU over the second-best competitor. The code is available at: https://github.com/jgfiuuuu/TeViD

Abstract:
Given a query sentence, text-to-image person retrieval aims to identify matched pedestrian images from a large gallery. Most existing methods are designed for the unified domain setting, which assumes that the training and test data are drawn from the same distribution. However, this assumption is difficult to guarantee in real application scenes, as data is often collected from various surveillance scenarios. To this end, we introduce the concept of single-source domain generalization into text-to-image person retrieval and propose a novel task called single-source domain generalizable text-to-image person retrieval (SSDG-TIPR). This task is applicable in real-world scenarios but poses significant challenges due to the limited training data that can be accessed. Intuitively, a trained model is most familiar with the domain on which it was trained, that is, the source domain. Therefore, to handle the SSDG-TIPR task, we propose a new method that brings astray features from unseen target domains infinitely close to the source domain, namely, to take it home (TIME), allowing the model to handle the features in a familiar manner. The proposed TIME method comprises three main modules: the Domain Astray Leading (DAL) module, the Domain Invariant Feature Extract (DIFE) module, and the Domain Home Taking (DoT) module. We evaluated TIME on 3 benchmark datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, and demonstrated its superior performance on 10 SSDG-TIPR sub-tasks as well as on 3 conventional TIPR sub-tasks, establishing a new state-of-the-art in both settings.

Abstract:
Existing image fusion methods focus on retaining more complementary information, but source images often suffer from motion blur owing to object motion, which results in distorted details in fused images and further deteriorates performance on high-level tasks. This paper proposes a novel visible and infrared image fusion framework capable of motion deblurring (MDbFusion++), which simultaneously performs image fusion and deblurring within a mutually reinforcing framework. MDbFusion++ employs a coarse-to-fine image restoration strategy and comprises two key components: a coarse deblurring part (CDP) and a fine deblurring and fusion part (FDFP). First, CDP transforms multi-modal images into features corresponding to spatial locations and leverages infrared features to coarsely compensate motion-blurred visible ones through an adaptive weights module (AWM). Subsequently, FDFP further restores fine visible features and achieves multi-modal image fusion in the spatial and frequency domains with the help of a multi-domain enhancement module (MEM). The deblurred visible features provide clear information to improve fusion results, and the improved fused images, in turn, provide gradient feedback to further improve deblurring. We evaluate our network in terms of both image deblurring and fusion, and extensive comparative experiments demonstrate the superior performance and distinct advantages of MDbFusion++.

Abstract:
Semantic segmentation has long suffered from the lack of a dataset comparable to ImageNet for image classification. This issue was partially alleviated by the advent of the segment anything model (SAM), which provides a foundation model trained on the largest and most diverse segmentation dataset to date. However, SAM often falls short in segmenting specific regions, mostly in biomedical images; this is why unsupervised domain adaptation (UDA) remains the best option for addressing the generalization challenge. Classical UDA methods might be ineffective in several biomedical segmentation cases because the gap between two datasets, known as domain shift, is too large. To address this issue, we propose a strategy based on learning the source mask probability distribution with a segmentation diffusion model, which serves as a generative prior to produce accurate target segmentation at inference. The latter can be guided by supplementary inputs, which allows us to draw on the rich information contained in the raw SAM segmentation both to perform adaptation and to improve robustness. A study was conducted using a comprehensive collection of segmentation datasets: 3 domains for mitochondria, 2 for the endoplasmic reticulum, and 2 for brain tumors, allowing the creation of 10 adaptation scenarios and providing an extensive test basis. The results of the experiments reveal that our proposed method outperforms various state-of-the-art UDA methods. Furthermore, ablation studies highlight the significant role of each component of our presented strategy. The code is available at: https://github.com/alex-stenger/GUDA

Abstract:
Despite advances in single image dehazing, robust dehazing for real-world road traffic scenes remains challenging due to scarce paired data, traffic-specific geometry, and real-time constraints. To address this issue, we propose a novel image prior for road traffic scenes, termed scene geometry prior (SGP), which leverages depth cues derived from the vanishing point (VP) to provide geometry-aware guidance and reduce reliance on paired training data. Our SGP comprises two components: a global SGP (G-SGP) that captures the global geometric distribution and a non-local SGP (NL-SGP) that corrects errors in the captured global distribution among obstructions belonging to the same category. Building on the proposed prior, we develop a lightweight and unsupervised road traffic image dehazing network (RTDnet). It consists of a main sub-network guided by the G-SGP to reconstruct the haze-free image, alongside two auxiliary sub-networks that leverage the NL-SGP and VP information to estimate the transmission map and the atmospheric light, respectively. During training, we introduce an atmospheric scattering model (ASM)-driven mutual-boost learning mechanism (ASM-ML), which is rooted in Bayesian theory and effectively integrates the strengths of different priors without mutual interference while distilling ASM-based physical knowledge into each sub-network. By coupling SGP with ASM-ML, RTDnet can be trained without paired traffic data by exploiting traffic-specific geometry, whose accurate guidance reduces the reliance on large model capacity and enables lightweight real-time deployment. Experiments demonstrate that our RTDnet surpasses state-of-the-art competitors in terms of restoration quality, efficiency, and model size. Moreover, its robust dehazing performance benefits downstream tasks operating in hazy conditions.
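
The atmospheric scattering model mentioned above is commonly written as I(x) = J(x) t(x) + A (1 - t(x)), with transmission t(x) = exp(-beta d(x)) for scene depth d(x). The sketch below applies and inverts that standard formula with NumPy. It is only an illustration of the physical model: the depth map, the scattering coefficient `beta`, the airlight value, and the clamping threshold `t_min` are all assumed here, and none of the network components of RTDnet are reproduced.

```python
import numpy as np

def transmission(depth, beta=1.0):
    """t(x) = exp(-beta * d(x)); beta is an assumed scattering coefficient."""
    return np.exp(-beta * np.asarray(depth, dtype=np.float64))

def add_haze(clear, t, airlight):
    """Forward model: I = J * t + A * (1 - t)."""
    t = t[..., None]  # broadcast the (H, W) map over color channels
    return clear * t + airlight * (1.0 - t)

def remove_haze(hazy, t, airlight, t_min=0.1):
    """Invert the model for J, clamping t to avoid dividing by near-zero."""
    t = np.maximum(t, t_min)[..., None]
    return (hazy - airlight) / t + airlight

rng = np.random.default_rng(0)
clear = rng.random((8, 8, 3))
depth = np.linspace(0.1, 1.0, 64).reshape(8, 8)  # synthetic depth map
t_map = transmission(depth, beta=1.0)
airlight = np.array([0.9, 0.9, 0.9])
hazy = add_haze(clear, t_map, airlight)
restored = remove_haze(hazy, t_map, airlight)
```

With the true transmission map and airlight the inversion is exact, which is why accurate estimates of these two quantities matter so much for dehazing quality.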

Abstract:
Unsupervised Domain Adaptation (UDA) has emerged as a pivotal technique for enhancing machine learning models’ performance in unlabeled target domains with domain shifts. This technique is fundamentally achieved by aligning the distributions of the source and target domains within a latent feature space, thereby enhancing model robustness across heterogeneous data distributions. However, the inherent discrepancy between source and target domain distributions poses significant challenges in identifying the optimal latent space. Furthermore, projecting both domains into suboptimal latent spaces may induce substantial semantic information loss, particularly compromising discriminative feature representations critical for final tasks. In this article, our systematic analysis reveals that natural language representations inherently possess stronger semantic abstraction capabilities than visual features in natural images. As a result, natural language tends to have smaller domain shifts. Motivated by this discovery, we propose a novel model that systematically transforms visual patterns into structured linguistic representations. This cross-modal translation mechanism leverages the invariant semantic properties of natural language to mitigate domain shifts while preserving task-critical semantic hierarchies. Our model leverages the inherent abstraction capacity of linguistic structures to enhance cross-domain generalization, effectively bridging the visual-semantic gap in unsupervised adaptation scenarios. Our model comprises three core components: 1) a text classification branch that translates images to text for prediction; 2) an image adaptation branch that supplements visual details; and 3) an ensemble mechanism that reconciles text abstraction with visual granularity through mismatch detection. Extensive experiments on three benchmark datasets validate the effectiveness of our model, achieving state-of-the-art performance.

Abstract:
Sparse Principal Component Analysis (SPCA) is a powerful technique for dimensionality reduction and feature extraction in high-dimensional data, with applications spanning various fields such as computer vision, pattern recognition, and data mining. However, the computational intensity of SPCA presents a significant challenge, necessitating the development of efficient and robust algorithms. In this paper, we shed light on the SPCA problem and uncover intriguing structures that enable us to design an efficient algorithm, which we have named SPCA_ACC. First, we identify a separable structure in this problem, which prompts us to draw on the Variable Projection (VP) strategy and generalize it to separable nonlinear problems on the Stiefel manifold. This strategy projects out part of the parameters to obtain a reduced problem, allowing the SPCA_ACC algorithm to optimize in a lower-dimensional parameter space. Second, we resolve the coupling between different parameters of the SPCA problem in the optimization process on a fixed coordinate-sparsity manifold, which opens the way to the use of a second-order Riemannian accelerated VP strategy. Moreover, we systematically analyze the advantages of using VP to solve the SPCA problem from a theoretical perspective, and confirm the local quadratic convergence of our algorithm. Numerical experiments on datasets of different sizes and types demonstrate that our method achieves rapid convergence and significantly reduces computational costs.

Abstract:
Given a text-to-image diffusion model pretrained on large-scale text-image pairs, can we align the model with human preferences without further fine-tuning? In this paper, we analyze the effect of alignment tuning in diffusion models by comparing the diffusion denoising trajectory between base and aligned models. Our findings reveal that alignment tuning primarily affects superficial stylistic aspects during denoising, rather than fundamental content, suggesting superficial alignment behaviors. Based on this discovery, we introduce a novel, training-free alignment approach (RSTFA) that leverages rejection sampling at specific stylistic timesteps, ensuring human preference alignment without fine-tuning or heavy inference overhead. We provide a theoretical analysis and derive a bias bound for our rejection-sampling alignment scheme. Empirically, we show that RSTFA better preserves sample diversity than reinforcement-learning-based tuning methods. Extensive experiments on Pick-a-Pic, COCO, HPD V2, and PartiPrompts show that our method not only achieves superior alignment with human preferences compared to state-of-the-art methods, but also reduces computational demands, establishing efficient, human-centered diffusion model alignment.

Abstract:
In human pose estimation, formulating keypoint localization as a classification task over discretized coordinate grids has proven effective. Essentially, the 2D features of the keypoints are reduced to 1D coordinate representations. This process leads to the loss of spatial constraints among keypoints and increases the difficulty for the model to capture their structural relationships. To address this issue, we propose an enhanced query attention mechanism constrained by bidirectional graphs. The core idea is to establish the topological constraints on the 1D coordinate representations. First, two fundamental connection directions of the skeleton are defined and encoded as a pair of adjacency matrices to enhance the feature interaction capability of the graph convolutional network (GCN). Second, a GCN-guided multi-scale feature fusion framework is designed to effectively combine multi-scale visual features with structural priors, thereby enhancing the representation of keypoint spatial distributions. Finally, a dual-gate module is incorporated into a GCN-guided attention unit to construct a structured query matrix constrained by the bidirectional skeleton graphs, which helps filter out spurious joint interactions and emphasize plausible ones. Extensive experiments on Tai Chi Chuan-Pose, Animal-Pose, AP-10K, MPII, COCO, and COCO-WholeBody datasets demonstrate that the proposed method outperforms existing methods in terms of both accuracy and robustness, particularly in balancing precise local keypoint localization with global pose consistency.
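
To make the idea of encoding two skeleton connection directions as a pair of adjacency matrices concrete, here is a small NumPy sketch. The 5-joint toy skeleton, the choice of parent-to-child and child-to-parent as the two directions, and the row normalization with self-loops are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton given as (parent, child) pairs.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]
NUM_JOINTS = 5

def directed_adjacency(edges, num_joints):
    """Adjacency matrix with a[parent, child] = 1 for each directed edge."""
    a = np.zeros((num_joints, num_joints))
    for parent, child in edges:
        a[parent, child] = 1.0
    return a

def row_normalize(a):
    """Add self-loops, then scale each row to sum to one."""
    a = a + np.eye(len(a))
    return a / a.sum(axis=1, keepdims=True)

A_forward = directed_adjacency(EDGES, NUM_JOINTS)  # parent -> child
A_backward = A_forward.T                           # child -> parent

# One propagation step per direction over per-joint features.
joint_feats = np.random.default_rng(0).random((NUM_JOINTS, 8))
mixed = row_normalize(A_forward) @ joint_feats + row_normalize(A_backward) @ joint_feats
```

Keeping the two directions in separate matrices lets a graph layer learn different weights for information flowing outward along the skeleton and inward toward the root.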

Abstract:
Parameter-efficient tuning (PET) has achieved promising performance on various downstream vision tasks. Despite their effectiveness for general classification, existing PET approaches neglect the over-concentration of channel-wise saliency and the feature redundancy of pre-trained models during fine-tuning, thus leaving much room for improvement when applied to downstream fine-grained recognition tasks. To address these issues, we propose a novel parameter-efficient tuning approach tailored for fine-grained recognition (FG-PET). Specifically, FG-PET first employs a Channel-wise Importance Equalization (CIE) module. It suppresses the concentrated salient channels while strengthening the remaining majority during fine-tuning, notably mitigating the over-concentrated saliency and thus activating more channels within pre-trained models to deliver abundant local visual clues. Furthermore, FG-PET develops an Efficient Navigator for Diversity (EFIND) by introducing a center-based loss and orthogonal constraints on features generated from distinct attention heads. It alleviates the redundancy between different attention maps, thus encouraging the models to explore diverse subtle visual differences in various discriminative local regions, which are critical for fine-grained recognition. Extensive experimental results on five public fine-grained benchmarks based on distinct ViT models demonstrate that the proposed method remarkably boosts the performance of existing PET approaches, and generalizes well to general classification tasks. The source code is available at FG-PET.

Abstract:
Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency Mining (UOT-CTM) mechanism and the Pathology Self-Correspondence Mining (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC images at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance. The code is available at: https://github.com/MIXAILAB/USIGAN

Abstract:
Adversarial attack strategies for 3D object detection have highlighted the critical importance of addressing security concerns in this domain. However, white-box methods require full access to the victim model in large-scale point cloud applications. To this end, we propose a novel Policy-Driven Black-box Attack (BAT) that is designed to optimize attack locations without requiring detailed knowledge of the victim models. First, we introduce a density-aware pattern generator that creates scene-adaptive attack clusters. Second, we leverage the deep deterministic policy gradient in deep reinforcement learning to train an attack agent capable of targeting the victim model. Ultimately, the attack agent is iteratively directed towards optimal attack locations through the joint application of critic loss and actor loss. To the best of our knowledge, this represents the first reinforcement learning-based black-box attack applied to practical 3D object detection. Experimental results on the KITTI, nuScenes, and Waymo datasets demonstrate that BAT effectively diminishes the accuracy of notable models. Importantly, BAT significantly enhances the attack success rate (surpassing both white-box and black-box state-of-the-art methods) and increases transferability (by 20 times) through a simple deep deterministic policy gradient, thus establishing a new baseline for adversarial attacks in 3D object detection.

Abstract:
Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in $\mathcal{J}\&\mathcal{F}$ scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.

Abstract:
This paper addresses the critical challenge of domain adaptation for LiDAR-based semantic segmentation, particularly the significant density disparities that emerge when transferring models from synthetic to real-world environments. We present DA2-LiDAR, a novel density-adaptive domain adaptation framework that bridges domain gaps through the construction of intermediate domains with density-varying point distributions. Our approach employs a simple yet effective masking strategy that systematically reduces density discrepancies between domains while extracting more effective supervisory signals, as well as preserving critical semantic information. The framework consists of three key components: (1) a Density Adaptation Module that establishes a continuous spectrum of intermediate domains through dataset-agnostic masking operations; (2) a Contextual Consistency Module that enforces relational coherence across differently masked variants of the same scan at varying degrees, providing additional supervision signals, enhancing the model’s ability to extract features; and (3) a Semantic Preservation Module that mitigates information loss in heavily masked scans by reconstructing domain-specific data distributions. Extensive experiments on synthetic-to-real and other benchmarks demonstrate that DA2-LiDAR consistently outperforms state-of-the-art methods, achieving significant improvements in cross-domain generalization without requiring dataset-specific prior knowledge or introducing computational overhead.
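
The notion of building intermediate domains by masking a scan to different densities can be sketched as follows. This example uses uniform random point dropping with a few fixed keep ratios, which is an assumption chosen for simplicity; the abstract only states that the masking is dataset-agnostic, and the function names and ratios here are hypothetical.

```python
import numpy as np

def mask_points(points, keep_ratio, rng):
    """Randomly keep a fraction of the points (a stand-in masking rule)."""
    n_keep = max(1, int(round(keep_ratio * len(points))))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[np.sort(idx)]  # preserve the original point order

def density_spectrum(points, keep_ratios, rng):
    """Build progressively sparser variants of one scan."""
    return [mask_points(points, r, rng) for r in keep_ratios]

rng = np.random.default_rng(0)
scan = rng.normal(size=(1000, 3))  # synthetic stand-in for a LiDAR scan
variants = density_spectrum(scan, [1.0, 0.75, 0.5, 0.25], rng)
sizes = [len(v) for v in variants]
```

Differently masked variants of the same scan share the underlying scene, which is what makes consistency constraints across them meaningful.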

Abstract:
Video Snapshot Compressive Imaging (SCI) captures multiple video frames in a single exposure, enabling efficient reconstruction of high-speed scenes for motion analysis and event detection. Existing video SCI methods based on coded aperture compressive temporal imaging (CACTI) predominantly rely on feedforward deep networks with fixed denoising strategies. However, they lack alignment with the SCI physical inverse model and struggle to balance motion detail recovery and static background smoothing. In this paper, we propose PCD-Diffusion, the first diffusion-based reconstruction framework for video SCI, which reformulates the inverse problem as a progressive denoising process. Specifically, we design a Physically-Constrained Dynamic Diffusion (PCD-Diffusion) model, introducing a region-adaptive diffusion schedule and spatiotemporal residual estimation. This method explicitly aligns the denoising process with SCI’s spatially non-uniform and temporally evolving residual distribution. Additionally, a motion prior-guided diffusion schedule and a Gauss-guided spatiotemporal adaptive residual estimation dynamically steer the denoising trajectory, ensuring accurate motion detail restoration and physically consistent reconstructions. Extensive results on simulated and real datasets verify the superior reconstruction fidelity and temporal coherence of the proposed PCD-Diffusion framework over existing approaches. Code will be released upon publication.
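
For readers unfamiliar with the SCI physical model referred to above, the sketch below shows the commonly used video SCI forward operator, in which T frames are modulated by per-frame masks and summed into one snapshot, Y = sum_t M_t * X_t. The binary random masks, the array sizes, and the function names are assumptions for illustration; measurement noise is omitted and the diffusion-based reconstruction itself is not reproduced.

```python
import numpy as np

def sci_measure(frames, masks):
    """Compress T frames into one snapshot: Y = sum_t masks[t] * frames[t]."""
    return np.sum(masks * frames, axis=0)

def sci_residual(measurement, estimate, masks):
    """Measurement-domain residual of a candidate reconstruction."""
    return measurement - sci_measure(estimate, masks)

rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = rng.random((T, H, W))
masks = (rng.random((T, H, W)) > 0.5).astype(np.float64)  # assumed binary codes
snapshot = sci_measure(frames, masks)

# A reconstruction that matches the true frames leaves zero residual.
residual = sci_residual(snapshot, frames, masks)
```

A physically constrained reconstruction method can use such a residual to check, at each step, how far its current estimate is from explaining the observed snapshot.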

Abstract:
Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in the water body introduce inhomogeneous medium attenuation, which violates the uniform-propagation-medium assumption of conventional volume rendering. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles in inhomogeneous underwater environments, where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a depth-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR

Abstract:
Unsupervised Text-Based Person Search (TBPS) eliminates the need for costly manual sentence annotations by generating pseudo sentences via Multi-modal Large Language Models (MLLMs). However, these pseudo sentences often suffer from quality defects, resulting in semantic misalignment across modalities that hinders discriminative representation learning. To address this problem, we propose PSE-QRL (Pseudo Sentences Evaluation and Quality-aware Robust Learning), a unified framework that enhances robustness to pseudo sentences for unsupervised TBPS. PSE-QRL dynamically couples an evolving TBPS model with MLLMs to assess the reliability of pseudo sentences, and adaptively leverages high-quality ones during training. It consists of three key components: 1) Multi-granularity Sentence Augmentation, for enriching pseudo sentences with multiple granularities to broaden the diversity of image-sentence pairs; 2) Hybrid Quality Evaluation, to combine the MLLM’s cross-modal reasoning knowledge with the TBPS model’s person-specific distinguishing capabilities for effective sentence quality assessment; and 3) Quality-aware Robust Learning, for selecting and re-weighting samples based on quality scores to emphasize reliable sentence annotations while suppressing low-quality ones. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks demonstrate the effectiveness of PSE-QRL in improving learning robustness, achieving state-of-the-art (SOTA) retrieval performance for unsupervised TBPS.
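The selecting and re-weighting step can be sketched as follows. This is only an illustration under assumptions not stated in the abstract: quality scores lie in [0, 1], selection is a hard threshold, and weights are the normalized scores of the kept samples. The names `quality_weights`, `weighted_loss`, and `threshold` are hypothetical.

```python
def quality_weights(scores, threshold=0.5):
    """Drop samples scoring below the threshold and normalize the remaining scores."""
    kept = {i: s for i, s in enumerate(scores) if s >= threshold}
    total = sum(kept.values())
    if total <= 0:
        return {}
    return {i: s / total for i, s in kept.items()}


def weighted_loss(losses, weights):
    """Combine per-sample losses using the quality-based weights."""
    return sum(w * losses[i] for i, w in weights.items())


# Three pseudo sentences with assumed quality scores and per-sample losses.
scores = [0.9, 0.2, 0.6]
losses = [1.0, 5.0, 2.0]
weights = quality_weights(scores, threshold=0.5)
batch_loss = weighted_loss(losses, weights)
```

The low-quality sample (score 0.2) contributes nothing to `batch_loss`, while the two reliable ones are weighted in proportion to their scores.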

Abstract:
Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation enables a model agnostic to camera parameters that can be universally deployed across arbitrary camera configurations in a given area without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Evaluation on standard benchmarks shows that RUMPL significantly outperforms existing methods, yielding a 56.6% MPJPE (All KP) reduction on Human3.6M over triangulation-based methods and exceeding 70% improvement on the CMU Panoptic dataset when compared to transformer-based image-representation approaches. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability.
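A ray-based representation of a 2D keypoint can be illustrated with standard pinhole back-projection. The sketch below assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a world-to-camera rotation, and a known camera center; the exact parameterization used by RUMPL is not given in the abstract, and `pixel_to_ray` is a hypothetical name.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy, rotation, center):
    """Return (origin, unit direction) of the world-space ray through pixel (u, v).

    rotation is a 3x3 world-to-camera matrix given as nested lists, and
    center is the camera position in world coordinates.
    """
    # Direction of the pixel in the camera frame, on the z = 1 plane.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # The camera-to-world rotation is the transpose of the world-to-camera one.
    d_world = [sum(rotation[j][i] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return tuple(center), tuple(c / norm for c in d_world)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A keypoint at the principal point looks straight down the optical axis.
origin, direction = pixel_to_ray(
    960.0, 540.0, 1000.0, 1000.0, 960.0, 540.0, identity, (0.0, 0.0, 2.0)
)
```

Because the ray is expressed in world coordinates, the same keypoint seen from cameras with different calibrations maps into one shared space, which is the property that makes a lifter independent of camera parameters.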

Abstract:
Cross-view geo-localization between drone and satellite images is severely challenged by rapid weather variations, which induce appearance shifts, occlusions, and texture degradation. Inspired by human foveal attention, we propose the Fovea Attention Network (FANet), a robust dual-branch framework comprising: 1) the Weather-Adaptive Global Branch (WAGB) that explicitly injects weather cues (e.g., ‘rain/snow’) into the feature space via a style-modulation encoder, then captures large-scale structural consistency through a Learnable Region Reassembly (LRR) mechanism; and 2) the Local Semantic Attention Branch (LSAB) that leverages a pretrained segmentation model to generate high-confidence masks, distilling discriminative features from salient regions. An adaptive fusion strategy module fuses global context with fine-grained semantic cues. We further adopt multi-weather adaptive training, treating weather types as related tasks with shared parameters to reduce cross-weather confounding. Extensive experiments on University-1652, SUES-200, and CVUSA show that FANet achieves competitive Recall@1 across all conditions, attaining the highest overall mean with the lowest variance. Notably, it improves Recall@1 by 6.79% under severe low-illumination (‘dark’) conditions, demonstrating robustness and stability. Our code is available at https://github.com/Jahawn-Wen/FANet

Abstract:
Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams. Since samples of the data streams can be seen only once, it is more suitable for real-world scenarios than offline learning. However, this constraint intensifies the challenge for OCIL in maintaining an appropriate balance between stability and plasticity. Moreover, under the stricter memory buffer constraints of the real world, current replay-based methods are less effective. While ensemble methods improve plasticity, they often struggle with stability. Inspired by the Global Workspace Theory (GWT), we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM), a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. Like the broadcasting mechanism of GWT, the GWM is redistributed periodically to students, stabilizing learning and promoting cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. It enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. The code is available at https://github.com/susususushi/GWM.
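The fuse-then-broadcast cycle can be sketched on toy parameter dictionaries. This is a hedged illustration only: it assumes the fusion is a plain average of student parameters folded into the GWM with an exponential moving average, and that broadcasting overwrites the students. The abstract does not state the fusion rule, and `average_params`, `update_workspace`, `broadcast`, and `momentum` are hypothetical.

```python
def average_params(students):
    """Element-wise mean of the students' parameters (assumed fusion rule)."""
    n = len(students)
    first = students[0]
    return {
        name: [sum(s[name][i] for s in students) / n for i in range(len(first[name]))]
        for name in first
    }


def update_workspace(gwm, students, momentum=0.9):
    """Fold the current fused parameters into the running workspace model."""
    fused = average_params(students)
    if gwm is None:
        return fused
    return {
        name: [momentum * g + (1.0 - momentum) * f for g, f in zip(gwm[name], fused[name])]
        for name in fused
    }


def broadcast(gwm, students):
    """Redistribute the workspace parameters to every student."""
    for student in students:
        for name, values in gwm.items():
            student[name] = list(values)


students = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
gwm = update_workspace(None, students)      # first batch: plain average
students[0]["w"] = [4.0, 5.0]               # pretend both students took
students[1]["w"] = [4.0, 5.0]               # a gradient step
gwm = update_workspace(gwm, students, momentum=0.5)
broadcast(gwm, students)
```

In a training loop, `update_workspace` would run every batch and `broadcast` only every few batches, so students can diverge briefly before being pulled back to the shared anchor.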

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP’s vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP’s dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion’s reliable spatial consistency to mitigate the over-smoothing issue in CLIP’s attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP’s self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion’s generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.

Abstract:
Pre-trained Vision-Language Models (VLMs) have demonstrated strong zero-shot generalization capabilities. Despite their effectiveness on various downstream tasks, they remain vulnerable to adversarial samples. Existing methods fine-tune VLMs to improve their robust performance by performing adversarial training on a certain dataset. However, this can lead to model overfitting and is not a true zero-shot scenario. In this paper, we propose a truly zero-shot and training-free approach that can improve the zero-shot adversarial robustness of VLMs on the evaluated benchmarks. Specifically, we first discover that simply adding Gaussian noise can enhance the VLM’s zero-shot robustness. Then, we treat the adversarial examples with added Gaussian noise as anchors and strive to find a path in the embedding space that leads from the adversarial examples to the cleaner samples. Furthermore, to avoid the overfitting issue caused by fixed hyperparameters, we propose an adaptive parameter adjustment method based on the distance between the anchors and adversarial samples in the embedding space. We largely preserve the original VLMs’ zero-shot generalization abilities in a truly zero-shot and training-free manner on the evaluated benchmarks compared to previous methods. Extensive experiments on 16 datasets demonstrate that our method can achieve stronger zero-shot robust performance, improving the top-1 robust accuracy by an average of 10.83%.

Abstract:
Hyperspectral and multispectral image fusion (HMF) enhances spatial-spectral quality by fusing low-resolution hyperspectral images (LR-HSI) with high-resolution multispectral images (HR-MSI). Although recent fusion methods have shown promise in preserving the multi-mode structure of high-dimensional data, they still face several challenges. For tensor-based approaches, conventional mode-wise decomposition, such as order-3 CP or Tucker decomposition, may disrupt intrinsic spatial consistency. Furthermore, although deep learning exhibits powerful feature representation ability, existing deep fusion methods rely on ‘data-driven’ networks that remain insufficiently interpretable and depend on large amounts of training data. To address these issues, a novel Self-Expressive High-Order Tensor Unrolling Network (SHOTUN) is proposed for unsupervised HSI-MSI fusion. Within the sparse core tensor decomposition framework, we introduce the intrinsic self-expressive relationships among overlapping image patches as a form of high-order mode representation to preserve the spatial structure in the fusion model. During optimization, we adopt an alternating optimization strategy and design dedicated modules for each sub-problem, yielding an interpretable end-to-end training pipeline. Furthermore, to improve generalization across different sensors, we introduce a pre-training strategy into the unsupervised training for more accurate estimation of the unknown degradation parameters. Extensive experimental results on simulated and real datasets demonstrate the effectiveness of our proposed method. The source code is publicly available at https://github.com/Shawn-H-Wang/SHOTUN

Abstract:
In intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructure, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD

Abstract:
Recent research on the joint classification of multimodal remote sensing data has achieved outstanding performance in tasks within predefined label spaces. However, surface conditions are dynamic and change over time, resulting in variations in land cover classes collected from the same region at different time points. As a result, when new classes are discovered, previous works must use a combination of old and new class data to retrain the model, which incurs high computational costs and raises concerns about data privacy. In this work, we propose the prototype-based meta-prompt tuning (PMPT) framework, which fine-tunes only a few session-relevant visual prompts to adapt to incremental classes, while simultaneously learning prototype embeddings for each class to preserve historical knowledge. Specifically, PMPT consists of a meta-learning-based feature representation backbone and an incrementally updated nearest-class-mean (NCM) classifier. The backbone is trained on base class data to learn shared and stable global knowledge, then frozen, with only the prompts fine-tuned to extract session-specific local knowledge from incremental sessions. The NCM classifier is a globally shared classifier that measures the similarity between test samples and prototypes, effectively alleviating the issues of knowledge forgetting and overfitting. Additionally, we propose an incremental prototype contrastive loss to reduce semantic drift and prototype overlap in the embedding space. During the testing phase, PMPT reproduces the complete embedding function by matching samples, class prototypes, and visual prompts, thereby enabling accurate classification of unknown samples. The method has been tested on widely used multimodal remote sensing datasets, demonstrating the effectiveness of the proposed PMPT in addressing the stability-plasticity dilemma with limited incremental samples. The code is available at https://github.com/Jiahuiqu/PMPT
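An incrementally updated nearest-class-mean classifier can be sketched in a few lines. The sketch assumes cosine similarity between an embedding and each class prototype and a running-mean prototype update; the abstract only says the classifier measures similarity to prototypes, so the similarity measure and the class name `NCMClassifier` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class NCMClassifier:
    def __init__(self):
        self.prototypes = {}  # class label -> mean embedding
        self.counts = {}      # class label -> number of samples seen

    def add_samples(self, label, embeddings):
        """Update the running mean prototype of one class."""
        for emb in embeddings:
            n = self.counts.get(label, 0) + 1
            mean = self.prototypes.get(label, [0.0] * len(emb))
            self.prototypes[label] = [m + (x - m) / n for m, x in zip(mean, emb)]
            self.counts[label] = n

    def predict(self, embedding):
        """Return the label whose prototype is most similar to the embedding."""
        return max(self.prototypes, key=lambda c: cosine(embedding, self.prototypes[c]))


clf = NCMClassifier()
# Base session with two classes (toy 2-D embeddings).
clf.add_samples("water", [[1.0, 0.0], [0.8, 0.2]])
clf.add_samples("forest", [[0.0, 1.0], [0.2, 0.8]])
# Incremental session: a new class is added without touching old prototypes.
clf.add_samples("urban", [[-1.0, 0.0]])
```

Adding a class only inserts a new prototype, so earlier prototypes stay fixed unless the embedding function itself changes.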

Abstract:
Confined spaces refer to partially or fully enclosed areas, e.g., sewage wells, where working conditions pose significant risks to the workers. The evaluation of COnfined Space Operational Safety (COSOS) refers to verifying whether workers are properly equipped with safety equipment before entering a confined space, which is crucial for protecting their safety and health. Due to the crowded nature of such environments and the small size of certain safety equipment, existing methods face significant challenges. Moreover, there is a lack of dedicated datasets to support research in this domain. In this paper, to advance research on this challenging task, we present COSOS-1k, an extensive dataset constructed from diverse confined space scenarios. It comprises multi-view videos for each scenario, covers 10 essential types of safety protective equipment and 6 worker attributes, and is annotated with expressive object locations, fine-grained attributes, and occlusion status. To our knowledge, COSOS-1k is the first dataset tailored explicitly to real-world COSOS scenarios. In addition, we address the challenge of occlusion from three perspectives: instance, video, and view. Firstly, at the instance level, we propose an Occlusion-aware Uncertainty Estimation (OUE) method, which leverages box-level occlusion annotations to enable part-level occlusion prediction for objects. Secondly, at the video level, we introduce Cross-Frame Cluster (CFC) attention, which integrates temporal context features from the same object category to mitigate the impact of occlusions in the current frame. Finally, we extend CFC to the view level and form Cross-View Cluster (CVC) attention, where complementary information is mined from another view. Extensive experiments demonstrate the effectiveness of the proposed methods and provide insights into the importance of dataset diversity and expressivity. The COSOS-1k dataset and code are available at https://github.com/deepalchemist/cosos-1k

Abstract:
Indoor 3D object detection serves as a fundamental task in computer vision and robotics. Existing research predominantly focuses on training domain-specific optimal models for individual datasets, yet it overlooks the potential value of capturing universal geometric attributes that can substantially enhance object detection performance across diverse domains. To resolve this gap, we propose COME, a novel and effective collaborative optimization framework designed to seamlessly integrate these universal attributes while preserving the domain-specific characteristics of each dataset domain. COME is built on VoteNet and incorporates a Cross-Domain Expert Parameter Sharing Strategy (CEPSS) that draws inspiration from the Mixture of Experts (MoE) framework. Its core innovation resides in the dual-expert design of CEPSS: domain-shared experts capture universal geometric relationships across datasets, whereas domain-specific experts encode unique features for individual datasets. This separation enables the model to focus on learning both generic and domain-specialized visual cues, without mutual interference. In addition, to dynamically adapt to different domains, we design a lightweight gating network that automatically selects relevant experts, eliminating irrelevant feature interference and enhancing model specialization. Compared to standard parameter-sharing architectures, this design significantly reduces gradient conflicts during multi-domain training. We further optimize computational efficiency by implementing low-rank structures for domain-shared and domain-specific experts, thus striking a better balance between memory overhead and detection performance. Experiments show that COME achieves state-of-the-art results across benchmarks, with acceptable parameter growth, and outperforms existing multi-domain detection methods.
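The dual-expert design with a gating network can be illustrated by a toy forward pass. This sketch makes several assumptions not stated in the abstract: each expert is a rank-r linear map (`up @ down`), the gate is a softmax over linear logits, and the shared expert's output is added to a gate-weighted sum of the domain-specific experts. All function names are hypothetical.

```python
import math


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def low_rank_expert(down, up, x):
    """Apply a low-rank linear expert: x -> up @ (down @ x)."""
    return matvec(up, matvec(down, x))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, shared, specific, gate):
    """Shared expert output plus gate-weighted domain-specific expert outputs."""
    gates = softmax(matvec(gate, x))
    out = low_rank_expert(*shared, x)
    for g, expert in zip(gates, specific):
        y = low_rank_expert(*expert, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


x = [1.0, 2.0]
shared = ([[1.0, 0.0]], [[1.0], [0.0]])            # rank-1 (down, up) pair
specific = [
    ([[0.0, 1.0]], [[0.0], [1.0]]),                # expert for domain A
    ([[1.0, 1.0]], [[1.0], [1.0]]),                # expert for domain B
]
gate = [[0.0, 0.0], [0.0, 0.0]]                    # untrained gate: uniform weights
output, gates = moe_layer(x, shared, specific, gate)
```

A rank-r expert on d-dimensional features stores 2·r·d weights instead of d², which is the memory saving the low-rank structure is meant to provide.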

Abstract:
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being affected by unordered keypoints that obscure its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at https://github.com/ShineFox/CorrMamba

Affiliations: School of Artificial Intelligence and Information Engineering, Zhejiang University of Science and Technology, Hangzhou, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Zhejiang Key Laboratory of Artificial Intelligence of Things (AIoT) Network and Data Security, Hangzhou, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; Meta AI, Hong Kong, China; Center for Biometrics and Security Research and the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract:
Vision–Language Pretrained (VLP) models exhibit strong multimodal understanding and reasoning capabilities, finding wide application in tasks such as image–text retrieval and visual grounding. However, they remain highly vulnerable to adversarial attacks, posing serious reliability concerns in safety-critical scenarios. We observe that existing adversarial example optimization methods typically rely on individual features from the other modality as guidance, causing the crafted adversarial examples to overfit that modality’s learning preferences and thus limiting their transferability. To further enhance the transferability of adversarial examples, we propose a novel adversarial attack framework, I&CA (Individual & Common feature Attack), which simultaneously considers individual features within each modality and common features in cross-modal interactions. Concretely, I&CA first drives divergence among individual features within each modality to disrupt single-modality learning, and then suppresses the expression of common features during cross-modal interactions, thereby undermining the robustness of the fusion mechanism. In addition, to prevent adversarial perturbations from overfitting to the learning bias of the other modality, which may distort the representation of common features, we simultaneously apply augmentation strategies to both modalities. Across various experimental settings and widely recognized multimodal benchmarks, the I&CA framework achieves an average transferability improvement of 6.15% over the state-of-the-art DRA method, delivering significant performance gains in both cross-model and cross-task attack scenarios.

Abstract:
Using higher-resolution feature maps in the network is an effective approach for detecting small objects. However, high-resolution feature maps face the challenge of lacking semantic information. This has led previous methods to rely on downsampling feature maps, applying large-kernel convolution layers, and then upsampling the feature maps to obtain semantic information. However, these methods have certain limitations: first, large kernel convolutions in deeper layers typically provide significant global semantic information, but our experiments reveal that such prominent semantic information introduces background smear, which in turn leads to overfitting. Second, deep features often contain substantial redundant information, and the features of small objects are either minimal or have disappeared, which causes a degradation in detection performance when directly relying on deep features. To address these issues, we propose a high-resolution network based on local contextual semantics (HR-SemNet). The network is built on the proposed high-resolution backbone (HRB), which replaces the traditional backbone-FPN architecture by focusing all computational resources of large kernel convolutions on high-resolution feature layers to capture clearer features of small objects. Additionally, a local context semantic module (LCSM) is employed to extract semantic information from the background, confining the semantic extraction to a local window to avoid interference from large-scale backgrounds and objects. HR-SemNet decouples small object semantics from contextual semantics, with HRB and LCSM independently extracting these features. Extensive experiments and comprehensive evaluations on the VisDrone, AI-TOD, and TinyPerson datasets validate the effectiveness of the method. On the VisDrone dataset, which contains a large number of small objects, HR-SemNet improves the mean average precision (mAP) by 4.6%, reduces the computational cost (GFLOPs) by 49.9%, and decreases the parameter count by 94.9%.

Abstract:
Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that match a given textual query only partially. This task is inherently challenging due to the modality gap between text and video, which is further exacerbated by the partial semantic correspondence between linguistic descriptions and visual content. To address these challenges, we propose a bidirectional cross-modal alignment mechanism that collaboratively optimizes both visual and textual modalities. In the visual modality, a major difficulty lies in the absence of visual cues that directly correspond to textual semantics, limiting the model’s ability to align visual representations with textual meanings under unsupervised conditions. To overcome this issue, we construct a semantic-visual association library, which stores paired visual and textual features with semantic annotations. During training, the model dynamically retrieves the most semantically similar visual samples from this library based on the current visual feature vector. These retrieved samples, preliminarily associated with semantics via cross-modal matching, are used to form dynamic anchors that guide visual representation learning. By leveraging these enriched visual features, the model progressively refines the visual representations to achieve better alignment with the corresponding textual inputs, thereby enhancing cross-modal consistency. In the textual modality, we enhance textual representations by integrating semantically aligned visual features selected from the same association library, further narrowing the modality gap. Extensive experiments on benchmark datasets under partial semantic correspondence scenarios demonstrate that our method achieves state-of-the-art performance. The source code of the paper is available at https://github.com/cyanlll/BOA

Abstract:
Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.

Abstract:
Multi-modal image fusion (MMIF) aims to integrate complementary information from heterogeneous sensor modalities. However, substantial cross-modality discrepancies hinder joint scene representation and lead to semantic degradation in the fused output. To address this limitation, we propose C2MFuse, a novel framework designed to preserve content while ensuring cross-modality consistency. To the best of our knowledge, this is the first MMIF approach to explicitly disentangle style and content representations across modalities for image fusion. C2MFuse introduces a content-preserving style normalization mechanism that suppresses modality-specific variations while maintaining the underlying scene structure. The normalized features are then progressively aggregated to enhance fine-grained details and improve content completeness. In light of the lack of ground truth and the inherent ambiguity of the fused distribution, we further align the fused representation with a well-defined source modality, thereby enhancing semantic consistency and reducing distributional uncertainty. Additionally, we introduce an adaptive consistency loss with learnable transformation, which provides dynamic, modality-aware supervision by enforcing global consistency across heterogeneous inputs. Extensive experiments on five datasets across three representative MMIF tasks demonstrate that C2MFuse achieves efficient and high-quality fusion, surpasses existing methods, and generalizes effectively to downstream visual applications.

Abstract:
Distribution estimation is a pivotal strategy in few-shot learning (FSL) to mitigate data scarcity by sampling from estimated distributions, utilizing statistical properties (mean and variance) transferred from related base categories. However, category-level estimation alone often fails to generate representative samples due to significant dissimilarities between base and novel categories, leading to suboptimal performance. To address this limitation, we propose Hybrid Granularity Distribution Estimation (HGDE), which integrates both coarse-grained category-level statistics and fine-grained instance-level statistics. By leveraging instance statistics from the nearest base samples, HGDE enhances the characterization of novel categories, capturing subtle features that category-level estimation overlooks. These statistics are fused through linear interpolation to form a robust distribution for novel categories, ensuring both diversity and representativeness in generated samples. Additionally, HGDE employs refined estimation techniques, such as weighted summation for mean calculation and principal component retention for covariance, to further improve accuracy. Empirical evaluations on four FSL benchmarks, including Mini-ImageNet, Tiered-ImageNet, CUB and CIFAR-FS, demonstrate that HGDE offers effective distribution estimation capabilities and leads to notable accuracy gains, with improvements of more than 1.8% in 1-shot tasks on CUB. These results highlight HGDE’s ability to balance mean precision and variance diversity, making it a versatile and effective solution for FSL.
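The interpolation between category-level and instance-level statistics can be sketched as below. The sketch simplifies heavily and should be read as an assumption-laden illustration: it uses per-dimension variances rather than a full covariance, a single interpolation weight `alpha`, and plain Gaussian sampling. The function names are hypothetical and the weighted-mean and principal-component refinements are omitted.

```python
import math
import random


def mean_and_var(vectors):
    """Per-dimension mean and (population) variance of a set of feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean, var


def fuse_stats(category_stats, instance_stats, alpha):
    """Linearly interpolate category-level and instance-level statistics."""
    (cat_mean, cat_var), (inst_mean, inst_var) = category_stats, instance_stats
    mean = [alpha * c + (1.0 - alpha) * i for c, i in zip(cat_mean, inst_mean)]
    var = [alpha * c + (1.0 - alpha) * i for c, i in zip(cat_var, inst_var)]
    return mean, var


def sample_features(mean, var, n, seed=0):
    """Draw n synthetic features from the fused diagonal Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)] for _ in range(n)]


# Toy features: one similar base category and the nearest base instances.
base_category = [[0.0, 2.0], [2.0, 2.0]]
nearest_instances = [[2.0, 4.0], [4.0, 4.0]]
mean, var = fuse_stats(
    mean_and_var(base_category), mean_and_var(nearest_instances), alpha=0.5
)
samples = sample_features(mean, var, n=4)
```

The sampled features would then augment the few labeled support examples of the novel class before a classifier is fitted.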

Abstract:
We present HoloQA, a new state-of-the-art Full Reference Video Quality Assessment (VQA) model that was designed using principles of visual neuroscience, information theory, and self-supervised deep learning to accurately predict the quality of rendered digital human avatars in Virtual Reality (VR) and Augmented Reality (AR) systems. The growing adoption of VR/AR applications that aim to transmit digital human avatars over bandwidth-limited video networks has driven the need for VQA algorithms that better account for the kinds of distortions that reduce the quality of rendered and viewed avatars. As we will show, standard VQA models often fail to capture distortions unique to the rendering, transmission, and compression of videos containing human avatars. Towards solving this difficult problem, we adopt a multi-level Mixture-of-Experts approach. This involves computing distortion-aware perceptual features and high-level content-aware deep features that capture semantic attributes of human body avatars. The high-level features are computed using a self-supervised, pre-trained deep learning network. We show that HoloQA is able to achieve state-of-the-art performance on the recently introduced LIVE-Meta Rendered Human Avatar VQA database, demonstrating its efficacy in predicting the quality of rendered human avatars in VR. Furthermore, we demonstrate the competitive performance of HoloQA on other digital human avatar databases and on another synthetically generated video quality use case: cloud gaming. The code associated with this work will be made available on GitHub.

Abstract:
Due to the loss of 3D information, accurate and robust 2D image feature matching remains challenging for many computer vision applications. This paper introduces a 2.5D feature that uses the disparity value from the light field Fourier disparity layer (FDL) as a rough proxy of scene depth. Without explicit depth estimation, a parameterized depth-degraded projection is proposed to construct the geometric transformation of paired features between two light fields. Then, we propose a parameterized learning solution to calculate the depth-degraded projection. This solution estimates a global constant fundamental matrix, a variable disparity-guided translation vector, and a depth compensation term using a very simple network. Although the 0.5D relative disparity provided by the FDL does not represent precise depth, it can still significantly reduce the depth ambiguity in feature matching. Therefore, the proposed solution achieves accurate feature-matching results by minimizing the sum of reprojection errors across all matching candidates. On the public light field feature-matching dataset, the proposed solution outperforms existing 2D image feature-matching solutions and light field feature-matching algorithms in terms of matching accuracy and robustness. The code is available online.

Abstract:
Natural image quality is often degraded by adverse weather conditions, significantly impairing the performance of downstream tasks. Image restoration has emerged as a core solution to this challenge and has been widely discussed in the literature. Although recent transformer-based approaches have made remarkable progress in image restoration, their increasing system complexity poses significant challenges for real-time processing, particularly in real-world deployment scenarios. To this end, most existing methods attempt to simplify the self-attention mechanism, for example through channel self-attention or state space models. However, these methods primarily focus on network architecture while neglecting the inherent characteristics of image restoration itself. In this context, we explore a pyramid Wavelet-Fourier iterative pipeline to demonstrate the potential of Wavelet-Fourier processing for image restoration. Inspired by the above findings, we propose a novel and efficient restoration baseline, named Pyramid Wavelet-Fourier Network (PW-FNet). Specifically, PW-FNet features two key design principles: 1) at the inter-block level, it integrates a pyramid wavelet-based multi-input multi-output structure to achieve multi-scale and multi-frequency band decomposition; and 2) at the intra-block level, it incorporates Fourier transforms as an efficient alternative to self-attention mechanisms, effectively reducing computational complexity while preserving global modeling capability. Extensive experiments on tasks such as image deraining, raindrop removal, image super-resolution, motion deblurring, image dehazing, image desnowing and underwater/low-light enhancement demonstrate that PW-FNet not only surpasses state-of-the-art methods in restoration quality but also achieves superior efficiency, with significantly reduced parameter size, computational cost and inference time. The code is available at: https://github.com/deng-ai-lab/PW-FNet
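
Why a Fourier transform can stand in for a global mixing operation is easiest to see in one dimension: multiplying two spectra element-wise is equivalent to a circular convolution that touches every position. The sketch below checks this with a naive DFT. It is not the PW-FNet block; the function names, the toy signal, and the filter values are assumptions made for illustration, and a practical model would use a fast FFT rather than this O(n^2) transform.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fourier_mix(signal, kernel):
    """Mix all positions globally by multiplying the two spectra element-wise."""
    S, K = dft(signal), dft(kernel)
    return [v.real for v in idft([s * k for s, k in zip(S, K)])]

def circular_conv(signal, kernel):
    """Reference: the same mixing written as an explicit circular convolution."""
    n = len(signal)
    return [sum(signal[j] * kernel[(i - j) % n] for j in range(n)) for i in range(n)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]  # hypothetical filter

mixed = fourier_mix(signal, kernel)
reference = circular_conv(signal, kernel)
```

Both lists agree up to floating-point error, which is the convolution theorem in action.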

Abstract:
Although video generation and editing models have advanced significantly, individual models remain restricted to specific tasks, often failing to meet diverse user needs. Effectively coordinating these models in pipelines can unlock a wide range of video generation and editing capabilities. However, manual orchestration is complex, time-consuming, and requires deep expertise in model performance and limitations. To address these challenges, we propose the Semantic Planning Agent (SPAgent), a novel system that automatically coordinates state-of-the-art open-source models to fulfill complex user intents. To equip SPAgent with robust orchestration capabilities, we introduce a three-step framework: 1) decoupled intent recognition to accurately parse multi-modal inputs; 2) principle-guided route planning to design effective execution chains; and 3) capability-based model selection to identify the optimal tools for each sub-task. To facilitate training, we curate a comprehensive multi-task generative video dataset. Furthermore, we enhance SPAgent with a video quality evaluation module, enabling it to autonomously assess and incorporate new models into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate and edit high-quality videos, exhibiting superior versatility and adaptability across various tasks.

Affiliations: School of Artificial Intelligence, State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, and Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University, Hefei, China; School of Electronic Information Engineering, Anhui University, Hefei, China; National Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology and Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China

Abstract:
Each sequence in existing RGBT tracking datasets is typically captured from a single platform equipped with both RGB (visible light) and TIR (thermal infrared) sensors. In real-world applications, tracking some objects requires cross-platform collaboration and these platforms might be equipped with different sensors. However, changes in modalities and platforms may cause significant variations in target appearance and abrupt position shifts, which existing RGBT trackers struggle to handle. To address these challenges, we define a new task, termed dynamic RGBT tracking, focusing on cross-platform and modality-variant scenarios. Considering the dynamic changes of modalities and platforms, we investigate dynamic RGBT tracking from a causal perspective, and assume that images consist of causal factors (target-relevant information) and non-causal factors (target-irrelevant information, i.e., modality/platform information), where only the former is conducive to stable tracking. Based on this assumption, we propose a novel causality-based modality&platform-invariant representation learning approach to capture robust invariant representations for dynamic RGBT tracking. In particular, to mitigate the challenges posed by modality variations, we design a causal consistency encoder that introduces an intervener to model feature uncertainty and simulate modal variations, compelling the model to focus on modality-invariant features to improve tracking robustness. To overcome the issue of abrupt view change and position shift, we design a platform-independent global searcher to re-localize the target whenever a platform switch occurs, which leverages an intervener to simulate the interference of platform changes on features, encouraging the searcher to learn platform-invariant representations for improved localization accuracy. 
In addition, to promote the research and development of dynamic RGBT tracking, we construct a dataset named DRGBT603, which consists of 603 sequences with a total of 1.49M frame pairs. Extensive experiments on the DRGBT603 dataset validate the effectiveness of the proposed method against other state-of-the-art methods. Our code and data are available at https://github.com/dongdong2061/DRGBT

Abstract:
Change detection (CD) in heterogeneous remote sensing images plays a crucial role in earth observation tasks, such as disaster monitoring and destruction assessment. Recent advancements in heterogeneous CD studies have substantially enhanced the capability to detect changes, but existing methodologies frequently lack effective control mechanisms for increasing false alarms when facing different heterogeneous scenes. Consequently, even with a high detection rate for changes, the real changes co-exist with lots of false alarms, thereby reducing the reliability and practical utility of the CD results. To address this issue, inspired by the insight of adaptive thresholding for false alarm control in constant false alarm rate (CFAR) detection, we propose a copula theory-based CD framework, named FAR-Aware-Copula-CD, to control false alarm rate (FAR) in heterogeneous CD. In the proposed FAR-Aware-Copula-CD, the heterogeneous CD problem is represented as a binary hypothesis testing problem. Then, the binary hypothesis testing problem is solved by a generalized likelihood ratio test based on copula theory, which effectively characterizes change statistics based on superpixel-level dependence within various heterogeneous image pairs. Finally, the decision thresholds of the copula-based change statistics are determined so as to satisfy the FAR constraint and ensure that the final CD result approaches a prespecified false alarm rate. Our FAR-Aware-Copula-CD provides a new approach for implementing controllable false alarms in heterogeneous CD tasks. Experimental results on four real-world datasets demonstrate the effectiveness of our proposed method.
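
The idea of choosing a decision threshold to meet a prespecified false alarm rate can be sketched with an empirical quantile. This is not the copula-based generalized likelihood ratio test of the paper: it assumes that statistics observed under the no-change hypothesis are available as a list, and the function names and the synthetic Gaussian scores are hypothetical.

```python
import random

def far_threshold(null_scores, target_far):
    """Return a threshold whose exceedance rate on null_scores is at most target_far."""
    if not 0.0 <= target_far < 1.0:
        raise ValueError("target_far must lie in [0, 1)")
    ordered = sorted(null_scores)
    n = len(ordered)
    allowed = int(target_far * n)   # how many no-change scores may exceed the threshold
    return ordered[n - allowed - 1]

def false_alarm_rate(null_scores, threshold):
    """Fraction of no-change scores that would be flagged as change."""
    return sum(s > threshold for s in null_scores) / len(null_scores)

# Synthetic statistics standing in for unchanged regions (assumption).
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(1000)]

threshold = far_threshold(null_scores, target_far=0.05)
rate = false_alarm_rate(null_scores, threshold)
```

A region is declared changed when its statistic exceeds `threshold`, so the empirical false alarm rate on the no-change scores stays at or below the target.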

Abstract:
Alzheimer’s Disease (AD) detection is essential for timely treatment and better patient care. Magnetic Resonance Imaging (MRI) uses radio waves and magnetic fields to capture high-resolution, multi-dimensional representations of brain structures. This high-resolution imaging capability makes MRI a key tool for diagnosing neurological disorders such as Alzheimer’s disease. However, correctly classifying new MRI scans of patients remains a challenge. Researchers have proposed a deep learning-based method for Alzheimer’s disease diagnosis using a Siamese Convolutional Neural Network (SCNN) with three ResNet-34 branches trained on structural MRI data. However, this method relies solely on ResNet-34 for feature extraction, which struggles to preserve spatial relationships because pooling operations discard positional information. Other researchers have explored methods such as attention mechanisms and 3D convolutional networks to capture spatial dependencies. However, these methods underperform, either missing the complexity of brain structure or requiring substantial computational resources without delivering consistent accuracy. In this study, we propose a cognitively inspired approach for classifying MRI images as Non Demented, Very Mild Demented, Mild Demented and Moderate Demented using a Siamese Capsule Network (SNNCap). SNNCap uses ResNet-18 for feature extraction and capsule layers to preserve spatial and part-whole relationships in the images. It compares a test image against a few known reference examples per class. This reference-based validation closely mimics cognitive reasoning, improving the system’s generalizability. The model achieves strong results on unseen data and demonstrates its effectiveness through classification reports and confusion matrices.

Abstract:
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.

Abstract:
Normalized Cut (NCut) or Spectral Clustering (SC) discourages the isolated segmentation that may result from the standard minimum Cut by adding a volume constraint. Such a volume constraint introduces a significant computational challenge, or an undesirable effect when the isolated segmentation is desired. In this paper, we propose the K-way constrained Normalized Cut (K-way CNCut). It is formulated as the minimum Cut with a priori chosen (either manually or automatically) constraints or representatives for the cluster. The role played by the constraints is to attract the strongly connected nodes to the relevant hosting constraints; thus, it can either discourage or encourage the isolated segmentation, depending on the choice of constraints and the nodes surrounding them. Most critically, in this paper, the K-way CNCut is shown to be linked to the construction of the optimal prolongation operator in the algebraic multigrid method (AMG), more precisely, the energy-minimizing AMG in its most general setting, for the normalized graph Laplacian. For the special case when a single constraint is given as a representative of a single cluster, it is shown to lead to multiscale image segmentation. The importance of this link is demonstrated as well. Among other results, a set of constraints for the K-way CNCut is shown to be constructible via the multilevel coarsening algorithm used in the algebraic multigrid method, so that the K-way CNCut, otherwise dependent on manually chosen constraints, becomes a fully automatic image segmentation algorithm. A number of numerical experiments are presented and compared with state-of-the-art classical and learning-based (both supervised and semi-supervised) image segmentation algorithms, including SegNet and SAM, to demonstrate the effectiveness of the proposed framework.
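
The opening claim, that the volume term in the standard two-way Normalized Cut discourages isolated segments that a plain minimum cut would favor, can be checked on a toy graph. The sketch below uses the textbook definition NCut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B); it does not implement the proposed K-way CNCut, and the graph weights and function names are made up for illustration.

```python
def cut_value(W, A, B):
    """Total weight of edges running between node sets A and B."""
    return sum(W[i][j] for i in A for j in B)

def volume(W, A):
    """Sum of weighted degrees of the nodes in A."""
    return sum(W[i][j] for i in A for j in range(len(W)))

def normalized_cut(W, A, B):
    """Two-way normalized cut: the cut weight scaled by each side's volume."""
    c = cut_value(W, A, B)
    return c / volume(W, A) + c / volume(W, B)

# Toy graph: tight pairs {0,1} and {2,3}, a weak bridge 1-2,
# and node 4 hanging off node 3 by an even weaker edge.
W = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.05],
    [0.0, 0.0, 0.0, 0.05, 0.0],
]

isolated = ([4], [0, 1, 2, 3])      # split off the single dangling node
balanced = ([0, 1], [2, 3, 4])      # split along the weak bridge

cut_isolated = cut_value(W, *isolated)
cut_balanced = cut_value(W, *balanced)
ncut_isolated = normalized_cut(W, *isolated)
ncut_balanced = normalized_cut(W, *balanced)
```

The plain cut is smaller for the isolated split, while the normalized cut is smaller for the balanced one, because the tiny volume of the single node inflates its NCut value.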

Abstract:
Infrared and visible image fusion (IVIF) significantly enhances scene interpretation by integrating broad-spectrum information. Drawing inspiration from specific snakes that possess an evolutionarily optimized bimodal sensory system capable of parallel processing infrared and visible radiation, we propose a novel IVIF framework incorporating two key elements: nonlinear cross-modal interactions across six distinct classes of snake bimodal neurons and dynamic center-surround receptive field organization. These biological principles are mathematically formalized and integrated within a deep neural network (DNN), optimized through an object detection region-guided loss and a frequency-dependent fusion loss that enable data-driven fusion strategy learning. Experimental results demonstrate that the optimized model effectively emulates the infrared-visible information integration observed in snake bimodal neurons. Critically, the nonlinear bimodal neurons capture a significantly greater amount of edge information and finer mid-to-high-frequency details, which are essential for the subsequent reconstruction of the fused image. Furthermore, a comprehensive evaluation of visual quality, encompassing both qualitative and quantitative assessments on six datasets, along with extensive object detection and semantic segmentation experiments using the fused images in both daytime and nighttime scenarios, demonstrates that our model outperforms traditional biologically-inspired IVIF algorithms, achieving performance comparable to SOTA DNN-based methods. The code and weights are available at https://github.com/rwerwer2024/SBNF.

Abstract:
Large Vision-Language Models (LVLMs) suffer from the high computational cost of the attention mechanism caused by the large number of visual tokens. Token reduction has emerged as a promising approach to reduce the complexity by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens and eliminate irrelevant ones. This is due to the attention biases of LVLMs, where tokens with high attention scores are not always the critical ones. Such biases force existing methods into a dilemma: they face either high performance degradation or limited inference acceleration. This issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address the above issue, we propose an unbiased fine-grained token reduction method named FinePruner, which explores the attention patterns of LVLMs at the attention-head-level to mitigate the interference of attention biases. Concretely, we first conducted comparative studies to validate the impact of tokens corresponding to visual objects on final task performance, which established the conclusion that these tokens should be preserved while others can be pruned. Also, a series of visualizations unveils the changing patterns of LVLMs’ attention biases across layers and attention heads. Based on the patterns of attention biases, the pipeline of FinePruner is divided into two stages. The first stage, named Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings to exclude the attention biases. The second stage, named Attention-Refined Pruning, selects attention heads with less bias by the divergence, which are used to identify the preserved tokens. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that our FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art methods. 
The code is available at https://github.com/PKU-ICST-MIPL/FinePruner_TIP2026.

Abstract:
Existing research on text-to-image person retrieval primarily focuses on visible images, which are not suitable in low-light scenarios. Infrared imaging becomes necessary in many visual systems, and matching text with both visible and infrared images is required. However, visible and infrared images are heterogeneous with different visual characteristics, so matching text with them in a unified framework is very challenging. In this work, we design a new task called Text-Visible/Infrared person retrieval and contribute a novel approach and a unified benchmark to promote the research and development of this field. On one hand, we propose a novel Attribute-guided feature decoupling and Collaborative Alignment Network (ACANet) that pursues accurate alignment from the text modality to both visible and infrared modalities in a unified framework according to the texture and color attribute information of text descriptions. In particular, we decouple the color features of visible images supervised by the text labels and integrate them into the infrared features to eliminate the impact of the absence of color information in infrared images during cross-modal collaborative alignment. Moreover, we also decouple the texture information from visible images supervised by the text labels and perform the collaborative alignment of texture and infrared features with a fusion agent. In addition, we extend conventional masked language modeling to a cross-modal paradigm to help ACANet learn uniform fine-grained alignment in multiple image modalities. On the other hand, we contribute a unified high-quality MM01LLCM-Text dataset, which provides person images in both visible and infrared modalities paired with fine-grained text descriptions. Experimental results show that the proposed ACANet outperforms existing state-of-the-art methods on the MM01LLCM-Text dataset.

Abstract:
Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. To address these limitations, we propose AWM-Fuse, a unified fusion framework that handles diverse weather degradations via global and local text perception with shared parameters. In particular, a global text perception module leverages BLIP-generated captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, enabling the recovery of subtle details. Furthermore, textual descriptions are used to constrain the generation of fused images, effectively steering the network learning process toward better alignment with semantic labels, thereby promoting the learning of more meaningful visual features. To facilitate text-guided fusion under adverse weather, we construct AWMM-Text, a large-scale benchmark providing paired global and local annotations for multi-modality image pairs. Extensive experiments demonstrate that AWM-Fuse consistently outperforms state-of-the-art methods under complex weather conditions and on multiple downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse

Affiliations: Institute of Biomedical Manufacturing and Life Quality Engineering, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; College of Computer Science and Technology, Huaqiao University, Xiamen, China; Hepatobiliary Surgery/Digital Medicine Research Center, The First People’s Hospital of Yunnan Province, Affiliated Hospital of Kunming University of Science and Technology, Kunming, China; Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany; School of Mechanical Engineering, Institute of Biomedical Manufacturing and Life Quality Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Self-supervised learning technology has been applied to calculate depth and ego-motion from monocular videos, achieving remarkable performance in various real-world scenarios. Unfortunately, challenges such as specular reflections and soft tissue deformations in endoscopic scenes greatly undermine the performance of these methods, inevitably compromising the accuracy of depth and ego-motion estimation. To address these two problems, we introduce a novel strategy based on image distance transform for robust self-supervised learning of monocular depth estimation, effectively handling specular reflections in endoscopic scenes. Furthermore, we propose a soft tissue deformation constraint based on biomechanical principles, which mitigates the adverse effects of deformed region pixels, ultimately enhancing the model’s depth estimation precision. Additionally, our method employs a lightweight architecture, ensuring a reduced number of model parameters and faster inference time. Extensive experiments are conducted on both public datasets (SCARED, SERV-CT) and our own datasets to validate the effectiveness of our method. Compared with other SOTA methods, our approach demonstrates comparable accuracy and robustness while ensuring faster inference time. On the SCARED dataset, our approach attains an RMSE of 4.96 mm with only 2.25M model parameters for depth estimation. In particular, experimental results on the SERV-CT dataset and our own datasets further demonstrate the model’s generalization ability and potential clinical value in computer-assisted surgical navigation.

Abstract:
Synthesizing novel perspectives of complex scenes in high quality using sparse image sequences, especially for those without camera poses, is a challenging task. The key to enhancing accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. Therefore, we propose an end-to-end novel view synthesis network named BP-NeRF. It is capable of using sequences of sparse images captured in indoor and outdoor complex scenes to estimate camera motion trajectories and generate novel view images. Firstly, to address the issue of inaccurate prediction of depth map caused by insufficient overlapping features in sparse images, we designed the RDP-Net module to generate depth maps for sparse image sequences and calculate the depth accuracy of these maps, providing the network with a reliable depth prior. Secondly, to enhance the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network’s estimations. We conducted experimental evaluations on the LLFF and Tanks datasets, and the results show that, compared to the current mainstream methods, BP-NeRF can generate more accurate novel views without camera poses.

Abstract:
Ultrasonic image anomaly detection faces significant challenges due to limited labeled data, strong structural and random noise, and highly diverse defect manifestations. To overcome these obstacles, we introduce UltraChip, a new large-scale C-scan benchmark containing about 8,000 real-world images from various chip packaging types, each meticulously annotated with pixel-level masks for cracks, holes, and layers. Building on this resource, we present FSGM-Net, a fully unsupervised framework tailored for anomaly detection. FSGM-Net leverages an adaptive Frequency-Spatial feature filtering mechanism: a learnable FFT-Spatial patch filter first suppresses noise and dynamically assigns normality weights to Vision Transformer (ViT) patch features. Subsequently, an Adaptive Gaussian Mixture Model (Ada-GMM) captures the distribution of normal features and guides a deep–shallow multi-scale interaction decoder for accurate, pixel-level anomaly inference. In addition, we propose a filter loss that enforces encoder–filter consistency and entropy-based sparse gating, together with a distributional loss that encourages both feature reconstruction and confident Gaussian mixture modeling. Extensive experiments demonstrate that FSGM-Net not only achieves state-of-the-art results on UltraChip but also exhibits superior cross-domain generalization to MVTec-AD and VisA, while supporting real-time inference on a single GPU. Together, the dataset and framework advance robust, annotation-free ultrasonic NDT in practical applications. The UltraChip dataset can be obtained via https://iiplab.net/ultrachip/

Abstract:
In recent years, with the development of autonomous driving, 3D reconstruction for unbounded large-scale scenes has attracted researchers’ attention. Existing methods have achieved outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit scenes. Although some methods can edit scenes, they are highly dependent on manually annotated 3D bounding boxes, which leads to poor scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them with separate branches to individually model the dynamic foreground objects and the static background during the training process. By proposing a framework for decoupled scene modeling, we can achieve accurate editing of any dynamic target, such as removing or adding dynamic objects, while also improving the reconstruction quality of autonomous driving scenes, especially for the dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate the performance in 3D reconstruction for both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which more convincingly demonstrate the performance and robustness of our proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor

Abstract:
Mamba and its variants excel at modeling long-range dependencies with linear computational complexity, making them effective for diverse vision tasks. However, Mamba’s reliance on unfolding 1D sequential representations necessitates multiple directional scans to recover lost spatial dependencies. This introduces significant computational overhead, redundant token traversal, and inefficiencies that compromise accuracy in real-world applications. To this end, we propose PH-Mamba, a novel framework integrating position encoding and harmonized attention for image deraining and beyond. PH-Mamba transforms Mamba’s scanning process into a position-guided, unidirectional scanning that selectively prioritizes degradation-relevant tokens. Specifically, we devise a position-guided hybrid Mamba module (PHMM) that jointly encodes perturbation features alongside their spatial coordinates and harmonized representation to model consistent degradation patterns. Within PHMM, a harmonized Transformer is developed to focus on uncertain regions while suppressing noise interference, thereby improving spatial modeling fidelity. Additionally, we employ a vector decomposition and synthesis strategy to enable the unified representation layout to global degradation by directional scanning while minimizing redundancy. By cascading multiple PHMM blocks, PH-Mamba combines global positional guidance with local differential features to strengthen contextual learning. Extensive experiments demonstrate the superiority of PH-Mamba across low-level image restoration benchmarks. For example, compared to NeRD, PH-Mamba achieves a 0.60 dB PSNR improvement while requiring 88.9% fewer parameters, 36.2% less computation, and 63.0% faster inference time.

Abstract:
Using image-level weakly supervised semantic segmentation (WSSS) techniques to segment tissue regions in giga-pixel histopathological whole slide images (WSI) has garnered widespread attention, as it can substantially reduce the annotation workload for pathologists. Most recent studies use class activation mapping (CAM) to generate pseudo masks, which are then used to train a segmentation model in a fully supervised manner. However, accurately segmenting non-predominant tissue categories remains challenging because of long-tailed class distributions and inter-class homogeneity. To address these issues, we propose three designs: 1) Diffusion-based Data Generation to synthesize new images of tail classes to expand the data distribution; 2) Feature Recalibration to reassign the logits in CAM to narrow the feature-level prediction gap between predominant and non-predominant classes; 3) Grade-skip Learning to correct the under-fitting tendency of hard samples during the segmentation phase. Moreover, we design a powerful pipeline, LoHo, for histopathology tissue segmentation. Extensive experiments demonstrate that our method not only achieves new state-of-the-art performance but also significantly improves segmentation of tail classes. In addition, our designs are plug-and-play, making them easily integrable into many mainstream WSSS frameworks.

Abstract:
Infrared and visible image fusion methods have shown promising results, yet existing approaches either compromise downstream detection performance through independent fusion processes or sacrifice computational efficiency and flexibility by requiring joint training of fusion and detection models. To address these challenges, we propose a detection-driven image fusion network based on diffusion models (termed as DDIF), which optimizes the fused images specifically for object detection tasks. Our method features the following three aspects: 1) we reformulate the image fusion process as an inverse problem solved by a non-differentiable optimization process wherein the fused result preserves the source modality information while conforming to the image prior provided by the diffusion model; 2) we design a Response Guide Learning Module (RGLM) to learn response maps, which determine the contribution of each modality in the fusion process according to the downstream detection task; 3) we establish explicit gradient relationships to ensure compatibility between RGLM training and the non-differentiable optimization process, enabling end-to-end training. Notably, a moderate coupling mechanism is formed in our framework as the subsequent detection model is pre-trained and frozen, enabling flexible integration with various advanced detection networks while maintaining computational efficiency. Extensive experiments indicate that our method achieves superior detection performance compared to SOTA approaches and produces high-quality image fusion results.

Abstract:
Recently, few-shot strip steel surface defect segmentation has received increasing attention. However, existing few-shot segmentation methods usually adopt a frozen encoder that is pre-trained on a classification task and can therefore only provide class-related knowledge. We propose a novel method, namely pre-trained variational auto-encoder based latent Gaussian process regression (LGPR), to conduct few-shot strip steel surface defect segmentation. Firstly, different from previous methods, the frozen Variational Auto-Encoder (VAE) based encoder and decoder, which are pre-trained using a pixel-level self-supervised task (i.e., image reconstruction), can provide rich image-related knowledge. This ensures the effective characterization of defect regions. Secondly, by deploying Gaussian process regression in the latent feature space generated by the VAE-based encoder, pixel-level correlation between support features and query features can be efficiently built. This operation is non-parametric and does not introduce any training overhead. Besides, we deploy transformer-based projectors to mine long-range contextual cues of support and query features. Extensive experiments are performed on two public datasets, and the experimental results clearly show that our model consistently outperforms the state-of-the-art models by a large margin. Both the code and results are publicly available at https://github.com/Hlao-hub/LGPR
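The non-parametric step described in this abstract can be illustrated with a minimal sketch of Gaussian process regression between support and query features. It is not the authors' implementation: the RBF kernel, the `noise` and `length_scale` values, and the function names `rbf_kernel` and `gp_predict_mask` are all assumptions made for illustration, and the features are plain arrays rather than VAE latents.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel between rows of a (n, d) and rows of b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predict_mask(support_feats, support_mask, query_feats,
                    noise=1e-2, length_scale=1.0):
    """GP posterior mean of the mask value at each query feature.

    support_feats: (n_s, d) per-pixel support features (assumed given).
    support_mask:  (n_s,) per-pixel labels, 1 for defect and 0 for background.
    query_feats:   (n_q, d) per-pixel query features.
    No parameters are trained: the prediction is a closed-form solve.
    """
    k_ss = rbf_kernel(support_feats, support_feats, length_scale)
    k_qs = rbf_kernel(query_feats, support_feats, length_scale)
    alpha = np.linalg.solve(k_ss + noise * np.eye(len(support_feats)), support_mask)
    return k_qs @ alpha
```

Each query pixel receives a soft score that is a kernel-weighted combination of the support labels, which is why no gradient-based training is needed for this step.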

Abstract:
In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages a human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with a task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The code will be made publicly available at https://github.com/oucailab/DTIUIE

Abstract:
As most optical satellites remotely acquire multispectral images (MSIs) with limited spatial resolution, multispectral unmixing (MU) becomes a critical signal processing technology for analyzing the pure material spectra for high-precision classification and identification. Unlike the widely investigated hyperspectral unmixing (HU) problem, MU is much more challenging as it corresponds to the underdetermined blind source separation (BSS) problem, where the number of sources is larger than the number of available multispectral bands. In this article, we transform MU into its overdetermined counterpart (i.e., HU) by inventing a radically new quantum deep image prior (QDIP), which relies on the virtual band-splitting task conducted on the observed MSI for generating the virtual hyperspectral image (HSI). Then, we perform HU on the virtual HSI to obtain the virtual hyperspectral sources. Though HU is overdetermined, it still suffers from the ill-posed issue, for which we employ the convex geometry structure of the HSI pixels to customize a weighted simplex shrinkage (WSS) regularizer to mitigate the ill-posedness. Finally, the virtual hyperspectral sources are spectrally downsampled to obtain the desired multispectral sources. The proposed geometry/quantum-empowered MU (GQ-μ) algorithm can also effectively obtain the spatial abundance distribution map for each source, where the geometric WSS regularization is adaptively and automatically controlled based on the sparsity pattern of the abundance tensor. Simulation and real-world data experiments demonstrate the practicality of our unsupervised GQ-μ algorithm for the challenging MU task. Ablation study demonstrates the strength of QDIP, not achieved by classical DIP, and validates the mechanics-inspired WSS geometry regularizer. The associated code will be available at https://github.com/IHCLab/GQ-mu

Abstract:
The absence of real-world ground truth (GT) remains a challenge in multi-exposure image fusion (MEF). Existing benchmarks synthesize pseudo GT through algorithm ensembles. Existing methods, hampered by the inherent imperfections of pseudo GT and fixed mapping relationships, show limited performance and robustness. To address these limitations, we propose a novel cross-modal diffusion framework that synergizes text prompts and semantic perception for MEF, termed Diff-MEF. First, it reformulates MEF as a probabilistic estimation task with a conditional diffusion model for progressive transition and fusion. Then, we explicitly infer semantic and exposure priors as text prompts and semantic perception to improve performance and robustness. The priors are synergized through multi-modal prior embedding and optimization guidance. On the one hand, regarding cross-modal interaction, multi-modal priors, including segmentation masks and exposure- and content-aware text prompts, are embedded into the diffusion process by dedicated encoders and refine visual features through a text-segmentation refinement module. On the other hand, a semantic-level contrastive loss builds a regularization between cross-modal features in the semantic space of CLIP to mitigate degradations introduced by pseudo GT and fusion distortions. Experiments demonstrate that Diff-MEF outperforms SOTA methods and pseudo GT with superior fusion performance and robustness across diverse exposure scenarios. Code is available at https://github.com/hanna-xu/Diff-MEF

Abstract:
Online continual learning studies how models learn from continuous and non-stationary data streams. In this paper, we observe that CLIP models exhibit an asymmetric image–text interaction under online continual learning. Specifically, text features of previously seen classes may introduce unfavorable supervision when paired with visual features of newly observed data, leading to catastrophic forgetting. To alleviate this issue, we propose a simple yet effective symmetric image-text tuning (SIT) strategy that removes such asymmetric text supervision during online learning. We further introduce an entropy-guided fusion (EGF) mechanism that adaptively combines predictions from the pretrained and finetuned branches based on their relative uncertainty. This design allows the model to recover pretrained knowledge when the finetuned branch becomes unreliable, while still preserving plasticity on recently observed classes when confidence is high. In addition, we present MiD-Blurry, an online continual learning benchmark that combines multiple class distribution patterns to better reflect realistic data streams with blurred temporal boundaries. Extensive experiments on standard continual learning benchmarks and the MiD-Blurry setting evaluate inference-at-any-time performance and generalization to future data. The results show that the proposed approach maintains a practical balance between adapting to new data and preserving previously learned information in realistic online learning scenarios.
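The entropy-guided fusion idea in this abstract can be sketched as follows. The abstract does not give the exact weighting rule, so the `exp(-entropy)` weighting used here is an assumption chosen only to show the principle (the less uncertain branch receives the larger weight). The names `entropy` and `entropy_guided_fusion` are hypothetical.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def entropy_guided_fusion(p_pre, p_ft):
    """Fuse class probabilities from a pretrained and a finetuned branch.

    Assumed rule (not taken from the paper): each branch is weighted by
    exp(-entropy), normalized over the two branches, so the more
    confident branch dominates the fused prediction.
    """
    h = np.stack([entropy(p_pre), entropy(p_ft)], axis=-1)  # (..., 2)
    w = np.exp(-h)
    w = w / w.sum(axis=-1, keepdims=True)
    fused = w[..., :1] * p_pre + w[..., 1:] * p_ft
    return fused, w
```

Because the weights are non-negative and sum to one, the fused output remains a valid probability distribution. When the finetuned branch is close to uniform, the fused prediction falls back toward the pretrained branch.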

Abstract:
Infrared (IR) and visible image fusion (IVIF) has become prevalent in recent years. By leveraging the complementary characteristics of infrared and visible images, we can obtain visually appealing fused images, which further facilitate subsequent scene understanding and object detection from day to night. Integrating complementary information while simultaneously eliminating redundancy is a crucial challenge in fusion. Most available deep learning-based methods, once trained, execute static inference on all pairs of infrared and visible images. They struggle to effectively handle modality redundancy across diverse scenarios, resulting in superfluous information such as thermal noise in infrared images and artifacts in visible images. In this paper, we propose an IVIF method based on a semantic-guided mixture of multi-feature experts, where multiple types of features are extracted, each assigned to a dedicated expert network specialized in processing a specific type of feature. Through an expert routing mechanism, these experts are chosen dynamically, ensuring that the most significant features of each image modality are routed to a specific group of experts. To align the fusion task with the subsequent semantic segmentation task, we introduce a segmentation head to semantically guide the selection of the complementary features. Extensive experiments on five infrared and visible image fusion and segmentation benchmarks demonstrate the effectiveness of our method, both for image fusion and for subsequent semantic segmentation tasks. The code will be available at https://github.com/ZhilongNiu/SD-MoMFE

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; PCA Laboratory, Nanjing University of Science and Technology, Nanjing, China; School of Cyber Science and Engineering and the Key Laboratory of Computer Network and Information of the Ministry of Education of China, Southeast University, Nanjing, China; Computational Intelligence Center (CIC), School of Computer and Artificial Intelligence, Shandong Jianzhu University, Jinan, China

Abstract:
Deepfake detection remains a challenging research topic, especially when the quality of forged images degrades, leading to unreliable detection results. In this paper, we propose a watermarking-based method for robust proactive deepfake detection. First, we embed a watermark into the Fractional-order Quaternion Exponent Moments (FrQEMs) space of the host face image, achieving a balance between imperceptibility and robustness of the watermarking algorithm. Then, we introduce the Frequency Mamba (FreMamba) block to enhance feature extraction by leveraging correlations between frequency-domain subbands, thereby enabling the extraction of more discriminative feature representations. Finally, at the detection stage, we construct a dual-branch framework comprising a watermark extractor and a forgery discriminator. Through knowledge distillation, the watermark extractor guides the forgery discriminator to perceive forgery traces. Specifically, the integrity of the extracted watermark is compromised only when the host image is subjected to a deepfake attack, while conventional attacks do not affect the integrity. Experimental results on benchmark datasets demonstrate that the proposed method achieves superior deepfake detection accuracy. In particular, when images are subjected to conventional attacks, our method surpasses state-of-the-art approaches by more than 5.3% in terms of ACC.

Abstract:
In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions, which include local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4× and 8×), and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.

Abstract:
Industrial image anomaly detection (IAD) is a pivotal topic of great practical value. Due to the nature of anomalies, real anomalies in a specific modern industrial domain (i.e., domain-specific anomalies) are usually too rare to collect, which severely hinders IAD. Thus, zero-shot anomaly synthesis (ZSAS), which synthesizes pseudo anomaly images without any domain-specific anomaly, emerges as a vital technique for IAD. However, existing solutions are either unable to synthesize authentic pseudo anomalies or require cumbersome training. We therefore focus on ZSAS and propose a brand-new paradigm that can realize both authentic and training-free ZSAS. It is based on a long-ignored fact: although domain-specific anomalies are rare, real anomalies from other domains (i.e., cross-domain anomalies) are actually abundant and directly applicable to ZSAS. Specifically, our new ZSAS paradigm makes three contributions. First, we propose a novel method named Cross-domain Anomaly Injection (CAI), which directly exploits cross-domain anomalies to enable highly authentic ZSAS in a training-free manner. Second, to supply CAI with sufficient cross-domain anomalies, we build what is, to the best of our knowledge, the first Domain-agnostic Anomaly Dataset (DAAD), which provides ZSAS with abundant real anomaly patterns. Third, we propose a CAI-guided Diffusion Mechanism, which can further break the quantity limit of real anomalies and enable unlimited anomaly synthesis. Our head-to-head comparison with existing ZSAS solutions justifies the superior performance of our paradigm for IAD and demonstrates that it is an effective and pragmatic ZSAS solution.

Abstract:
Efficient tiny object detection (TOD) in large-size remote sensing imagery (LSRSI) is particularly challenging in real-world remote sensing applications. We observe that as the input size of the remote sensing scene increases, TOD faces more severe foreground signal identification issues. To address this, we are the first to design a backbone network from the perspective of low-level spatial feature preservation and utilization, specifically for tiny object feature extraction in large-size remote sensing scene patches. The proposed architecture, referred to as the resolution preserving and utilization network (RPUN), demonstrates excellent foreground tiny object feature response identification ability when increasing the input size of remote sensing scenes, effectively maintaining detection performance comparable to that of smaller input slices. Additionally, we introduce GF2UBSv2, a large-scale panchromatic satellite imagery dataset focused on tiny urban bridge detection. Extensive experiments conducted on GF2UBSv2, DIOR, SODA-A, and DOTAv2.0 demonstrate the superior performance of RPUN compared with state-of-the-art methods. The code and dataset are available at: https://github.com//Nankle

Abstract:
Although recent generative models have achieved notable progress in synthesizing realistic facial aging images, many of them, e.g., GAN-based methods, cannot accurately capture the continuous progression of age-related shape-to-texture changes over time. In this paper, we propose an innovative facial age transformation framework that enables the generation of continuous shape-to-texture aging facial images. Firstly, the Prior Latent Age Modulation (PLAM) is designed to leverage the advantages of continuous sampling in high-dimensional space: it uses normalizing flows to achieve a precise and reversible mapping between the age attribute variable distributions and the prior latent space, ensuring smooth transitions as the face ages. Secondly, we introduce the Attentional Feature Fusion (AFF), which dynamically allocates weights to effectively fuse the age attribute features obtained by latent space manipulation with the content features in StyleGAN, thereby generating facial images that accurately depict facial characteristics, from shape to texture, corresponding to specific ages. Finally, through quantitative and qualitative analysis on existing datasets, we validate the effectiveness and superiority of our proposed method in facial aging tasks.

Abstract:
Efficiently and accurately recognizing objects of interest within an image and regressing bounding boxes to enclose them has been a persistent pursuit in object detection. However, existing detectors fail to achieve both aspects simultaneously due to insufficient task interaction and suboptimal classification behavior. To solve this problem, this paper proposes a novel detector with an Efficient Asymmetric Progressive Semi-Decoupled Head (EAPSDH) and a Harmonic Focal Loss (HFL). Specifically, we generalize the detection head into a progressive asymmetric paradigm that performs hierarchical and dynamically recalibrated interaction between classification and localization, enabling iterative mutual enhancement in an efficient manner beyond prior designs. Meanwhile, HFL is proposed to improve classifier optimization by addressing the imbalance between positive and negative samples. HFL dynamically increases the loss weights of positive samples, amplifying their gradient contributions during classifier training, which significantly reduces classification error. By jointly improving task-specific feature representation and classification optimization, EAPSDH and HFL complement each other to alleviate the inconsistency between classification and localization performance, resulting in an efficient and accurate one-stage detector termed EADet. Experimental results on the MS COCO dataset demonstrate that EADet effectively mitigates the inconsistency between classification and localization performance. Furthermore, EADet achieves a strong trade-off between accuracy and speed, reaching 47.4 AP at 33.2 FPS on MS COCO with ResNet-101 under the 2× training schedule, demonstrating its effectiveness compared with recent state-of-the-art detectors. Code will be available at https://github.com/HB-X/EADet

Abstract:
With the rapid advancement of vision-language models (VLMs) in general-purpose settings, their application to cross-modal retrieval and semantic understanding of large-scale multimodal remote sensing (RS) data is emerging as a key enabler for urban governance, environmental monitoring, and disaster response. However, the pervasive issue of semantic shift in RS images poses a significant challenge to the transferability of pre-trained VLMs. To address this limitation, we propose ReCoTR, an enhanced CLIP-based cross-modal retrieval framework tailored for remote sensing applications. ReCoTR tackles region-level granularity bias and contextual semantic drift through a Dual Consensus Token Evaluation (DCTE) module, which leverages a mixture-of-experts strategy to fuse inter-modal semantic consensus with intra-modal structural consistency, enabling fine-grained estimation of semantic confidence for visual tokens. Moreover, to mitigate representational contamination caused by background noise, we introduce the Semantic Confidence Token Compression (SCTC) module. This module selectively filters and aggregates tokens with high semantic relevance, thus reducing redundancy and alleviating the noise amplification inherent in CLIP’s average pooling. Experimental results on three benchmark RS cross-modal retrieval datasets demonstrate that ReCoTR consistently outperforms existing methods on bidirectional image-text retrieval tasks, validating its effectiveness and robustness in remote sensing semantic alignment scenarios. Our source code is available at: https://github.com/Jerry710/ReCoTR.git

Abstract:
Motion cues play a vital role in multi-frame infrared small target detection (MISTD). However, most targets in existing datasets exhibit regular and slow motion, which cannot reflect the complex and diverse motion patterns in real-world scenarios. This biased data distribution makes recent data-driven methods rely heavily on simplified motion assumptions that tend to fail under irregular or fast motion, resulting in noisy feature representations cluttered with target-irrelevant factors. Hence, we stress that methods for MISTD should also work when targets are in complex motion. To enable this research, we propose a large-scale dataset called MIST for airborne infrared detection scenarios. The dataset is built on a synthetic data engine that models variations in pose, size, and intensity of moving targets while seamlessly blending them into real backgrounds for physical, geometric, and visual realism. Targets in MIST exhibit low signal-to-clutter ratios and complex motion, making it a promising yet challenging benchmark for developing algorithms focused on motion analysis. To tackle the challenges of MIST, we develop MISTNet, a robust baseline based on the Information Bottleneck theory. To handle irregular and fast motion, we propose a shifted neighborhood compensation block to efficiently model multi-scale correspondences for implicit motion compensation. To distill compact representations free from irrelevant cues, we design a progressive distillation decoder to hierarchically filter out redundancy while preserving target-relevant information. We benchmark 31 state-of-the-art methods and find that their performance on MIST drops significantly compared with that on the widely used NUDT-MIRSDT dataset. Our MISTNet outperforms all other methods by a large margin, with a gain of over 6% in the IoU metric, demonstrating its superiority. The dataset, code, and model weights are available at https://github.com/GR-ray/MIST

Abstract:
The open set known class bias is conventionally viewed as a fatal problem, i.e., models trained solely on known classes tend to fit unknown classes to known classes with high confidence at inference. Existing methods therefore, without exception, take one of two routes: most strive to eliminate the known class bias as much as possible, while others circumvent it by employing a reconstruction method. In this paper, however, we challenge these two widely accepted approaches and present a novel proposition: the known class bias that is harmful for most methods is, on the contrary, beneficial for reconstruction-based methods, and can thus serve as a positive incentive for open set recognition (OSR) models from a reconstruction perspective. Along this line, we propose the Bias Enhanced Reconstruction Learning (BERL) framework to enhance the known class bias at the class level, model level, and sample level. Specifically, at the class level, a specific representation is constructed in a supervised contrastive manner to avoid overgeneralization, while at the model level, a diffusion model is employed, with the class prior injected to guide the biased reconstruction. Additionally, we leverage the advantages of the diffusion model to design a self-adaptive strategy, enabling effective sample-level biased sampling based on the information bottleneck theory. Experiments on various benchmarks demonstrate the effectiveness and superior performance of the proposed method.

Abstract:
Nuclei segmentation and classification in Hematoxylin and Eosin (H&E) stained histology images play a vital role in cancer diagnosis, treatment planning, and research. However, accurate segmentation can be hindered by factors like irregular cell shapes, unclear boundaries, and class imbalance. To address these challenges, we propose the Adaptive Gated Attention Fusion Network (AGAFNet), which integrates three innovative attention-based blocks into a U-shaped architecture complemented by dedicated decoders for both segmentation and classification tasks. These blocks comprise the Channel-wise and Spatial Attention Integration Block (CSAIB) for enhanced feature representation and selective focus on informative regions; the Adaptive Gated Convolutional Block (AGCB) for robust feature selection throughout the network; and the Fusion Attention Refinement Block (FARB) for effective information fusion. AGAFNet leverages these elements to provide a robust solution for precise nuclei segmentation and classification in H&E stained histology images. We evaluate the performance of AGAFNet on three large-scale multi-tissue datasets: PanNuke, CoNSeP, and Lizard. The experimental results demonstrate our proposed AGAFNet achieves comparable performance to state-of-the-art methods.

Abstract:
Existing Image Quality Assessment (IQA) models are limited to either full reference or no reference evaluation tasks, while humans can seamlessly switch between these assessment types. This motivates us to explore resolving these two tasks using a versatile model. In this work, we propose a novel framework that unifies full reference and no reference IQA. Our approach utilizes an encoder to extract multi-level features from images and introduces a Hierarchical Attention module to adaptively handle spatial distortions for both full reference and no reference inputs. Additionally, we develop a Semantic Distortion Aware module to analyze feature correlations between shallow and deep layers of the encoder, thereby accounting for the varying effects of different distortions on these layers. Our proposed framework achieves state-of-the-art performance for both full-reference and no-reference IQA tasks when trained separately. Furthermore, when the model is trained jointly on both types of tasks, it not only enhances performance in no-reference IQA but also maintains competitive results in full-reference IQA. This integrated approach facilitates a single training process that efficiently addresses both IQA tasks, representing a significant advancement in model versatility and performance.

Abstract:
Polarization and intensity image fusion (PIF) has attracted extensive attention, as it can generate images with clear scene information and salient texture details of the object surface that are important for downstream applications. However, existing deep learning-based PIF methods usually lack interpretability and ignore the interactions among multi-modal features. To this end, we propose a novel interpretable low-rank sparse representation guided fusion network for polarization and intensity images (termed LSRNet). Specifically, a low-rank sparse representation deep unfolding module is designed to acquire the base and detail features of the source images, which improves the interpretability of the network. In addition, a cross-modal connection complementary feature extraction module is proposed, which aims to establish dependencies among the features of the different modalities to fully extract complementary features of the source images. To demonstrate the validity of our LSRNet and to address the shortcomings of existing datasets for PIF, a multi-scene polarization and intensity image dataset, named the MSPI dataset, is constructed, which includes 1034 high-resolution aligned image pairs. To the best of our knowledge, this is the most comprehensive dataset for PIF, with a large number of image pairs, high resolution, and multiple scene types. Extensive experiments on our MSPI dataset and two publicly available datasets (i.e., 12CFC and HCP) demonstrate the superior fusion performance, generalization ability, and desirable running efficiency of our LSRNet. Our code and dataset will be publicly available at https://github.com/thebinyang/LSRNet

Abstract:
We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.

Abstract:
Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose the first compression-oriented point cloud voxelization network jointly optimized with a differentiable G-PCC surrogate model. The surrogate model mimics the rate-distortion behavior of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-off. During inference, only the lightweight voxelization network is prepended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.

Abstract:
Synthetic Aperture Radar (SAR) imaging relies on focusing algorithms to transform raw measurement data into radar images. These algorithms require knowledge of SAR system parameters, such as wavelength, center slant range, fast time sampling rate, pulse repetition interval, waveform, and platform speed. However, in non-cooperative scenarios or when metadata is corrupted, these parameters are unavailable, rendering traditional algorithms ineffective. To address this challenge, this article presents a novel parameter-free method that recovers SAR images from raw data without requiring any SAR system parameters. First, we introduce an approximated matched filtering model that leverages the shift-invariance properties of SAR echoes, enabling image formation by convolving the raw data with an unknown reference echo. Second, we develop a Principal Component Maximization (PCM) method that exploits the low-dimensional structure of SAR signals to estimate the reference echo. The PCM method employs a three-stage procedure: 1) segment raw data into blocks; 2) normalize the energy of each block; and 3) maximize the principal component’s energy across all blocks, enabling robust estimation of the reference echo under non-stationary clutter. Experimental results on various SAR datasets demonstrate that our method can effectively recover SAR images from raw data without any system parameters. To facilitate reproducibility, the MATLAB program is available at https://github.com/huizhangyang/pcm
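The three stages can be pictured with a minimal one-dimensional sketch. It assumes that "maximizing the principal component's energy" can be approximated by finding the unit vector with the largest total projected energy over the normalized blocks, computed here by power iteration. This is an interpretation for illustration, not the authors' released MATLAB program; `estimate_reference_echo`, `block_len`, and `iters` are hypothetical names.

```python
import math

def estimate_reference_echo(raw, block_len, iters=50):
    """Estimate a reference echo (up to a phase) from complex raw samples."""
    # 1) Segment the raw samples into equal-length blocks (remainder dropped).
    n_blocks = len(raw) // block_len
    blocks = [raw[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]

    # 2) Normalize each block to unit energy; all-zero blocks are skipped.
    normed = []
    for b in blocks:
        energy = math.sqrt(sum(abs(x) ** 2 for x in b))
        if energy > 0:
            normed.append([x / energy for x in b])
    if not normed:
        raise ValueError("no non-zero blocks to estimate from")

    # 3) Power iteration: find the unit vector h that maximizes
    #    sum_i |<b_i, h>|^2, i.e. the principal component of the blocks.
    h = list(normed[0])
    for _ in range(iters):
        new = [0j] * block_len
        for b in normed:
            c = sum(x.conjugate() * y for x, y in zip(b, h))  # <b, h>
            for k in range(block_len):
                new[k] += c * b[k]
        norm = math.sqrt(sum(abs(x) ** 2 for x in new))
        h = [x / norm for x in new]
    return h
```

Because every block is rescaled to unit energy before the maximization, a few strong scatterers cannot dominate the estimate, which is one plausible reading of the robustness claim above.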

Affiliations: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, China; Brain-Computer Interfacing and Neural Engineering Laboratory, School of Computer Science and Electronic Engineering, University of Essex, Colchester, Essex, U.K.; Shanghai Lansheng Brain Hospital Investment Company Ltd., Shanghai, China; Systems Research Institute of Polish Academy of Sciences, Warsaw, Poland; Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, the School of Mathematics, and ECUST Medical-Engineering Integration Innovation Center, East China University of Science and Technology, Shanghai, China

Abstract:
Steady-state visual evoked potential (SSVEP)-based brain–computer interfaces (BCIs) have been widely studied for their fast response speeds and high information transfer rates. However, it remains difficult to fully exploit the data of existing subjects, that is, to mine the information common to different subjects and transfer it to new subjects when only a small amount of data is available. To address this problem, this study proposes a deep neural network based on the pyramid squeeze attention mechanism (PSA-DNN) that enhances the performance of SSVEP-BCIs through common information transfer. Specifically, the band-pass filtered EEG signals are first Fourier transformed to obtain frequency-domain information, which is then fed into a deep neural network, followed by a spatial convolution step that extracts spatial-domain information. To further improve the quality of information extraction, a pyramid attention module is introduced into the network to enhance the frequency-domain and spatial-domain information. Time-domain information is then extracted from the EEG signals using temporal convolution. Finally, a fully connected layer outputs the recognition results. The model is trained in three stages for SSVEP target recognition. The first stage uses data from all participants in the training set to learn common information, and the resulting model parameters are transferred to the network in the second stage. In the second stage, part of the data from participants in the test set is used for fine-tuning, which captures personalized information from these new participants. The third stage uses the remaining data from participants in the test set to produce classification results.
The proposed method is systematically evaluated on the Benchmark and BETA datasets, where it demonstrates favorable performance compared to established baselines. These findings contribute theoretical insights and methodological references for the application of SSVEP-based brain–computer interfaces in real-world scenarios.
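The data flow of the three training stages can be sketched as a simple partition. The sketch assumes trials are stored per subject in a dictionary and that a fixed leading fraction of each test subject's trials is reserved for fine-tuning; the abstract does not state the actual proportion, and `three_stage_split` and `finetune_fraction` are hypothetical names.

```python
def three_stage_split(data, test_subjects, finetune_fraction=0.5):
    """Partition per-subject trials into the three stages.

    data: dict mapping subject id -> ordered list of trials.
    Returns (stage1, stage2, stage3) dicts keyed by subject id.
    """
    # Stage 1: all trials from training-set subjects (common information).
    stage1 = {s: list(t) for s, t in data.items() if s not in test_subjects}

    stage2, stage3 = {}, {}
    for s in test_subjects:
        trials = data[s]
        cut = int(len(trials) * finetune_fraction)
        stage2[s] = trials[:cut]   # Stage 2: fine-tuning on new subjects
        stage3[s] = trials[cut:]   # Stage 3: held-out evaluation
    return stage1, stage2, stage3

data = {"S1": [0, 1, 2, 3], "S2": [4, 5, 6, 7], "S3": [8, 9, 10, 11]}
pretrain, finetune, evaluate = three_stage_split(data, {"S3"})
```

Keeping the stage 2 and stage 3 trials disjoint is what makes the reported classification results a test of transfer rather than of memorization.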

Abstract:
Deep neural networks using generative diffusion priors have provided state-of-the-art performance for the task of blind image super-resolution. Thanks to their powerful image generation capability, these deep networks are able to produce high-quality visual signals with realistic textures and structures. However, since these schemes employ a very large number of parameters, their training process is often difficult, and their performance can therefore be limited. To address this, we propose a diffusion-based blind image super-resolution scheme that provides superior results by using a novel learning algorithm with invertible neural networks. Specifically, we argue that, because of their reversibility property, invertible neural networks are able to generate degraded low-quality images whose super-resolved versions are the upper bound of the image super-resolution function space. Including such visual signals in the training process of our blind image super-resolution network facilitates learning and leads to higher performance. We show that our proposed blind image super-resolution scheme outperforms the state-of-the-art methods.

Abstract:
Retinal image registration (RIR) plays an important role in the diagnosis and long-term monitoring of retinal diseases. Retinal image global registration (RIGR) is usually the first step of RIR. Traditional methods often struggle to achieve robust keypoint detection and description when faced with high-resolution, fine-textured retinal images. Deep learning-based methods for this task have not been widely developed. Therefore, we propose a keypoint detection and description network based on local feature saliency, EyeKey, for RIGR. EyeKey uses the “Detect While Describing (DWD)” design. Specifically, two proposed UDPAM++ modules are embedded into the feature description network to enhance its feature description capability. Concurrently, these modules detect distinctive keypoints based on local feature saliency, combined with a Mapping Module featuring only three learnable parameters. Moreover, we achieve self-supervised feature description network training on high-resolution, fine-textured retinal images through the Random Local Hardest Example Mining strategy. Additionally, we realize robust unsupervised keypoint detection network training based on the High Matching Probability Defines Keypoints strategy and the proposed Cumulative Salient Keypoint Expansion, which, together with the DWD design, mutually reinforce the training of the keypoint detection and description network. Finally, combined with the feature-based RIGR pipeline, our method achieves outstanding performance while maintaining excellent inference speed on monomodal and multimodal RIGR evaluation datasets.

Abstract:
The multi-class classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose, for the first time, a cross-patient pseudo-bag generation framework to address this challenge; it extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework shifts toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets with multiple classification difficulties, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in ACC.