TIP2025

Abstract:
Most dataset distillation methods struggle to accommodate large-scale datasets due to their substantial computational and memory requirements. Recent research has begun to explore scalable disentanglement methods. However, there are still performance bottlenecks and room for optimization in this direction. In this paper, we present a curriculum-based dataset distillation framework aiming to harmonize performance and scalability. This framework strategically distills synthetic images, adhering to a curriculum that transitions from simple to complex. By incorporating curriculum evaluation, we address the issue of previous methods generating images that tend to be homogeneous and simplistic, doing so at a manageable computational cost. Furthermore, we introduce adversarial optimization towards synthetic images to further improve their representativeness and safeguard against their overfitting to the neural network involved in distilling. This enhances the generalization capability of the distilled images across various neural network architectures and also increases their robustness to noise. Extensive experiments demonstrate that our framework sets new benchmarks in large-scale dataset distillation, achieving substantial improvements of 11.1% on Tiny-ImageNet, 9.0% on ImageNet-1K, and 7.3% on ImageNet-21K. Our distilled datasets and code are available at https://github.com/MIV-XJTU/CUDD

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; the College of Computing and Data Science, Nanyang Technological University, Singapore; College of Software and Microelectronics, Peking University, Beijing, China; Centre for Frontier AI Research (CFAR) and the Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Connexis, Singapore; College of Computer Science and the National Key Laboratory of Fundamental Algorithms and Models for Engineering Simulation, Sichuan University, Chengdu, China

Abstract:
Visual-textual retrieval, as a link between computer vision and natural language processing, aims at jointly learning visual-semantic relevance to bridge the heterogeneity gap across visual and textual spaces. Existing methods conduct retrieval relying only on the ranking of pairwise similarities, but they cannot self-evaluate the uncertainty of retrieved results, resulting in unreliable retrieval and hindering interpretability. To address this problem, we propose a novel Trust-Consistent Learning framework (TCL) to endow visual-textual retrieval with uncertainty evaluation for trustworthy retrieval. More specifically, TCL first models the matching evidence according to cross-modal similarity to estimate the uncertainty for cross-modal uncertainty-aware learning. Second, a simple yet effective consistency module is presented to enforce the subjective opinions of bidirectional learning to be consistent for high reliability and accuracy. Finally, extensive experiments are conducted to demonstrate the superiority and generalizability of TCL on six widely-used benchmark datasets, i.e., Flickr30K, MS-COCO, MSVD, MSR-VTT, ActivityNet, and DiDeMo. Furthermore, some qualitative experiments are carried out to provide comprehensive and insightful analyses for trustworthy visual-textual retrieval, verifying the reliability and interpretability of TCL. The code is available at https://github.com/QinYang79/TCL
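
The evidence-to-uncertainty step can be illustrated with a minimal sketch in the style of subjective logic, as commonly used in evidential deep learning. It is not taken from the TCL code: the exponential evidence function, the temperature tau, and the function name opinion_from_similarities are assumptions made for illustration. Given one query's similarities to K candidates, the sketch forms non-negative evidence e_k, Dirichlet parameters alpha_k = e_k + 1, belief masses b_k = e_k / S, and an uncertainty mass u = K / S, where S is the sum of all alpha_k.

```python
import math


def opinion_from_similarities(sims, tau=0.1):
    """Turn similarity scores into belief masses and an uncertainty mass.

    Assumption: evidence e_k = exp(s_k / tau); the paper may use a different
    non-negative evidence function.
    """
    evidence = [math.exp(s / tau) for s in sims]
    alpha = [e + 1.0 for e in evidence]          # Dirichlet parameters
    strength = sum(alpha)                        # S = sum_k alpha_k
    beliefs = [e / strength for e in evidence]   # b_k = e_k / S
    uncertainty = len(sims) / strength           # u = K / S
    return beliefs, uncertainty


# A query with one clear match versus a query whose scores are all low.
confident_b, confident_u = opinion_from_similarities([0.9, 0.1, 0.0])
ambiguous_b, ambiguous_u = opinion_from_similarities([0.05, 0.04, 0.05])
```

By construction the belief masses and the uncertainty mass sum to one, and the query with uniformly low similarities receives the larger uncertainty, which is the kind of signal a retrieval system could use to flag unreliable results.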

Abstract:
In this paper, we propose a novel visual relation detection task, named Group Visual Relation Detection (GVRD), for detecting visual relations whose subjects and/or objects are groups (GVRs), inspired by the observation that groups are common in image semantic representation. GVRD can be deemed an evolution of the existing visual relation detection task, which limits both subjects and objects of visual relations to individuals. We propose a Simultaneous Group Relation Prediction (SGRP) method that can simultaneously predict groups and predicates to address GVRD. SGRP contains an Entity Construction (EC) module, a Feature Extraction (FE) module, and a Group Relation Prediction (GRP) module. Specifically, the EC module constructs instances, group candidates, and phrase candidates; the FE module extracts visual, location and semantic features for these entities; and the GRP module simultaneously predicts groups and predicates, and generates the GVRs. Moreover, we construct a new dataset, named COCO-GVR, to facilitate solutions to the GVRD task, which consists of 9,570 images from the COCO dataset and 31,855 manually labeled GVRs. We test and validate the performance of SGRP by extensive experiments on the COCO-GVR dataset. The results show that SGRP outperforms the baselines generated from the state-of-the-art visual relation detection and scene graph generation methods.

Abstract:
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects, this new setting recognizes the inherent complexity of real-world images, comprising multiple objects, and the ambiguity in defining salient objects due to different user intentions. To study this task, we present two new SOD datasets “DUTS-MM” and “DUTS-MQ”, along with newly designed evaluation metrics. DUTS-MM builds upon the DUTS dataset but enriches the ground-truth mask annotations in three respects: it 1) improves the mask quality, especially for boundaries and fine-grained structures; 2) alleviates the annotation inconsistency issue; and 3) provides multiple ground-truth masks for images with saliency ambiguity. DUTS-MQ consists of approximately 100K image-mask pairs with human-annotated preference scores, enabling the learning of real human preferences in measuring mask quality. Building upon these two datasets, we propose a simple yet effective pluralistic SOD baseline based on a Mixture-of-Experts (MOE) design. Equipped with two prediction heads, it simultaneously predicts multiple masks using different query prompts and predicts human preference scores for each mask candidate. Extensive experiments and analyses underscore the significance of our proposed datasets and affirm the effectiveness of our PSOD framework.

Abstract:
Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. This presents a substantial practical challenge, given the difficulty in obtaining annotated texts for person images. This work undertakes a pioneering initiative to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. Later, the retrieval stage performs fully-supervised retrieval learning using the augmented data. Crucially, considering the noise interference of the pseudo-texts on retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask performs masking on the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.

Abstract:
Virtual reality (VR) makes it possible to provide immersive multimedia content composed of omnidirectional videos (ODVs). Towards enabling more immersive and satisfying VR content, methods are needed to manipulate VR scenes, taking into account perceptual factors related to viewers’ quality of experience (QoE). For example, style transfer methods can be applied to VR content, allowing users to create artistic or surreal effects in their immersive environments. Here, we study perceptual factors that affect the sensation of stylized immersiveness, including color dynamics and spatio-temporal consistency. To do this, we introduce an immersiveness sensitivity model of luminance and color perception, and use it to measure the color dynamics and spatio-temporal consistency of stylized VR contents. We subsequently use this model to construct a perceptually-guided VR style transfer model called VR Style Transfer GAN (VRST-GAN). VRST-GAN learns to transfer a desired style into VR to enhance immersiveness by considering color dynamics while preserving spatio-temporal consistency. We demonstrate the effectiveness of VRST-GAN via qualitative and quantitative experiments. We also develop a VR Immersiveness Predictor (VR-IP) that is able to predict the sensation of immersiveness using the perceptual model. In our experiments, VR-IP predicts immersiveness with an accuracy of 91%.

Abstract:
Video-text retrieval is a crucial task in numerous computer vision applications. In this paper, we focus on video-text retrieval involving complex action compositions, where a single video encompasses multiple primitive actions such as “sitting up”, “opening door”, “cooking food”, and “eating.” Despite the common occurrences in real-world scenarios, such action-compositional videos have received limited research attention, often leading to significant performance degradations in existing retrieval methods. To address this challenge, we present Hyperbolic Video-tExt Retrieval (HOVER), which models the hierarchical semantic relationships between videos and texts by embedding them in a low-dimensional hyperbolic space. Since hyperbolic space provides a geometric prior that naturally aligns with hierarchical data, it allows for more efficient and generalizable representations of video-text semantic hierarchies. HOVER first longitudinally decomposes each video into a hierarchical action tree, where primitive mono-actions are represented as leaf nodes and increasingly complex action compositions as parent nodes. The semantic structures and temporal dependencies of videos/texts are then encoded in hyperbolic space by exploiting hyperbolic distance, norm, and relative cosine similarity. Experimental results show that HOVER significantly outperforms traditional Euclidean-based methods, particularly in scenarios with limited training labels, achieving a notable performance improvement of 28.83%. Additionally, the hyperbolic video-text embeddings learned by HOVER demonstrate strong generalization across new datasets containing videos with varying levels of action complexity. The source code is available at https://github.com/shi-rq/HOVER
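
As a rough illustration of why hyperbolic space suits tree-like data, the sketch below computes the geodesic distance in the Poincare ball, one standard model of hyperbolic space. Whether HOVER uses this model, and how it combines distance, norm, and relative cosine similarity, is not stated in the abstract, so the choice of model, the function name poincare_distance, and the toy points are all assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Toy layout: a composite action near the origin, two primitive actions
# (leaves) near the boundary of the ball.
root = [0.0, 0.0]
leaf_a = [0.9, 0.0]
leaf_b = [0.0, 0.9]

root_to_leaf = poincare_distance(root, leaf_a)
leaf_to_leaf = poincare_distance(leaf_a, leaf_b)
```

In this toy layout the leaf-to-leaf distance is close to twice the root-to-leaf distance, so the shortest route between two leaves behaves almost like a path through their common parent, which mirrors distances in a tree.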

Abstract:
To deal with high-dimensional unlabeled datasets in many areas, principal component analysis (PCA) has become a rising technique for unsupervised feature selection (UFS). However, most existing PCA-based methods only consider the structure of datasets by embedding a single sparse regularization or constraint on the transformation matrix. In this paper, we introduce a novel bi-sparse method called BSUFS to improve the performance of UFS. The core idea of BSUFS is to incorporate the $\ell_{2,p}$-norm and the $\ell_q$-norm into the classical PCA, which enables our method to select relevant features and filter out irrelevant noises, thereby obtaining discriminative features. Here, the parameters $p$ and $q$ are within the range $[0, 1)$. Therefore, BSUFS not only constructs a unified framework for bi-sparse optimization, but also includes some existing works as special cases. To solve the resulting non-convex model, we propose an efficient proximal alternating minimization (PAM) algorithm using Stiefel manifold optimization and sparse optimization techniques. In addition, the computational complexity analysis is presented. Extensive numerical experiments on synthetic and real-world datasets demonstrate the effectiveness of our proposed BSUFS. The results reveal the advantages of bi-sparse optimization in feature selection and show its potential for other fields in image processing. Our code is available at https://github.com/xianchaoxiu/BSUFS.
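
To make the two sparsity terms concrete, the following sketch evaluates them on a small matrix using their common definitions: the row-wise l_{2,p} term sums the p-th powers of the row norms, and the element-wise l_q term sums the q-th powers of the absolute entries. How these terms enter the BSUFS objective is not given in the abstract; the function names, the convention that rows correspond to features, and the example matrix are assumptions.

```python
import math


def l2p_power(W, p):
    """Sum over rows of ||w_i||_2 ** p; for p = 0 this counts non-zero rows."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in W]
    if p == 0:
        return float(sum(1 for r in row_norms if r > 0))
    return sum(r ** p for r in row_norms)


def lq_power(W, q):
    """Sum over entries of |w_ij| ** q; for q = 0 this counts non-zero entries."""
    entries = [abs(x) for row in W for x in row]
    if q == 0:
        return float(sum(1 for x in entries if x > 0))
    return sum(x ** q for x in entries)


# Hypothetical transformation matrix with one row per feature.
W = [
    [3.0, 4.0],   # kept feature, row norm 5
    [0.0, 0.0],   # discarded feature (entire row is zero)
    [0.0, 1.0],   # kept feature with a single non-zero loading
]

row_sparsity = l2p_power(W, 0)       # number of selected features
entry_sparsity = lq_power(W, 0)      # number of non-zero loadings
row_term_half = l2p_power(W, 0.5)    # a value of p inside [0, 1)
```

The row-wise term rewards zeroing out whole rows (feature selection), while the element-wise term rewards zeroing individual entries, which is one way to read the "bi-sparse" idea.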

Abstract:
Underwater creature segmentation (UCS) is critical for marine research and robotics but faces unique challenges: environmental distortions and biological traits that distinguish it from terrestrial segmentation. While deep learning advances exist, current UCS models are constrained to low-resolution inputs, losing critical details when processing high-resolution (HR) imagery and degrading segmentation precision. To bridge this gap, we introduce UCS4K, the first large-scale HR dataset for UCS, containing 4,096 images with pixel-wise annotations. UCS4K offers 4 times higher average resolution than existing datasets, covering diverse species, habitats, and environmental complexities essential for robust model training. Additionally, we propose a Resolution-Asymmetric Dual-branch Alignment and Refinement (RADAR) network to address the efficiency-receptiveness trade-off in HR-UCS. RADAR decouples context and detail processing: a CNN branch preserves HR spatial details, while a Transformer branch models global semantics on downsampled inputs to avoid quadratic complexity. Crucially, it resolves the inherent semantic misalignment issue between branches via the Global Semantic Alignment (GSA) module in the encoder and the Bidirectional Collaborative Refinement (BCR) module-embedded decoder that progressively integrates multi-scale encoding features to sharpen boundaries. This asymmetric design ensures efficient long-range context capture without sacrificing spatial precision. Extensive benchmarks demonstrate that RADAR sets new state-of-the-art performance on UCS4K and other existing datasets. Our contributions establish the first HR benchmark for UCS and deliver a scalable framework for high-precision segmentation. Dataset, code, and models are available at https://github.com/WHYfromNUT/RADAR.

Abstract:
Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the value of using MFA for the description of homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method to model and segment multifractal textures by jointly estimating the multifractal parameters and labels on images at the pixel level. For this, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.

Abstract:
Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises because MEI is not shared across modalities, so aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategies for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal inputs for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL’s efficacy with a 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.

Abstract:
Ensemble clustering fuses a set of base clusterings and shows promising capability in achieving more robust and better clustering results. The existing methods usually realize ensemble clustering by adopting a co-association matrix to measure how many times two data points are categorized into the same cluster based on the base clusterings. Though great progress has been achieved, the obtained co-association matrix is constructed based on the combination of different connective matrices or their variants. These methods ignore exploring the inherent latent space shared by multiple connective matrices and learning the corresponding co-association matrices according to this latent space. Moreover, these methods neglect to learn discriminative connective matrices, explore the high-order relation among these connective matrices, and consider the latent space in a unified framework. In this paper, we propose Latent spacE leArning baseD Ensemble Clustering (LEADEC), which introduces the latent space shared by different connective matrices and learns the corresponding connective matrices according to this latent space. Specifically, we factorize the original multiple connective matrices into a consensus latent space representation and the specific connective matrices. Meanwhile, an orthogonal constraint is imposed to make the latent space representation more discriminative. In addition, we collect the obtained connective matrices based on the latent space into a third-order tensor to investigate the high-order relations among these connective matrices. The connective matrix learning, the high-order relation investigation among connective matrices, and the latent space representation learning are integrated into a unified framework. Experiments on seven benchmark datasets confirm the superiority of LEADEC compared with the existing representative methods.
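
For readers unfamiliar with the terminology, the sketch below builds the classical objects the abstract starts from: one binary connective matrix per base clustering, and a co-association matrix obtained by averaging them. This is only the conventional baseline construction, not LEADEC's latent-space factorization; the function names and the toy label vectors are illustrative assumptions.

```python
def connective_matrix(labels):
    """Binary n x n matrix: entry (i, j) is 1 if points i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def co_association(base_clusterings):
    """Average the connective matrices of all base clusterings."""
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    total = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        C = connective_matrix(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += C[i][j]
    return [[total[i][j] / m for j in range(n)] for i in range(n)]


# Three base clusterings of four data points (cluster ids are arbitrary).
base_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = co_association(base_clusterings)
```

Here points 0 and 1 are grouped together by every base clustering, so A[0][1] is 1.0, while points 0 and 3 never share a cluster, so A[0][3] is 0.0. Stacking the three connective matrices instead of averaging them gives the kind of third-order tensor the abstract refers to.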

Abstract:
Audio-visual segmentation (AVS) aims to segment objects in audio-visual content. The effective interaction between audio and visual features has garnered significant attention from the multimodal domain. Despite significant advancements, most existing AVS methods are hampered by multimodal inconsistencies. These inconsistencies primarily manifest as a mismatch between audio and visual information guided by audio cues, wherein visual features often dominate audio modality. To address this issue, we propose the Consistency-Queried Transformer (CQFormer), a novel framework for AVS tasks that leverages the transformer architecture. This framework features a Consistency Query Generator (CQG) and a Query-Aligned Matching (QAM) module. The Noise Contrastive Estimation (NCE) loss function enhances modality matching and consistency by minimizing the distributional differences between audio and visual features, facilitating effective fusion and interaction between these features. Additionally, introducing the consistency query during the decoding stage enhances consistency constraints and object-level semantic information, further improving the accuracy and stability of audio-visual segmentation. Extensive experiments on the popular benchmark of the audio-visual segmentation dataset demonstrate that the proposed CQFormer achieves state-of-the-art performance.

Abstract:
Transformers are leading a trend in the field of image processing. While existing lightweight image processing transformers have achieved notable success, they primarily focus on reducing FLOPs (floating-point operations) or the number of parameters, rather than on practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block that substitutes memory-intensive operators with the combination of self-attention and convolutions to achieve practical speedup. Specifically, we propose a novel non-volatile sparse masking self-attention (NVSM-SA) that utilizes a pre-computed sparse mask to capture contextual information from a larger window with no extra computational overhead. Besides, a high-frequency reparameterization module (HRM) is proposed to make the LIPT block reparameterization-friendly, enhancing the model’s ability to reconstruct fine details. Extensive experiments on multiple image processing tasks (e.g., image super-resolution (SR), JPEG artifact reduction, and image denoising) demonstrate the superiority of LIPT on both latency and PSNR. LIPT achieves real-time GPU inference with state-of-the-art performance on multiple image SR benchmarks. The source codes are released at https://github.com/Lucien66/LIPT

Affiliations: School of Telecommunications Engineering, Xidian University, Xi’an, China; School of Computer Science, School of Artificial Intelligence, Optics and Electronics (iOPEN), and the Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Artificial Intelligence, Optics and Electronics (iOPEN) and the Key Laboratory of Intelligent Interaction and Applications, Ministry of Industry and Information Technology, Northwestern Polytechnical University, Xi’an, China; Institute of Artificial Intelligence (TeleAI), China Telecom, Beijing, China

Abstract:
Existing deep clustering methods leverage contrastive or non-contrastive learning to facilitate downstream tasks. Most contrastive-based methods typically learn representations by comparing positive pairs (two views of the same sample) against negative pairs (views of different samples). However, we observe that this hard treatment of samples ignores inter-sample relationships, leading to class collisions and degraded clustering performance. In this paper, we propose a soft neighbor supported contrastive clustering method to address this issue. Specifically, we first introduce a concept called perception radius to quantify similarity confidence between a sample and its neighbors. Based on this insight, we design a two-level soft neighbor loss that captures both local and global neighborhood relationships. Additionally, a cluster-level loss enforces compact and well-separated cluster distributions. Finally, we conduct a pseudo-label refinement strategy to mitigate false negative samples. Extensive experiments on benchmark datasets demonstrate the superiority of our method. The code is available at https://github.com/DuannYu/soft-neighbors-supported-clustering

Abstract:
Cross-Modal Hashing (CMH) has become a powerful technique for large-scale cross-modal retrieval, offering benefits like fast computation and efficient storage. However, most CMH models struggle to adapt to streaming multimodal data in real-time once deployed. Although recent online CMH studies have made progress in this area, they often overlook two key challenges: 1) learning effectively from streaming partial-modal multimodal data, and 2) avoiding the high costs associated with frequent hash function re-training and large-scale updates to database hash codes. To address these issues, we propose Fast Partial-modal Online Cross-Modal Hashing (FPO-CMH), the first approach to tackle online cross-modal hash learning with partial-modal data. This marks a significant shift from previous methods that rely on fully-available multimodal data. Specifically, our approach introduces a multimodal dual-tier anchor bank, initialized using offline training data, which allows offline-trained CMH models to adapt seamlessly to partial-modal data while progressively updating the anchor bank. By leveraging gradient accumulation and asynchronous optimization, FPO-CMH facilitates efficient online cross-modal hash learning. Additionally, an initial-anchor rehearsal strategy is employed to prevent model catastrophic forgetting during online optimization, ensuring the code invariance of database hash codes and eliminating the need for frequent hash function re-training. Extensive experiments validate the superiority of FPO-CMH, especially in handling streaming partial-modal multimodal data, a more realistic scenario. The source codes and datasets are available at https://github.com/DandelionWow/FPO-CMH

Abstract:
Lighting enhancement is a classical topic in low-level image processing. Existing studies mainly focus on global illumination optimization while overlooking local semantic objects, and this limits the performance of exposure compensation. In this paper, we introduce SRENet, a novel lighting enhancement network guided by saliency information. It adopts a two-step strategy of foreground-background separation optimization to achieve a balance between global and local illumination. In the first step, we extract salient regions and implement the local illumination enhancement that ensures the exposure quality of salient objects. Next, we utilize a fusion module to process global lighting optimization based on the locally enhanced results. With the two-step strategy, the proposed SRENet yields better lighting enhancement for local illumination while preserving the globally optimal results. Experimental results demonstrate that our method obtains more effective enhancement results for various tasks of exposure correction and lighting quality improvement. The source code and pre-trained models are available at https://github.com/PlanktonQAQ/SRENet

Abstract:
For the low-rank matrix recovery problem, algorithms that directly manipulate the low-rank matrix typically require computing the top singular values/vectors of the matrix and thus are computationally expensive. Matrix factorization is a computationally efficient nonconvex approach for low-rank matrix recovery, utilizing an alternating minimization or a gradient descent algorithm, and its theoretical properties have been investigated in recent years. However, the behavior of the factorization-based matrix recovery problem in the decentralized setting is still unknown when data are distributed on multiple nodes. In this paper, we consider the distributed gradient descent algorithm and establish its (local) linear convergence up to the approximation error. Numerical results are also presented to illustrate the convergence of the algorithm over a general network.

Abstract:
Multi-part portrait customization aims to generate realistic human images by assembling specified body parts from multiple reference images, with significant applications in digital human creation. Existing customization methods typically follow two approaches: 1) test-time fine-tuning, which learns concepts effectively but is time-consuming and struggles with multi-part composition; 2) generalizable feed-forward methods, which offer efficiency but lack fine control over appearance specifics. To address these limitations, we present Parts2Whole, a diffusion-based generalizable portrait generator that harmoniously integrates multiple reference parts into high-fidelity human images by our proposed multi-reference mechanism. To adequately characterize each part, we propose a detail-aware appearance encoder, which is initialized and inherits powerful image priors from the pre-trained denoising U-Net, enabling the encoding of detailed information from reference images. The extracted features are incorporated into the denoising U-Net by a shared self-attention mechanism, enhanced by mask information for precise part selection. Additionally, we integrate pose map conditioning to control the target posture of generated portraits, facilitating more flexible customization. Extensive experiments demonstrate the superiority of our approach over existing methods and applicability to related tasks like pose transfer and pose-guided human image generation, showcasing its versatile conditioning. Our project is available at https://huanngzh.github.io/Parts2Whole/

Abstract:
Generative diffusion models can serve as priors, ensuring that image restoration solutions adhere to natural image manifolds. For facial images, however, personalized priors are essential to accurately reconstruct individual-specific facial features. We propose Dual-Pivot Tuning — a simple yet effective two-stage approach to personalize blind restoration systems while preserving general prior integrity. Our key observation is that for efficient personalization, the diffusion model should be tuned around a fixed textual pivot in the first step, while in the second step a guiding network should be tuned in a generic (non-personalized) manner, using the personalized diffusion model as a fixed “pivot”. This approach ensures that personalization does not interfere with the restoration process, producing results with a natural appearance that show high fidelity to both identity and degraded image attributes. We conducted extensive experiments with images of widely recognized individuals, evaluating our approach both qualitatively and quantitatively against relevant baselines. Notably, our personalized prior not only achieves superior identity fidelity, but also outperforms state-of-the-art generic priors in terms of overall image quality. Project webpage is https://personalized-restoration.github.io/ and code is available at https://github.com/personalized-restoration/personalized-restoration

Abstract:
In real-world scenarios, data usually appears in a streaming fashion. To achieve remarkable retrieval performance in such scenarios, online multi-modal hashing has drawn great research attention due to its high retrieval speed and low storage cost. However, existing online multi-modal hashing methods still fail to achieve satisfactory retrieval performance in scenarios where the new streaming data points all belong to new classes. Therefore, to further improve the retrieval performance in these scenarios, we propose a novel Prospective Layout-Guided Multi-modal Online Hashing, termed PLG-MOH. Specifically, PLG-MOH first establishes the layout of the Hamming space by generating a series of hashing centers to split the space. Each hashing center is gradually assigned to a newly appearing class, and these assigned centers correspond one-to-one with the classes. Moreover, we propose a novel prospective layout-guided loss, which leverages all the hashing centers, including those not yet assigned to classes, to supervise the training of the hashing model. Because the unassigned hashing centers will be designated to classes that emerge in the future, PLG-MOH already accounts, during each round of training, for the forthcoming data from new classes in future rounds. Consequently, PLG-MOH can effectively adapt its hashing functions to address newly arriving samples and learn semantic similarity-preserved hash codes for them, while effectively retaining the information learned from the old data. Extensive experiments on two public datasets demonstrate that the proposed PLG-MOH achieves better retrieval performance than state-of-the-art baselines in online scenarios.
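
The idea of fixing a layout of hashing centers up front and handing them out as classes appear can be sketched as follows. The abstract does not say how the centers are generated or what form the loss takes, so the Sylvester Hadamard construction (a common way to obtain well-separated binary codes), the class name CenterLayout, and the first-come assignment rule are all assumptions for illustration.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


class CenterLayout:
    """Pre-generated hashing centers that are assigned as new classes appear."""

    def __init__(self, code_length):
        self.centers = hadamard(code_length)   # every center exists up front
        self.assigned = {}                     # class label -> center index

    def center_for(self, label):
        if label not in self.assigned:
            if len(self.assigned) >= len(self.centers):
                raise ValueError("no unassigned hashing center left")
            self.assigned[label] = len(self.assigned)
        return self.centers[self.assigned[label]]

    def unassigned(self):
        """Centers reserved for classes that have not appeared yet."""
        return self.centers[len(self.assigned):]


layout = CenterLayout(code_length=8)
cat_center = layout.center_for("cat")   # class seen in an early round
dog_center = layout.center_for("dog")   # class that first appears later
reserved = layout.unassigned()
```

Any two distinct rows of a Hadamard matrix differ in exactly half of their positions, so the centers stay mutually well separated no matter how many classes eventually arrive. A prospective loss in the spirit of the abstract would also push codes away from the reserved centers, not only from the assigned ones.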

Abstract:
Multimodal prompt learning has emerged as an effective strategy for adapting vision-language models such as CLIP to downstream tasks. However, conventional approaches typically operate at the input level, forcing learned prompts to propagate through a sequence of frozen Transformer layers. This indirect adaptation introduces cumulative geometric distortions, a limitation that we formalize as the indirect learning dilemma (ILD), leading to overfitting of the base class and reduced generalization to novel classes. To overcome this challenge, we propose the Multimodal Self-Attention Prompt (MSP) framework, which shifts adaptation into the semantic core of the model by injecting learnable prompts directly into the key and value sequences of attention blocks. This direct modulation preserves the pretrained embedding geometry while enabling more precise downstream adaptation. MSP further incorporates distance-aware optimization to maintain semantic consistency with CLIP’s original representation space, and partial prompt learning via stochastic dimension masking to improve robustness and prevent over-specialization. Extensive evaluations across 11 benchmarks demonstrate the effectiveness of MSP. It achieves a state-of-the-art harmonic mean accuracy of 80.67%, with 77.32% accuracy on novel classes—representing a 2.18% absolute improvement over prior methods—while requiring only 0.11M learnable parameters. Notably, MSP surpasses CLIP’s zero-shot performance on 10 out of 11 datasets, establishing a new paradigm for efficient and generalizable prompt-based adaptation. Our implementation is available at https://github.com/laixinyi023/Multimodal-Self-Attention-Prompt
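
Injecting prompts into the key and value sequences, rather than into the input tokens, can be sketched with a single-head attention in plain Python. This is a minimal illustration under assumptions: the function names are hypothetical, there is one head and no projection matrices, and the distance-aware optimization and dimension masking mentioned in the abstract are not modeled.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def prompted_attention(queries, keys, values, prompt_k, prompt_v):
    # Learnable prompt vectors are prepended to K and V only. The queries are
    # untouched, so the output keeps one vector per original token.
    return attention(queries, prompt_k + keys, prompt_v + values)


queries = [[1.0, 0.0]]
keys = [[0.0, 1.0]]
values = [[0.0, 0.0]]
plain = attention(queries, keys, values)
prompted = prompted_attention(queries, keys, values,
                              prompt_k=[[10.0, 0.0]], prompt_v=[[1.0, 1.0]])
```

Because only K and V grow, the sequence length seen by later layers does not change, which is one reason this style of prompt can be added to a frozen block without altering its interface.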

Abstract:
Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing the “distribution shift” issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source domain model and target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), e.g., CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method, Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at https://github.com/NUDT-Bai-Group/COLA-TTA

Abstract:
The modulation transfer function tailored image filter (MTF-TIF) has long been regarded as the optimal filter for multispectral image pansharpening. It excels at simulating the camera’s frequency response, thereby capturing finer image details and significantly improving pansharpening performance. However, we are skeptical about whether the pre-measured MTF is sufficient to describe the characteristics of the actually acquired panchromatic (PAN) and multispectral (MSI) images. For example, any image resampling operation in geometric correction or image registration inevitably changes the sharpness of the acquired PAN and MSI, and the processed images no longer conform to the camera’s MTF. Further, in deep learning (DL) methods that follow the Wald protocol, using MTF-TIF to downsample images when constructing training data does not satisfy the generalization consistency between training and testing. To prove our point, we propose a pair of symmetric DL-based frameworks in this paper to find better image filters suitable for both traditional and DL pansharpening methods. We embed two learnable filters into the frameworks to simulate the optimal image filter, namely an anisotropic Gaussian image filter and an arbitrary image filter. Further, the proposed frameworks can capture subtle offsets between images and maintain the smoothness of the global deformation field. Extensive experiments on various satellite datasets demonstrate that the proposed frameworks can find better image filters than MTF-TIFs, achieving better pansharpening performance with stronger generalization ability.

Abstract:
In real-world scenarios, the number of training samples across classes usually follows a long-tailed distribution. A conventionally trained network may achieve unexpectedly inferior performance on the rare class compared to the frequent class. Most previous works attempt to rectify the network bias at the data level or at the classifier level. In contrast, in this paper, we identify that the bias towards the frequent class may be encoded into features, i.e., the rare-specific features which play a key role in discriminating the rare class are much weaker than the frequent-specific features. Based on this observation, we introduce a simple yet effective approach, normalizing the parameters of the Batch Normalization (BN) layer to explicitly rectify the feature bias. To this end, we represent the Weight/Bias parameters of a BN layer as a vector, normalize it into a unit vector, and multiply the unit vector by a scalar learnable parameter. By decoupling the direction and magnitude of the parameters in the BN layer during learning, the Weight/Bias exhibits a more balanced distribution and thus the strength of features becomes more even. Extensive experiments on various long-tailed recognition benchmarks (i.e., CIFAR-10/100-LT, ImageNet-LT and iNaturalist 2018) show that our method remarkably outperforms previous state-of-the-art methods.
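
The reparameterization described above (unit direction times a learnable scalar) can be written in a few lines. This is a minimal sketch, assuming an L2 norm and a small `eps` for numerical safety; the function name and the `eps` value are illustrative choices, not taken from the paper.

```python
import math


def normalized_bn_params(raw, scale, eps=1e-12):
    """Return scale * raw / ||raw||: direction from `raw`, magnitude from `scale`.

    `raw` stands for the per-channel Weight (or Bias) vector of a BN layer and
    `scale` for the single learnable scalar; both names are hypothetical.
    """
    norm = math.sqrt(sum(w * w for w in raw))
    return [scale * w / (norm + eps) for w in raw]


weights = normalized_bn_params([3.0, 4.0], scale=2.0)
rescaled = normalized_bn_params([30.0, 40.0], scale=2.0)
```

Multiplying `raw` by any positive constant leaves the result unchanged, so the overall magnitude is governed by `scale` alone while the raw vector only sets the relative strength of the channels.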

Abstract:
In this paper, we address the challenge of significant memory consumption and redundant components in large-scale voxel-based models, which are commonly encountered in real-world 3D reconstruction scenarios. We propose a novel method called Shell-guided compression of Voxel Radiance Fields (SVRF), aimed at optimizing voxel-based models into a shell-like structure to reduce storage costs while maintaining rendering accuracy. Specifically, we first introduce a Shell-like Constraint, operating in two main aspects: 1) enhancing the influence of voxels neighboring the surface in determining the rendering outcomes, and 2) expediting the elimination of redundant voxels both inside and outside the surface. Additionally, we introduce Adaptive Thresholds to ensure appropriate pruning criteria for different scenes. To prevent the erroneous removal of essential object parts, we further employ a Dynamic Pruning Strategy to conduct smooth and precise model pruning during training. The proposed compression method does not require additional labels; it only requires the guidance of self-supervised learning based on predicted depth. Furthermore, it can be seamlessly integrated into any voxel-grid-based method. Extensive experimental results demonstrate that our method achieves comparable rendering quality while compressing the original number of voxel grids by more than 70%. Our code will be available at: https://github.com/eezkni/SVRF

Abstract:
Recent studies have revealed that deep neural networks (DNNs) are susceptible to backdoor attacks, in which attackers insert a pre-defined backdoor into a DNN model by poisoning a few training samples. A small subset of neurons in DNN is responsible for activating this backdoor and pruning these backdoor-associated neurons has been shown to mitigate the impact of such attacks. Current neuron pruning techniques often face challenges in accurately identifying these critical neurons, and they typically depend on the availability of labeled clean data, which is not always feasible. To address these challenges, we propose a novel defense strategy called Contrastive Neuron Pruning (CNP). This approach is based on the observation that poisoned samples tend to cluster together and are distinguishable from benign samples in the feature space of a backdoored model. Given a backdoored model, we initially apply a reversed trigger to benign samples, generating multiple positive (benign-benign) and negative (benign-poisoned) feature pairs from the backdoored model. We then employ contrastive learning on these pairs to improve the separation between benign and poisoned features. Subsequently, we identify and prune neurons in the Batch Normalization layers that show significant response differences to the generated pairs. By removing these backdoor-associated neurons, CNP effectively defends against backdoor attacks while requiring the pruning of only about 1% of the total neurons. Comprehensive experiments conducted on various benchmarks validate the efficacy of CNP, demonstrating its robustness and effectiveness in mitigating backdoor attacks compared to existing methods.
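
The final pruning step, selecting a small fraction of BN channels whose responses differ most between benign and trigger-stamped inputs, can be sketched as follows. This is a simplified illustration under stated assumptions: the response difference is taken as the gap between mean activations, the contrastive learning stage is omitted, and all function names are hypothetical.

```python
def mean_activation(acts, channel):
    """Average activation of one channel over a list of samples."""
    return sum(sample[channel] for sample in acts) / len(acts)


def select_channels_to_prune(benign_acts, triggered_acts, ratio=0.01):
    """Rank channels by |mean(triggered) - mean(benign)| and keep the top share."""
    n_channels = len(benign_acts[0])
    gaps = [abs(mean_activation(triggered_acts, c) - mean_activation(benign_acts, c))
            for c in range(n_channels)]
    k = max(1, round(ratio * n_channels))  # always prune at least one channel
    ranked = sorted(range(n_channels), key=lambda c: gaps[c], reverse=True)
    return sorted(ranked[:k])


def prune_channels(bn_weight, bn_bias, channels):
    """Zero the BN affine parameters of the chosen channels (returns new lists)."""
    chosen = set(channels)
    weight = [0.0 if i in chosen else w for i, w in enumerate(bn_weight)]
    bias = [0.0 if i in chosen else b for i, b in enumerate(bn_bias)]
    return weight, bias


benign = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
triggered = [[1.0, 5.0, 1.0, 1.0], [1.0, 7.0, 1.0, 1.0]]
to_prune = select_channels_to_prune(benign, triggered)
new_weight, new_bias = prune_channels([1.0] * 4, [0.5] * 4, to_prune)
```

The default `ratio=0.01` mirrors the "about 1% of the total neurons" figure in the abstract; how the actual method scores and thresholds neurons is not stated there.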

Abstract:
Large-scale multi-view clustering for image data has achieved impressive clustering performance and efficiency. However, most methods lack interpretability in clustering and do not fully consider the complementarity of distributions between different views. To address these problems, we introduce Multi-View Clustering with Transition Probabilities Learning (MVC-TPL). Specifically, we construct an anchor graph factorization model from the perspective of transition probabilities, while simultaneously learning transition probability matrices from samples to clusters and from anchor points to clusters, serving as soft label matrices for samples and anchor points, respectively. This model enables one-step label acquisition and provides the model with a sound probability interpretation. Moreover, since the clusters of samples and anchor points should be consistent across all views, we employ Schatten p-norm regularization on the two matrices, effectively mining the complementary information distributed among the views, thereby aligning the labels across views more consistently. Comprehensive testing on four small-scale datasets and three large-scale datasets confirms the effectiveness of this model.

Abstract:
Self-supervised visual pre-training models have achieved significant success without employing expensive annotations. Nevertheless, most of these models focus on iconic single-instance datasets (e.g., ImageNet), ignoring the insufficient discriminative representation for non-iconic multi-instance datasets (e.g., COCO). In this paper, we propose a novel Object Adaptive Dense Pre-training (OADP) method to learn the visual representation directly on multi-instance datasets (e.g., PASCAL VOC and COCO) for dense prediction tasks (e.g., object detection and instance segmentation). We present a novel object-aware and learning-adaptive random view augmentation to focus the contrastive learning to enhance the discrimination of object presentations from large to small scale during different learning stages. Furthermore, the representations across different scales and resolutions are integrated so that the method can learn diverse representations. In the experiments, we evaluated OADP pre-trained on PASCAL VOC and COCO. Results show that our method performs better than most existing state-of-the-art methods when transferring to various downstream tasks, including image classification, object detection, instance segmentation and semantic segmentation.

Abstract:
A new approach for occlusion-robust 3D human mesh reconstruction from a single image is introduced in this paper. Since occlusion has emerged as a major problem to be resolved in this field, there have been meaningful efforts to deal with various types of occlusions (e.g., person-to-person occlusion, person-to-object occlusion, self-occlusion, etc.). Although many recent studies have shown remarkable progress, previous regression-based methods are still limited in handling occlusion problems due to the lack of appearance information. To address this problem, we propose a novel method for human mesh reconstruction based on pose-relevant subspace analysis. Specifically, we first generate a set of eigenvectors, so-called eigenposes, by conducting the singular value decomposition (SVD) of the pose matrix, which contains diverse poses sampled from the training set. These eigenposes are then linearly combined to construct a target body pose according to fusing coefficients, which are learned through the proposed network. Such a combination of principal body postures (i.e., eigenposes) in a global manner greatly helps to cope with partial ambiguities caused by occlusions. Furthermore, we also propose to exploit a joint injection module that efficiently incorporates the spatial information of visible joints into the encoded feature during the estimation process of fusing coefficients. Experimental results on benchmark datasets demonstrate the ability of the proposed method to robustly reconstruct the human mesh under various occlusions occurring in real-world scenarios. The code and model are publicly available at: https://github.com/DCVL-3D/Eigenpose_release.

Abstract:
Ellipse detection is of great significance in the fields of image processing and computer vision. Accurate, stable and direct ellipse detection in real-world images has always been a key issue. Therefore, an ellipse detection method is proposed on the basis of the constructed three-intersection-chord-invariant. First, in the inflexion point detection, the PCA minimum bounding box considering the distribution characteristics of edge points is studied to achieve the more refined line segment screening. Second, a multi-scale inflexion point detection method is proposed to effectively avoid over-segmentation of small arc segments, providing assurance for more reasonable and reliable arc segment combinations. Then, the 20 precisely classified arc segment combinations are refined into 4 combinations. A number of non-homologous arc segment combinations can be quickly removed to reduce incorrect combinations by the constructed midpoint distance constraint and quadrant constraint. Moreover, in order to accurately reflect the strict arc segment combination constraints of geometric features of ellipses, a three-intersection-chord-invariant model of ellipses is established with strong constraint of relative distances among five constraint points, by which a more robust initial ellipse set of homologous arc segment combinations is further obtained. Finally, ellipse validation and clustering are performed on the initial set of ellipses to obtain the high-precision ellipses. The algorithm accuracy of the ellipse detection method is experimentally validated on 6 publicly available datasets and 2 established wheel rim datasets.

Abstract:
Color imaging algorithms - such as color correction, spectral estimation and color constancy - are developed and validated with spectral reflectance data. However, the choice of the reflectance data set - used in development and tuning - not only affects the results of these algorithms but it also changes the ranking of the different approaches. We propose that this fragility is because it is difficult to measure/sample enough data to statistically represent the large number of degrees of freedom apparent in spectral reflectances. In this paper, we propose that the space of reflectance data should not be sampled but, rather, integrated. Specifically, we advocate that the convex closure of a reflectance data set - all convex combinations of all spectra - should be used instead of discrete reflectance samples. To make the integration computation tractable, we approximate these convex closures by their enclosing hyper-cube in a privileged coordinate system. We use color correction as an exemplar color imaging problem to demonstrate the utility of our approach.

Abstract:
RGB-T tracking aims to effectively leverage the complementarity of visual (RGB) and infrared (TIR) modalities to achieve robust tracking performance in various scenarios. Existing RGB-T tracking methods typically adopt backbone networks pre-trained on large-scale RGB datasets, which can lead to a predisposition toward RGB image patterns. RGB and TIR modalities also exhibit inconsistent responses to regions with diverse properties, resulting in imbalances in tracking decisions. We refer to these issues as feature-level and decision-level biases in the TIR modality. In this paper, we propose a novel dual-level modality de-biasing framework for RGB-T tracking to eliminate the inherent feature-level and decision-level biases. Specifically, we propose a joint infrared-fusion adapter, comprising an infrared-aware adapter and a cross-fusion adapter, designed to adaptively mitigate feature-level biases and utilize complementary information between the two modalities. In addition to implicit feature-level adjustment, we propose a response-decoupled distillation strategy to explicitly alleviate decision-level biases, aiming to achieve consistently accurate decision-making between the RGB and TIR modalities. Extensive experiments on several popular RGB-T tracking benchmarks validate the effectiveness of our proposed method.

Abstract:
Cross-modal hashing is a highly effective technique for searching relevant data across different modalities, owing to its low storage costs and fast similarity retrieval capability. While significant progress has been achieved in this area, prior investigations predominantly concentrate on a one-to-one feature alignment approach, where a singular feature is derived for similarity retrieval. However, the singular feature in these methods fails to adequately capture the varied multi-instance information inherent in the original data across disparate modalities. Consequently, the conventional one-to-one methodology is plagued by a semantic mismatch issue, as the rigid one-to-one alignment inhibits effective multi-instance matching. To address this issue, we propose a novel Diverse Instances Matching for Cross-modal Hashing (DIMCH), which explores the relevance between multiple instances in different modalities using a multi-instance learning algorithm. Specifically, we design a novel diverse instances learning module to extract a multi-feature set, which enables our model to capture detailed multi-instance semantics. To evaluate the similarity between two multi-feature sets, we adopt the smooth chamfer distance function, which enables our model to incorporate the conventional similarity retrieval structure. Moreover, to sufficiently exploit the supervised information from the semantic label, we adopt the weight cosine triplet loss as the objective function, which incorporates the multilevel similarity among the multi-labels into the training procedure and enables the model to mine the multi-label correlation effectively. Extensive experiments demonstrate that our diverse hashing embedding method achieves state-of-the-art performance in supervised cross-modal hashing retrieval tasks.

Abstract:
Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the predicate-level imbalance that high-frequency classes dominate predictions of rare ones, while overlooking the concept-level imbalance. Actually, even if predicates themselves are balanced, there is still a significant concept-imbalance within them due to the long-tailed distribution of contexts (i.e., subject-object combinations). This concept-level imbalance poses a more pervasive and challenging issue compared to the predicate-level imbalance since subject-object pairs are inherently complex in combinations. To address the issue, we propose Multi-Concept Learning (MCL), a novel concept-level balanced learning framework orthogonal to existing SGG methods. MCL first quantifies the concept-level imbalance across predicates in terms of different amounts of concepts, representing as multiple concept-prototypes within the same class. Then, to achieve balanced learning across different concepts (i.e., concept-prototypes), we introduce the Concept-based Balanced Memory (CBM), which guides SGG models in generating balanced representations for concept-prototypes. Furthermore, the Concept Regularization (CR) technique is proposed to effectively help models in aligning relation features to their corresponding concept-prototypes, thereby generating concept-level compact and predicate-level distinctive representations for robust relation recognition. Finally, we introduce a novel metric, mean Context Recall (mCR@K), as a complement to mean Recall (mR@K), to evaluate the model’s performance across concepts (determined by contexts) within the same predicate. Extensive experiments demonstrate the remarkable efficacy of our model-agnostic strategy in enhancing the performance of benchmark models on both VG-SGG and OI-SGG datasets, leading to new state-of-the-art achievements in two key aspects: predicate-level unbiased relation recognition and concept-level compositional generability. Code is available at https://github.com/XinyuLyu/G-USGG.

Affiliations: National Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Institutes of Physical Science and Information Technology, Anhui University, Hefei, China; Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China; School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China; Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, and the School of Computer Science and Technology, Anhui University, Hefei, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract:
Domain generalization (DG) aims to solve the problem of significant performance degradation when target domain data are out-of-distribution (OOD). Previous efforts try to exploit invariant features in the source domain through CNNs. However, inspired by causal mechanisms, we find that complex spurious-invariant information is still hidden in these view-invariant features, and the impact of domain and class discrepancies on extracting invariance has not been effectively mitigated. To alleviate these issues, we propose a self-weighted multi-view mining invariance domain generalization framework (SMIDG). On the one hand, to make up for the insufficiency of traditional single-view convolutional feature extraction networks, we propose to mine features from another frequency view and use self-adaptive adversarial masks to eliminate some spurious correlations, ensuring causal invariance in the coarse-grained generalization. However, due to inconsistencies in discriminative information between inter-domain and intra-domain samples, as well as inter-class and intra-class samples, the coarse-grained elimination of spurious associations does not fully resolve this issue. On the other hand, we also consider the fine-grained generalization from two aspects. Firstly, to better tackle the domain discrepancies, we propose a novel progressive contrastive learning strategy that learns the underlying specific features of samples while gradually mitigating domain discrepancies, thereby ensuring domain invariance in fine-grained generalization. Secondly, due to the issue of feature inconsistency, we adopt a self-adaptive hard sample mining method with information gain to ensure that the model pays more attention to hard disentangled samples, thus maintaining feature invariance. Extensive experiments on five benchmark datasets demonstrate that our method outperforms state-of-the-art approaches. Our code is available at https://github.com/bihhm/SMIDG

Affiliations: Future Media Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Sichuan Artificial Intelligence Research Institute, Yibin, China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Chengdu, China; School of Computer Science and Technology, Tongji University, Shanghai, China; School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China

Abstract:
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Furthermore, an Inter-Consistency Loss (ICL) is devised to accomplish the concept alignment between the visual query and corresponding textual query, and an Intra-Diversity Loss (IDL) is developed to repulse the distribution within visual (textual) queries to generate more discriminative concepts. Extensive experiments on five widely used benchmarks (i.e., MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet) substantiate the superior effectiveness and efficiency of the proposed method. Remarkably, our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost. Code is available at: https://github.com/zchoi/GLSCL

Abstract:
Edge detection is frequently employed to support downstream visual tasks. However, current edge detection methods still encounter two significant challenges: extracting complex textured targets and capturing valuable information from complex backgrounds. We propose FFED, a flow field-guided edge detection model. FFED incorporates three designed components: the Feature Broadcast Module (FBM), the Antagonistic Bio-inspired Spatial Attention Module (ABSAM), and a novel pixel difference convolution named ALS. The FBM serves as an implementation mode of the flow field, with its input pair selection strategy inspired by video processing. The FBM broadcasts high-level semantic features to high-resolution ones, preserving more meaningful texture details. Inspired by biological studies, we propose the ABSAM, which extracts valuable information from complex backgrounds by optimizing spatial modeling of data. The ALS exhibits enhanced capability in extracting gradient information and capturing subtle texture details that are easily overlooked. Experimental results demonstrate that FFED achieves competitive detection results on the NYUD, BSDS500, and BIPED datasets, as well as good performance on industrial datasets. Additionally, the experiments verify the auxiliary effect of FFED on downstream visual tasks. The code is available at https://github.com/hanyuchen2022/Flow-field-guided-edge-detection-FFED-.

Abstract:
Multitemporal hyperspectral image unmixing (MTHU) holds significant importance in monitoring and analyzing the dynamic changes of surface. However, compared to single-temporal unmixing, the multitemporal approach demands comprehensive consideration of information across different phases, rendering it a greater challenge. To address this challenge, we propose the Multitemporal Hyperspectral Image Unmixing Transformer (MUFormer), an end-to-end unsupervised deep learning model. To effectively perform multitemporal hyperspectral image unmixing, we introduce two key modules: the Global Awareness Module (GAM) and the Change Enhancement Module (CEM). The GAM computes self-attention across all phases, facilitating global weight allocation. On the other hand, the CEM dynamically learns local temporal changes by capturing differences between adjacent feature maps. The integration of these modules enables the effective capture of multitemporal semantic information related to endmember and abundance changes, significantly improving the performance of multitemporal hyperspectral image unmixing. We conducted experiments on one real dataset and two synthetic datasets, demonstrating that our model significantly enhances the effect of multitemporal hyperspectral image unmixing.

Abstract:
Despite the photorealistic novel view synthesis (NVS) performance achieved by the original 3D Gaussian splatting (3DGS), its rendering quality significantly degrades with sparse input views. This performance drop is mainly caused by the limited number of initial points generated from the sparse input, the lack of reliable geometric supervision during the training process, and inadequate regularization of the oversized Gaussian ellipsoids. To handle these issues, we propose LoopSparseGS, a loop-based 3DGS framework for the sparse novel view synthesis task. Specifically, we propose a loop-based Progressive Gaussian Initialization (PGI) strategy that iteratively densifies the initialized point cloud using the rendered pseudo images during the training process. Then, the sparse and reliable depth from Structure from Motion and the window-based dense monocular depth are leveraged to provide precise geometric supervision via the proposed Depth-alignment Regularization (DAR). Additionally, we introduce a novel Sparse-friendly Sampling (SFS) strategy to handle oversized Gaussian ellipsoids that lead to large pixel errors. Comprehensive experiments on four datasets demonstrate that LoopSparseGS outperforms existing state-of-the-art methods for sparse-input novel view synthesis, across indoor, outdoor, and object-level scenes with various image resolutions. Code is available at: https://github.com/pcl3dv/LoopSparseGS

Abstract:
Beyond the exploration of traditional spatial, temporal and subjective visual signal redundancy in image and video compression, recent research has focused on leveraging cross-color component redundancy to enhance coding efficiency. Cross-component coding approaches are motivated by the statistical correlations among different color components, such as those in the Y’CbCr color space, where luma (Y) color component typically exhibits finer details than chroma (Cb/Cr) color components. Inspired by previous cross-component coding algorithms, this paper introduces a novel in-loop filtering approach named Cross-Component Sample Offset (CCSO). CCSO utilizes co-located and neighboring luma samples to generate correction signals for both luma and chroma reconstructed samples. It is a multiplication-free, non-linear mapping process implemented using a look-up-table. The input to the mapping is a group of reconstructed luma samples, and the output is an offset value applied on the center luma or co-located chroma sample. Experimental results demonstrate that the proposed CCSO can be applied to both image and video coding, resulting in improved coding efficiency and visual quality. The method has been adopted into an experimental next-generation video codec beyond AV1 developed by the Alliance for Open Media (AOMedia), demonstrating average -0.81% and -0.69% coding gain on PSNR and VMAF quality metric, respectively, under random access configuration. Additionally, CCSO notably improves the subjective visual quality.
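
A look-up-table mapping from a group of luma samples to an offset can be sketched as below. The abstract does not specify which samples are used or how the table is indexed; the two-neighbor pattern, the three-level quantization of luma differences, the `step` threshold, and the table values are all assumptions made for illustration, and the function names are hypothetical.

```python
def quantize(delta, step):
    """Map a luma difference to one of three levels using comparisons only."""
    if delta < -step:
        return 0
    if delta > step:
        return 2
    return 1


def ccso_offset(center_luma, neighbor_a, neighbor_b, lut, step):
    """Look up an offset from the quantized neighbor-minus-center differences."""
    qa = quantize(neighbor_a - center_luma, step)
    qb = quantize(neighbor_b - center_luma, step)
    return lut[qa][qb]  # nested table lookup, no multiplication involved


def apply_offset(sample, offset, bit_depth=8):
    """Add the offset to a reconstructed sample and clip to the valid range."""
    return min(max(sample + offset, 0), (1 << bit_depth) - 1)


# Made-up 3x3 table of offsets, indexed by the two quantized differences.
lut = [[-2, -1, 0],
       [-1, 0, 1],
       [0, 1, 2]]

offset = ccso_offset(center_luma=100, neighbor_a=120, neighbor_b=100, lut=lut, step=8)
corrected_chroma = apply_offset(60, offset)
clipped_chroma = apply_offset(255, offset)
```

The mapping is non-linear because the quantizer is a step function, and it needs only subtractions, comparisons, and table reads, which matches the multiplication-free property stated in the abstract.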

Abstract:
Recognizing social relations from images is crucial for improving machine perception of social interactions. Current studies mainly focus on exploring single-type relation reasoning frameworks, such as the relation between father, mother and son in a family. However, real-world scenarios often involve complex hybrid relations, such as friendships and professional relations, which pose a challenge for current methods due to the difficulty of establishing robust logical connections between these relations. In fact, in this hybrid social relation recognition setting, the interactions extend beyond dyadic to multipartite structures. To effectively explore these multipartite interactions, we propose a novel Hypergraph Mamba (HGM) framework. Specifically, we construct two hypergraphs, i.e., Person-Person Hypergraphs (PPH) and Person-Object Hypergraphs (POH), to model these high-order multipartite interactions. The HGM module performs social relation reasoning within these hypergraph structures, which includes a Vertex Selection Algorithm to mitigate inference confusion by filtering out confounders, and a Vertex Interaction Operator to find optimal global vertex neighborhoods by capturing long-range vertex dependencies. In addition, a Multilevel Transformer is proposed to adaptively align the PPH and POH inferred knowledge and visual signals to facilitate information fusion. We validate the effectiveness of our proposed HGM model on several public datasets and perform extensive ablation studies to elucidate the reasons contributing to its superior performance. Experimental results indicate that our HGM model achieves superior accuracy in predicting social relations compared to the state-of-the-art methods. Codes and datasets are available at: https://github.com/tw-repository/HGM-SRR

Abstract:
Image denoising is an appealing and challenging task, in that noise statistics of real-world observations may vary with local image contents and different image channels. Specifically, the green channel usually has twice the sampling rate in raw data. To handle noise variances and leverage such channel-wise prior information, we propose a simple and effective green channel prior-based image denoising (GCP-ID) method, which integrates GCP into the classic patch-based denoising framework. Briefly, we exploit the green channel to guide the search for similar patches, which aims to improve the patch grouping quality and encourage sparsity in the transform domain. The grouped image patches are then reformulated into RGGB arrays to explicitly characterize the density of green samples. Furthermore, to enhance the adaptivity of GCP-ID to various image contents, we cast the noise estimation problem into a classification task and train an effective estimator based on convolutional neural networks (CNNs). Experiments on real-world datasets demonstrate the competitive performance of the proposed GCP-ID method for image and video denoising applications in both raw and sRGB spaces. Our code is available at https://github.com/ZhaomingKong/GCP-ID

Abstract:
In many image processing tasks, e.g., 3D reconstruction of dynamic scenes, different types of descriptions of an object, a.k.a. views, emerge in a streaming way. Streaming view learning provides an effective solution to this dynamic view problem. Existing streaming view learning methods typically assume that all labels are accurate. However, in many real-world applications, the initial views may not be good enough to characterize the object, leading to noisy labels that degrade classification performance. How to learn a model that handles view evolution and label ambiguity simultaneously is critical yet unexplored. In this paper, we propose a novel method called Streaming View Classification with Noisy Label (SVCNL). We calibrate noisy labels as new views emerge, thereby reflecting the dynamic changes in the data more accurately. Leveraging the sequential and non-revisitable nature of views, the method tunes existing models to inherit information from previous stages by utilizing current-stage data. It reconstructs noisy labels through a label transition matrix and establishes relationships between true labels and samples using a graph embedding strategy, progressively correcting noisy labels. Together with theoretical analyses of generalization bounds, extensive experiments demonstrate the effectiveness of the proposed approach.

Abstract:
Color grading, as a crucial step in film post-production, plays an important role in emotional expression and artistic enhancement. Recently, a geometric palette-based approach to video recoloring has been introduced with impressive results. It offers an intuitive interface that allows users to alter the color of a video by manipulating a limited set of representative colors. However, this method has two primary limitations. Firstly, palette extraction is computationally expensive, often taking more than one hour to generate palettes even for medium-length videos, which significantly limits the practical application of color editing for longer videos. Secondly, the palette colors are less representative, and some primary colors may be omitted from the resulting palettes during topological simplification, making color editing less intuitive. To overcome these limitations, in this paper, we propose a novel approach to video recoloring. The core of our method is a set of Bézier curves that connect the dominant colors throughout the input video. By slicing these Bézier curves in RGBT space, per-frame palettes can be naturally derived. During recoloring, users can select several frames of interest and modify their corresponding palettes to change the color of the video. Our method is simple and intuitive, enabling compelling time-varying recoloring results. Compared to existing methods, our approach is more efficient in palette extraction and can effectively capture the dominant colors of the video. Extensive experiments demonstrate the effectiveness of our method.
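
The idea of slicing Bézier curves in RGBT space to obtain a per-frame palette can be sketched as follows. This is an illustration under assumptions that the abstract does not state: the curves are taken to be cubic, time is normalized to [0, 1], and the time coordinate is assumed to increase monotonically along each curve so that bisection finds the crossing. The names `bezier_point`, `slice_curve` and `palette_at` are hypothetical.

```python
def bezier_point(ctrl, u):
    """Evaluate a cubic Bezier curve with 4-D control points (r, g, b, t) at u in [0, 1]."""
    w = ((1 - u) ** 3, 3 * u * (1 - u) ** 2, 3 * u ** 2 * (1 - u), u ** 3)
    return tuple(sum(wi * p[k] for wi, p in zip(w, ctrl)) for k in range(4))


def slice_curve(ctrl, t, iters=50):
    """Return the (r, g, b) color where the curve crosses time t.

    Assumes the time coordinate increases monotonically along the curve,
    so bisection on the curve parameter u is sufficient.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bezier_point(ctrl, mid)[3] < t:
            lo = mid
        else:
            hi = mid
    return bezier_point(ctrl, 0.5 * (lo + hi))[:3]


def palette_at(curves, t):
    """Per-frame palette: one color per curve, obtained by slicing at time t."""
    return [slice_curve(c, t) for c in curves]


# Two toy curves: a red-to-blue transition and a constant green.
red_to_blue = [(1.0, 0.0, 0.0, 0.0), (2 / 3, 0.0, 1 / 3, 1 / 3),
               (1 / 3, 0.0, 2 / 3, 2 / 3), (0.0, 0.0, 1.0, 1.0)]
constant_green = [(0.0, 1.0, 0.0, 0.0), (0.0, 1.0, 0.0, 1 / 3),
                  (0.0, 1.0, 0.0, 2 / 3), (0.0, 1.0, 0.0, 1.0)]

palette = palette_at([red_to_blue, constant_green], t=0.5)
print(palette)
```

Editing a palette at a chosen frame would then amount to moving control points, after which every other frame's palette follows from the same slicing step.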

Abstract:
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs. FRANCK comprises four key components: 1) an Objectness Score-based Sample Reweighting (OSSR) module that computes attention-based objectness scores on multi-scale encoder feature maps, reweighting the detection loss to emphasize less-recognized regions; 2) a Contrastive Learning with Matching-based Memory Bank (CMMB) module that integrates multi-level features into memory banks, enhancing class-wise contrastive learning; 3) an Uncertainty-weighted Query-fused Feature Distillation (UQFD) module that improves feature distillation through prediction quality reweighting and query feature fusion; and 4) an improved self-training pipeline with a Dynamic Teacher Updating Interval (DTUI) that optimizes pseudo-label quality. By leveraging these components, FRANCK effectively adapts a source-pre-trained DETR model to a target domain with enhanced robustness and generalization. Extensive experiments on several widely used benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting its effectiveness and compatibility with DETR-based SFOD models.

Abstract:
We consider the problem of client-server localization, where edge device users communicate visual data with the service provider to localize themselves against a pre-built 3D map. This localization paradigm is a crucial component for location-based services in AR/VR or mobile applications, as it is not trivial to store large-scale 3D maps and process fast localization on resource-limited edge devices. Nevertheless, conventional client-server localization systems face numerous challenges in computational efficiency, robustness, and privacy preservation during data transmission. Our work aims to jointly solve these challenges with a localization pipeline based on event cameras. By using event cameras, our system consumes little energy and requires little memory bandwidth. Then, during localization, we propose applying event-to-image conversion and leveraging mature image-based localization, which achieves robustness even in low-light or fast-moving scenes. To further enhance privacy protection, we introduce privacy protection techniques at two levels. Network-level protection aims to hide the user’s entire view in private scenes using a novel split inference approach, while sensor-level protection aims to hide sensitive user details such as faces with lightweight filtering. Both methods incur only small client-side computation and localization performance loss, while significantly mitigating the feeling of insecurity as revealed in our user study. We thus project our method to serve as a building block for practical location-based services using event cameras.

Abstract:
This paper addresses the task of space-time video super-resolution (STVSR). Existing methods generally suffer from inaccurate motion estimation and motion compensation (MEMC) problems for large motions. Inspired by recent progress in physics-informed neural networks, we model the challenges of MEMC in STVSR as a mapping between two continuous function spaces. Specifically, our approach transforms independent low-resolution representations in the coarse-grained continuous function space into refined representations with enriched spatiotemporal details in the fine-grained continuous function space. To achieve efficient and accurate MEMC, we design a Galerkin-type attention function to perform frame alignment and temporal interpolation. Due to the linear complexity of the Galerkin-type attention mechanism, our model avoids patch partitioning and offers global receptive fields, enabling precise estimation of large motions. The experimental results show that the proposed method surpasses state-of-the-art techniques in both fixed-size and continuous space-time video super-resolution tasks. Code is publicly available at the URL https://github.com/hahazh/STVSR-NO
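
The linear-complexity property mentioned in this abstract can be illustrated with one common softmax-free form of Galerkin-type attention, Q(KᵀV)/n. The sketch below is an assumption about the general technique, not the paper's exact layer: the normalization usually applied to K and V is omitted, and the helper names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def galerkin_attention(q, k, v):
    """Softmax-free attention computed as Q (K^T V) / n.

    q, k, v are n x d matrices. K^T V is only d x d, so the cost is
    O(n d^2) rather than the O(n^2 d) of forming an n x n attention map.
    """
    n = len(q)
    kv = matmul(transpose(k), v)   # d x d summary of all tokens
    out = matmul(q, kv)            # n x d
    return [[x / n for x in row] for row in out]


q = [[1, 0], [0, 1], [1, 1]]
k = [[1, 2], [3, 4], [5, 6]]
v = [[1, 0], [0, 1], [1, 1]]
out = galerkin_attention(q, k, v)
print(out)
```

Because no n x n map is ever materialized, every token can attend to all others without partitioning the input into patches, which is the property the abstract relies on for large motions.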

Abstract:
Establishing local semantic correspondences between medical images and their corresponding reports is crucial for effective medical vision-language pre-training. However, existing methods encounter two major challenges: (1) lesion regions in radiological images are often small, blurry, or lack clear boundaries, complicating accurate localization; and (2) medical reports typically contain redundant or non-diagnostic words, hindering precise semantic alignment. To overcome these issues, we propose MedAligner, a specialized local alignment network for medical vision-language pre-training. MedAligner employs dual encoders to extract both global and local representations and uses global contrastive learning to maintain coarse semantic consistency. To enhance local alignment, we introduce a Word-Region Alignment, which generates a learnable word-pixel similarity matrix that is sparsified to identify salient lesion regions accurately. Additionally, our Diagnostic Term Filtering dynamically samples high-importance diagnostic terms from reports, aligning them with identified lesion areas via a local contrastive loss. Importantly, we adopt a progressive training strategy that gradually refines both the input text and semantic alignment. This is achieved by reconstructing concise diagnostic reports and progressively updating word-pixel similarity, generating increasingly accurate image-text pairs. Extensive experiments demonstrate that MedAligner significantly surpasses existing approaches on tasks such as phrase grounding, image-text retrieval, and zero-shot classification, setting new benchmarks in medical vision-language pre-training.

Abstract:
Due to the substantial gap between vision and language modalities, along with the mismatch problem between fixed language descriptions and dynamic visual information, existing vision-language tracking methods exhibit performance on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model can obtain 55.0% AUC on LaSOT_EXT and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at https://github.com/zj5559/SAVLT

Abstract:
Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to the image-level contrastive learning and fully global feature interaction, ViT-based CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis of ViT-based CLIP reveals that anomaly tokens emerge during the forward process, attracting disproportionate attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to generate finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we mitigate the negative impact of anomaly tokens from two complementary perspectives. First, we explicitly identify the anomaly tokens and replace them based on local context. Second, we reduce their influence on normal tokens by enhancing feature discriminability and attention correlation, leveraging the inherent semantic consistency within CLIP’s mid-level features. In addition, we introduce a two-pass strategy that effectively integrates multi-level features to enrich local details under the training-free setting. Together, these strategies enhance CLIP’s feature representations with improved granularity and semantic coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Furthermore, we discuss our method’s applicability to other vision–language models and tasks for a comprehensive evaluation. Our source code is available at https://github.com/SuleBai/SC-CLIP

Abstract:
In the domain of image anomaly detection, significant progress has been made in unsupervised and self-supervised methods with datasets containing only normal samples. Although these methods perform well in general industrial anomaly detection scenarios, they often struggle with over- or under-detection when faced with fine-grained anomalies in products. In this paper, we propose GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which utilizes two continuous grids to detect anomalies from both normal and abnormal perspectives. In this work: 1) Grids serve as feature repositories to assist in the reconstruction task, achieving stronger generalization compared to discrete storage, while also helping to avoid the Identical Shortcut (IS) problem common in general reconstruction methods. 2) An additional grid storing abnormal features is introduced alongside the normal grid storing normal features, which refines the boundaries of normal features, thereby enhancing GRAD’s detection performance for fine-grained defects. 3) The Feature Block Pasting (FBP) module is designed to synthesize a variety of anomalies at the feature level, enabling the rapid deployment of the abnormal grid. Additionally, benefiting from the powerful representation capabilities of grids, GRAD is suitable for a unified task setting, requiring only a single model to be trained for multiple classes. GRAD has been comprehensively tested on classic industrial datasets including MVTecAD, VisA, and the newest GoodsAD dataset, showing significant improvement over current state-of-the-art methods.

Abstract:
Recent years have witnessed the great success of multi-view learning empowered by deep ConvNets, leveraging a large number of network parameters. Nevertheless, there is an ongoing consideration regarding the essentiality of all these parameters in multi-view ConvNets. As we know, hypernetworks offer a promising solution to reduce the number of parameters by learning a concise network to generate weights for the larger target network, illustrating the presence of redundant information within network parameters. However, how to leverage hypernetworks for learning parameter-efficient multi-view ConvNets remains underexplored. In this paper, we present a lightweight multi-layer shared Hyper-Adaptive network (HAda), aiming to simultaneously generate adaptive weights for different views and convolutional layers of deep multi-view ConvNets. The adaptability inherent in HAda not only contributes to a substantial reduction in parameter redundancy but also enables the modeling of intricate view-aware and layer-wise information. This capability ensures the maintenance of high performance, ultimately achieving parameter-efficient learning. Specifically, we design a multi-view shared module in HAda to capture information common across views. This module incorporates a shared global gated interpolation strategy, which generates layer-wise gating factors. These factors facilitate adaptive interpolation of global contextual information into the weights. Meanwhile, we put forward a tailored weight-calibrated adapter for each view that facilitates the conveyance of view-specific information. These adapters generate view-adaptive weight scaling calibrators, allowing the selective emphasis of personalized information for each view without introducing excessive parameters. Extensive experiments on six publicly available datasets demonstrate the effectiveness of the proposed method. In particular, HAda can serve as a flexible plug-in strategy to work well with existing multi-view methods for both image classification and image clustering tasks.

Abstract:
We propose PhaseForensics, a DeepFake (DF) video detection method that uses a phase-based motion representation of facial temporal dynamics. Existing methods that rely on temporal information across video frames for DF detection have many advantages over methods that only utilize per-frame features. However, these temporal DF detection methods still show limited cross-dataset generalization and robustness to common distortions due to factors such as error-prone motion estimation, inaccurate landmark tracking, or the susceptibility of pixel intensity-based features to adversarial distortions and cross-dataset domain shifts. Our key insight to overcome these issues is to leverage the temporal phase variations in the band-pass frequency components of a face region across video frames. This not only enables a robust estimate of the temporal dynamics in the facial regions, but is also less prone to cross-dataset variations. Furthermore, we show that the band-pass filters used to compute the local per-frame phase form an effective defense against the perturbations commonly seen in gradient-based adversarial attacks. Overall, with PhaseForensics, we show improved distortion and adversarial robustness, and state-of-the-art cross-dataset generalization, with 92.4% video-level AUC on the challenging CelebDFv2 benchmark (a recent state-of-the-art method, FTCN, compares at 86.9%).

Abstract:
End-to-end image and video codecs are becoming increasingly competitive compared to traditional compression techniques that have been developed through decades of manual engineering efforts. These trainable codecs have many advantages over traditional techniques, such as their straightforward adaptation to perceptual distortion metrics and high performance in specific fields thanks to their learning ability. However, current state-of-the-art neural codecs do not fully exploit the benefits of vector quantization and the existence of the entropy gradient in decoding devices. In this paper, we propose to leverage these two properties (vector quantization and entropy gradient) to improve the performance of off-the-shelf codecs. Firstly, we demonstrate that using non-uniform scalar quantization cannot improve performance over uniform quantization. We thus suggest using predefined optimal uniform vector quantization to improve performance. Secondly, we show that the entropy gradient, available at the decoder, is correlated with the reconstruction error gradient, which is not available at the decoder. We therefore use the former as a proxy to enhance compression performance. Our experimental results show that these approaches save between 1% and 3% of the rate for the same quality across various pre-trained methods. In addition, the entropy gradient-based solution significantly improves traditional codec performance as well.

Abstract:
Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.
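
The vector quantizer mentioned in this abstract maps a continuous latent vector to an entry of a learned codebook. The sketch below shows only that nearest-entry look-up. The codebook values, the squared Euclidean distance, and the name `nearest_code` are assumptions for illustration, and the abstract does not specify how the quantizer is trained or how the result modulates the reconstruction network.

```python
def nearest_code(z, codebook):
    """Return (index, code) of the codebook vector closest to z (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))
    return idx, codebook[idx]


# Hypothetical 2-D codebook with three entries, e.g. one per light transport condition.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Latent vector as it might come out of an encoder (made-up values).
idx, code = nearest_code([0.8, 1.3], codebook)
print(idx, code)
```

Quantizing the latent in this way restricts the representation to a small discrete set, so a single network can be conditioned on whichever entry is selected.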

Abstract:
Recently, MLP-based architectures have achieved competitive performance with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, thereby lacking the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain comparable results with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. With larger FLOPs, it exceeds state-of-the-art CNNs, ViTs, and MLPs under similar computational complexity. Furthermore, experiments in object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired from PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials, including the training code and logs, are released at https://github.com/huangqiuyu/PFMLP.

Abstract:
As information acquisition diversifies, data is acquired and stored in an increasing number of modalities. However, sensor failures or equipment issues can lead to partial data loss in certain views, resulting in incomplete multi-view clustering (IMVC) problems. Although some prototype-based IMVC methods have achieved satisfactory performance, almost all of these methods implicitly assume that the cross-view prototypes are aligned. However, during the generation or selection of prototypes, different networks could produce different prototypes, thereby leading to potential misalignment of prototypes across views, i.e., the prototype-unaligned problem (PUP). The presence of PUP could cause the model to overfit. Additionally, when recovering the missing data, there is uncertainty in data quality under different missing rates, which could lead to the performance instability problem (PIP). To address these issues, we propose Prototype Matching Learning for Incomplete Multi-view Clustering (PMIMC). Specifically, PMIMC leverages relational consistency learning to mitigate the heterogeneity of multi-view data. Subsequently, we design a robust prototype contrastive learning loss for the generated prototypes to reduce the effects of PUP. Finally, we propose a prototype-based imputation strategy that aims to alleviate the instability of imputation under high missing rates. Extensive experiments demonstrate that PMIMC outperforms 13 state-of-the-art methods in terms of clustering performance and robustness. The code is available at: https://github.com/hl-yuan/PMIMC.

Abstract:
In a conventional Domain Adaptation (DA) setting, we only have one source and target domain, whereas, in many real-world applications, data is often collected from several related sources in different conditions. This has led to a more practical and challenging knowledge transfer problem called Multi-source Domain Adaptation (MDA). Several methodologies, such as prototype matching, explicit distance discrepancy, adversarial learning, etc., have been considered to tackle the MDA problem in recent years. Among them, the adversarial-based learning framework is a popular methodology for transferring knowledge from multiple sources to target domains using a min-max optimization strategy. Despite the advances in adversarial-based methods, several limitations exist, such as the need for a classifier-aware discrepancy metric to align the domains and the need to consider target samples’ consistency and semantic information while aligning the domains. To mitigate these issues, in this work, we propose a novel adversarial learning MDA algorithm, MDAMA, which aligns the target domain with a mixture distribution that consists of source domains. MDAMA uses margin-based discrepancy and augmented intermediate distributions to align the domains effectively. We also propose consistency of target samples by confidence thresholding and transfer of semantic information from multiple source domains to the augmented target domain to further improve the performance of the target domain. We extensively experiment with the MDAMA algorithm on popular real-world MDA datasets such as OfficeHome, Office31, PACS, Office-Caltech, and DomainNet. We evaluate the MDAMA model on these benchmark datasets and demonstrate top performance in all of them.
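
Confidence thresholding on target samples, as mentioned in this abstract, is commonly implemented by keeping only the predictions whose top softmax probability reaches a threshold. The sketch below illustrates that generic step. The threshold value, the logits, and the function names are assumptions, and the abstract does not describe how MDAMA uses the selected samples afterwards.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_confident(logits_batch, threshold=0.9):
    """Return (sample_index, pseudo_label) pairs whose top probability reaches the threshold."""
    selected = []
    for i, logits in enumerate(logits_batch):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected


# Made-up classifier outputs for three unlabeled target samples.
batch = [[4.0, 0.0, 0.0], [1.0, 0.9, 0.8], [0.0, 0.0, 5.0]]
pseudo = select_confident(batch, threshold=0.9)
print(pseudo)
```

The second sample is dropped because its prediction is close to uniform, so only confidently classified target samples would contribute pseudo-labels.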

Abstract:
Generative models have recently attracted much attention for handling the generalized zero-shot learning (GZSL) task. Most existing generative GZSL models are trained for visual feature synthesis using the unique semantic feature of each object class as input and its diverse real visual features as supervision. However, since the real visual features are inevitably infiltrated by some class-irrelevant information, the trained generative models cannot guarantee the discriminability of their synthesized visual features. In this paper, we first provide an empirical analysis of this problem, finding that among the elements of the real visual features, some elements contain more class-irrelevant information than others, resulting in ambiguous visual feature synthesis. Based on this finding, we propose a self-assembled generative GZSL framework, called SaG, where both the real and synthesized visual features are re-assembled by identifying and updating the class-irrelevant elements in a self-learning manner. Moreover, an element-affinity regularizer is explored for constraining the affinity among different elements, so that the synthesized visual features under the SaG framework approach the updated feature elements. In principle, different generative GZSL models could be seamlessly embedded into the SaG framework, resulting in different GZSL methods. Extensive experimental results demonstrate that the methods derived by embedding three baseline generative GZSL models into SaG boost the performance of their baselines significantly, and one of the derived methods outperforms 20 state-of-the-art GZSL methods in most cases.

Abstract:
A factored display emits full-parallax dense-view light fields for a glasses-free 3D experience without sacrificing the spatial resolution of a liquid-crystal display (LCD). For static light fields, it achieves high-quality reconstruction by applying frame-based low-rank factorization to time-multiplexed sub-frame contents of stacked LCDs. However, for light field videos, such frame-based factorization could introduce reconstruction artifacts and visual flickers and further cause human discomfort. The artifacts mainly come from incomplete constraints for the emitted light fields, which are actually perceived in continuous time instead of discrete frames. In particular, the perceived light fields are related to the persistence-of-vision (POV) effect of human eyes and the refresh rates of LCDs, which is not well explored in previous work. In this work, we introduce a light-field video factorization framework, temporal fusion (TF), to resolve these issues. To begin with, we explicitly formulate the continuous-time POV effect into a global factorization objective functional to eliminate visual flickers and enhance image quality. We further show that this optimization problem can be solved by sequence-level iterative updates on LCD sub-frames. Then, to tackle the enormous requirement of memory access for the sequence-level processing flow, we devise an efficient cuboid-wise factorization algorithm which enables practical GPU implementation. We also devise another lightweight causal framework, TF-C, for supporting low-latency applications. Finally, extensive experiments are performed to verify its effectiveness. Compared to the plain frame-based factorization, TF/TF-C can improve temporal consistency by reducing flicker values by 85%/91% and enhance reconstruction quality by increasing PSNR values by 5.0dB/3.7dB. In addition, we present a prototype dual-layer factored display, which was built with two 240-Hz high-refresh-rate LCDs, to demonstrate the visual quality for real-life applications.

Abstract:
Semi-Supervised Object Detection (SSOD) aims to improve the utilization of unlabeled data, and various methods, such as adaptive threshold techniques, have been extensively studied to increase exploitable information. However, these methods are passive, relying solely on the original image data. Additionally, existing approaches prioritize the predicted categories of the teacher model while overlooking the relationships between different categories in the prediction. In this paper, we introduce a novel approach called Dense Information Learning (DIL), which actively generates unlabeled data containing densely exploitable information and forces the network to have relation consistency under different perturbations. Specifically, Dense Information Augmentation (DIA) leverages the prior information of the network to create a foreground bank and actively incorporates exploitable information into the unlabeled data. DIA automatically performs information enhancement and filters noise. Furthermore, to encourage the network to maintain consistency at the manifold level under various perturbations, we introduce Relation Consistency Regularization (RCR). It considers both feature-level and image-level perturbations, guiding the network to focus on more discriminative features. Extensive experiments conducted on multiple datasets validate the effectiveness of our approach in leveraging information from unlabeled images. The proposed DIL improves the mAP by 12.6% and 10.0% relative to the supervised baseline method when utilizing 5% and 10% of labeled data on the MS-COCO dataset, respectively.

Abstract:
Deep hashing is one of the most important methods for generating compact feature representations in content-based image retrieval. However, in various application scenarios, it requires training different models with diversified memory and computational resource costs. To address this problem, in this paper, we propose a new scalable deep hashing framework, which aims to generate binary codes with different code lengths by adaptive bit selection. Specifically, the proposed framework consists of two alternating steps, i.e., bit pool generation and adaptive bit selection. In the first step, a deep feature extraction model is trained to output binary codes by optimizing retrieval performance and bit properties. In the second step, we select informative bits from the generated bit pool with a reinforcement learning algorithm, in which the same retrieval performance and bit properties are directly used in computing the reward. The bit pool can be further updated by fine-tuning the deep feature extraction model with more attention on the selected bits. Hence, these two steps are alternately iterated until convergence is achieved. Notably, most existing binary hashing methods can be readily integrated into our framework to generate scalable binary codes. Experiments on four public image datasets demonstrate the effectiveness of the proposed framework for image retrieval tasks.

Abstract:
With the increasing consumption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera shooting result in a substantial increase in data volume, making storage and transmission a challenging task. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use the explicit representation-based 2D video codec to encode one of the source views. Subsequently, we propose employing the implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view index of multi-view video as coordinate input and generates the corresponding implicit reconstruction frames. To enhance the compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codec. Extensive experiments conducted on public datasets demonstrate that the proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling. The source code can be found at https://github.com/zc-lynen/MV-IERV.

Abstract:
The last decade has witnessed significant advances in semantic segmentation brought about by deep learning. However, existing methods only fit the data-label correspondence in a data-driven manner and do not fully conform to the abstraction and structuralization characteristics of the human visual cognition process, which limits the upper bounds of their performance. To this end, a multi-grained logical prototype (MGLP) method is proposed to rethink semantic segmentation based on these two key characteristics. Its novel design can be summarized as follows. 1) For abstraction, prototypes of the same class at different grain levels are established: a label generation method is proposed to automatically generate a multi-grained label space, which can guide the learning of the multi-grained prototypes for each class. 2) For structuralization, the intrinsic logical structure across different semantic levels is explicitly modeled: the horizontal metric relationships are established via metric relation operations on prototypes at the same grain level, to improve the discriminability between classes while taking the vertical semantic hierarchy into account. Moreover, the vertical logical relationships are established as the sub-to-super positive and super-to-sub negative constraints, to strengthen the semantic dependencies among prototypes at different grain levels. 3) MGLP is plug-and-play and can be directly combined with existing segmentation methods. Extensive experimental results indicate that MGLP can significantly improve the segmentation performance of existing methods, which opens up a new avenue for future research.

Abstract:
Multi-modal image synthesis is crucial for obtaining complete modalities due to the imaging restrictions in reality. Current methods, primarily CNN-based models, find it challenging to extract global representations because of local inductive bias, leading to synthetic structure deformation or color distortion. Despite the significant global representation ability of transformer in capturing long-range dependencies, its huge parameter size requires considerable training data. Multi-modal synthesis solely based on one of the two structures makes it hard to extract comprehensive information from each modality with limited data. To tackle this dilemma, we propose a simple yet effective Recursive TransFusion (RTF) framework for multi-modal image synthesis. Specifically, we develop a TransFusion unit to integrate local knowledge extracted from the individual modality by connecting a CNN-based local representation block (LRB) and a transformer-based global fusion block (GFB) via a feature translating gate (FTG). Considering the numerous parameters introduced by the transformer, we further unfold a TransFusion unit with recursive constraint repeatedly, forming recursive TransFusion (RTF), which progressively extracts multi-modal information at different depths. Our RTF remarkably reduces network parameters while maintaining superior performance. Extensive experiments validate our superiority against the competing methods on multiple benchmarks. The source code will be available at https://github.com/guoliangq/RTF.

Abstract:
Contemporary deep face recognition techniques predominantly utilize the Softmax loss function, designed based on the similarities between sample features and class prototypes. These similarities can be categorized into four types: in-sample target similarity, in-sample non-target similarity, out-sample target similarity, and out-sample non-target similarity. When a sample feature from a specific class is designated as the anchor, the similarity between this sample and any class prototype is referred to as in-sample similarity. In contrast, the similarity between samples from other classes and any class prototype is known as out-sample similarity. The terms target and non-target indicate whether the sample and the class prototype used for similarity calculation belong to the same identity or not. The conventional Softmax loss function promotes higher in-sample target similarity than in-sample non-target similarity. However, it overlooks the relation between in-sample and out-sample similarity. In this paper, we propose Global Cross-Entropy loss (GCE), which promotes 1) greater in-sample target similarity over both the in-sample and out-sample non-target similarity, and 2) smaller in-sample non-target similarity to both in-sample and out-sample target similarity. In addition, we propose to establish a bilateral margin penalty for both in-sample target and non-target similarity, so that the discrimination and generalization of the deep face model are improved. To bridge the gap between training and testing of face recognition, we adapt the GCE loss into a pairwise framework by randomly replacing some class prototypes with sample features. We designate the model trained with the proposed Global Cross-Entropy loss as GFace. Extensive experiments on several public face benchmarks, including LFW, CALFW, CPLFW, CFP-FP, AgeDB, IJB-C, IJB-B, MFR-Ongoing, and MegaFace, demonstrate the superiority of GFace over other methods. 
Additionally, GFace exhibits robust performance in general visual recognition task.

Abstract:
Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks, and often intersecting with people and background objects irregularly, which adds complexity to the inpainting process. Recognizing the limitations posed by existing video wire datasets, which are characterized by their small size, poor quality, and limited variety of scenes, we introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks. WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both the traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods. Our codes and the WRV2 dataset will be made available at: https://github.com/Suyimu/WRV2.

Abstract:
In learned image compression, probabilistic models play an essential role in characterizing the distribution of latent variables. The Gaussian model with mean and scale parameters has been widely used for its simplicity and effectiveness. Probabilistic models with more parameters, such as the Gaussian mixture models, can fit the distribution of latent variables more precisely, but the corresponding complexity is higher. To balance the compression performance and complexity, we extend the Gaussian model to the generalized Gaussian family for more flexible latent distribution modeling, introducing only one additional shape parameter \beta than the Gaussian model. To enhance the performance of the generalized Gaussian model by alleviating the train-test mismatch, we propose improved training methods, including \beta -dependent lower bounds for scale parameters and gradient rectification. Our proposed generalized Gaussian model, coupled with the improved training methods, is demonstrated to outperform the Gaussian and Gaussian mixture models on a variety of learned image compression networks.
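
As a rough illustration of why a single shape parameter adds flexibility, the sketch below evaluates the generalized Gaussian density in its standard textbook parameterization (location mu, scale alpha, shape beta). The abstract does not state which parameterization the paper uses, so the function names and the parameterization here are assumptions made for illustration only.

```python
import math


def generalized_gaussian_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Generalized Gaussian density (textbook form, assumed here):
    beta / (2 * alpha * Gamma(1 / beta)) * exp(-(|x - mu| / alpha) ** beta)
    """
    coeff = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coeff * math.exp(-((abs(x - mu) / alpha) ** beta))


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Ordinary Gaussian density, used only as a reference."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, mu=0.0, b=1.0):
    """Laplace density, used only as a reference."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)
```

In this parameterization, beta = 2 with alpha = sqrt(2) * sigma recovers the Gaussian, and beta = 1 with alpha = b recovers the Laplace density, so the shape parameter interpolates between (and beyond) these familiar cases.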

Abstract:
Deep networks notoriously suffer from performance deterioration on previous tasks when learning from sequential tasks, i.e., catastrophic forgetting. Recent gradient projection methods show that forgetting results from gradient interference on old tasks and accordingly propose to update the network in a direction orthogonal to the task space. However, these methods assume the task space is invariant and neglect the gradual change between tasks, resulting in sub-optimal gradient projection and a compromise of the continual learning capacity. To tackle this problem, we propose to embed each task subspace into a non-Euclidean manifold, which can naturally capture the change of tasks since the manifold is intrinsically non-static compared to the Euclidean space. Subsequently, we analytically derive the accumulated projection between any two subspaces on the manifold along the geodesic path by integrating an infinite number of intermediate subspaces. Building upon this derivation, we propose a novel geodesic-aligned gradient projection (GAGP) method that harnesses the accumulated projection to mitigate catastrophic forgetting. The proposed method utilizes the geometric structure information on the task manifold by capturing the gradual change between the new and the old tasks. Empirical studies on image classification demonstrate that the proposed method alleviates catastrophic forgetting and achieves on-par or better performance compared to the state-of-the-art approaches.
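
The plain orthogonal projection used by earlier gradient projection methods (not the geodesic accumulated projection of GAGP, which the abstract does not specify in detail) can be sketched as follows. The helper names are hypothetical, and the old-task subspace is assumed to be given as an orthonormal basis.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, basis):
    """Return grad minus its components along each basis vector.

    Assumes `basis` is orthonormal, so the result is
    grad - sum_i (grad . u_i) * u_i, which is orthogonal to the subspace.
    """
    result = list(grad)
    for u in basis:
        coeff = dot(grad, u)
        result = [r - coeff * a for r, a in zip(result, u)]
    return result
```

For example, with an old-task subspace spanned by the first two coordinate axes, `project_out([3, 4, 5], [[1, 0, 0], [0, 1, 0]])` keeps only the third component, so an update along the projected gradient leaves that subspace untouched.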

Abstract:
Unsupervised object re-identification (Re-ID) aims to learn discriminative features without identity annotations. Existing mainstream methods are usually developed based on convolutional neural networks for feature extraction and pseudo-label estimation. However, convolutional neural networks suffer from limitations in capturing dispersed long-range dependencies and integrating global information. In comparison, vision transformers demonstrate superior robustness in complex environments, leveraging their versatile modeling capabilities to process diverse data structures with greater precision. In this paper, we delve into the potential of vision transformers in unsupervised Re-ID, proposing a Transformer-based perception-assisted framework (PAT). Considering Re-ID is a typical fine-grained task, existing unsupervised Re-ID methods relying on pseudo-labels generated by clustering algorithms provide only category-level discriminative supervision, with limited attention to local details. Therefore, we propose a novel target-aware mask alignment (TMA) strategy that provides additional supervision signals by leveraging low-level visual cues. Specifically, we employ pseudo-labels to guide the fine-grained alignment of features with local pixel information from critical discriminative regions. This method establishes a mutual learning mechanism via a shared Transformer, effectively balancing discriminative learning and detailed understanding. Furthermore, we propose a perceptual fusion feature augmentation (PFA) method to optimize instance-level discriminative learning. The proposed method is evaluated on multiple Re-ID datasets, demonstrating superior performance and robustness in comparison to state-of-the-art techniques. Notably, without annotations, our method achieves better results than many supervised counterparts. The code will be released.

Abstract:
LiDAR segmentation has become a crucial component of advanced autonomous driving systems. Recent range-view LiDAR segmentation approaches show promise for real-time processing. However, they inevitably suffer from corrupted contextual information and rely heavily on post-processing techniques for prediction refinement. In this work, we propose FRNet, a simple yet powerful method aimed at restoring the contextual information of range image pixels using corresponding frustum LiDAR points. First, a frustum feature encoder module is used to extract per-point features within the frustum region, which preserves scene consistency and is critical for point-level predictions. Next, a frustum-point fusion module is introduced to update per-point features hierarchically, enabling each point to extract more surrounding information through the frustum features. Finally, a head fusion module is used to fuse features at different levels for final semantic predictions. Extensive experiments conducted on four popular LiDAR segmentation benchmarks under various task setups demonstrate the superiority of FRNet. Notably, FRNet achieves 73.3% and 82.5% mIoU scores on the testing sets of SemanticKITTI and nuScenes. While achieving competitive performance, FRNet operates 5 times faster than state-of-the-art approaches. Such high efficiency opens up new possibilities for more scalable LiDAR segmentation. The code has been made publicly available at https://github.com/Xiangxu-0103/FRNet.

Abstract:
Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model’s perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model’s perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at https://github.com/I2-Multimedia-Lab/UGRAN

Abstract:
Pre-trained large text-to-image (T2I) models with appropriate text prompts have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images by following a set of customized styles in a never-ending manner, and gradually accumulates these creative artistic works as a Museum. When facing a new customization style, we develop a style distillation loss module to extract and learn the styles of the training data for the new image generation task. It can minimize the learning biases caused by the content of new training images, and address the catastrophic overfitting issue induced by few-shot images. To deal with the catastrophic forgetting issue among previously learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of model update, which regularizes the diffusion model from both the weight and feature aspects. Meanwhile, to further preserve historical knowledge from past styles and address the limited representability of LoRA, we design a task-wise token learning module where a unique token embedding is learned to denote a new style. As each new user-provided style arrives, MuseumMaker can capture the nuances of the new style while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.

Abstract:
Regularization of inverse problems is of paramount importance in computational imaging. The ability of neural networks to learn efficient image representations has been recently exploited to design powerful data-driven regularizers. While state-of-the-art plug-and-play (PnP) methods rely on an implicit regularization provided by neural denoisers, alternative Bayesian approaches consider Maximum A Posteriori (MAP) estimation in the latent space of a generative model, thus with an explicit regularization. However, state-of-the-art deep generative models require a huge amount of training data compared to denoisers. Besides, their complexity hampers the optimization involved in latent MAP derivation. In this work, we first propose to use compressive autoencoders instead. These networks, which can be seen as variational autoencoders with a flexible latent prior, are smaller and easier to train than state-of-the-art generative models. As a second contribution, we introduce the Variational Bayes Latent Estimation (VBLE) algorithm, which performs latent estimation within the framework of variational inference. Thanks to a simple yet efficient parameterization of the variational posterior, VBLE allows for fast and easy (approximate) posterior sampling. Experimental results on the image datasets BSD and FFHQ demonstrate that VBLE reaches performance similar to that of state-of-the-art PnP methods, while being able to quantify uncertainties significantly faster than other existing posterior sampling techniques. The code associated with this paper is available at https://github.com/MaudBqrd/VBLE

Abstract:
Accurate extraction of molecular representations is a critical step in the drug discovery process. In recent years, significant progress has been made in molecular representation learning methods, among which multi-modal molecular representation methods based on images and 2D/3D topologies have become increasingly mainstream. However, these existing multi-modal approaches often directly fuse information from different modalities, overlooking the potential of intermodal interactions and failing to adequately capture the complex higher-order relationships and invariant features between molecules. To overcome these challenges, we propose a structure-awareness-based multi-modal self-supervised molecular representation pre-training framework (MMSA) designed to enhance molecular graph representations by leveraging invariant knowledge between molecules. The framework consists of two main modules: the multi-modal molecular representation learning module and the structure-awareness module. The multi-modal molecular representation learning module collaboratively processes information from different modalities of the same molecule to overcome intermodal differences and generate a unified molecular embedding. Subsequently, the structure-awareness module enhances the molecular representation by constructing a hypergraph structure to model higher-order correlations between molecules. This module also introduces a memory mechanism for storing typical molecular representations, aligning them with memory anchors in the memory bank to integrate invariant knowledge, thereby improving the model’s generalization ability. Compared to existing multi-modal approaches, MMSA can be seamlessly integrated with any graph-based method and supports multiple molecular data modalities, ensuring both versatility and compatibility.
Extensive experiments have demonstrated the effectiveness of MMSA, which achieves state-of-the-art performance on the MoleculeNet benchmark, with average ROC-AUC improvements ranging from 1.8% to 9.6% over baseline methods.

Abstract:
Semi-supervised learning (SSL) uses labeled and unlabeled data from known classes for training, assuming the test data contains only those classes. However, in real-world scenarios, new classes can appear. Generalized Category Discovery (GCD) extends SSL to handle unlabeled samples that may belong to both known and unknown categories. The challenge arises from the lack of prior information about the unknown categories. We propose to generate unknown samples to address the GCD problem, called Generalized Category Discovery with Unknown Sample Generation (GCDUSG). Since the number of unknown categories is uncertain, we propose a prototype alignment method to estimate both the class numbers and pseudo-labels for unlabeled samples, thereby enabling us to learn the unknown prototypes. We develop a process for generating realistic and discriminative unknown samples based on the known-unknown relationships between known and unknown prototypes. We achieve this by minimizing the class-wise Maximum Mean Discrepancy distance between the generated samples and the selected unknown samples. To account for the pseudo-labels assigned to unlabeled samples, we train a classifier using all samples, incorporating a pseudo-label supervision loss to mitigate the impact of potentially erroneous labels. This comprehensive training equips the classifier to effectively handle both known and unknown classes during testing. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of our approach.

Abstract:
Neural networks have achieved significant advances in the field of image restoration and much research has focused on designing new architectures for convolutional neural networks (CNNs) and Transformers. The choice of loss functions, despite being a critical factor when training image restoration networks, has attracted little attention. The existing losses are primarily based on semantic or hand-crafted representations. Recently, discrete representations have demonstrated strong capabilities in representing images. In this work, we explore the loss of discrete representations for image restoration. Specifically, we propose a Local Residual Quantized Variational AutoEncoder (Local RQ-VAE) to learn prototype vectors that represent the local details of high-quality images. Then we propose a Prototypical Distribution Divergence (PDD) loss that measures the Kullback-Leibler divergence between the prototypical distributions of the restored and target images. Experimental results demonstrate that our PDD loss improves the restored images in both PSNR and visual quality for state-of-the-art CNNs and Transformers on several image restoration tasks, including image super-resolution, image denoising, image motion deblurring, and defocus deblurring.
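
To make the compared quantity concrete, the sketch below computes a Kullback-Leibler divergence between two discrete distributions over prototype indices. How the paper builds its prototypical distributions and which KL direction it uses are not given in the abstract; the hard-assignment histogram, the direction KL(target || restored), and all names here are assumptions for illustration.

```python
import math


def prototype_histogram(indices, num_prototypes):
    """Normalized histogram of prototype indices (hypothetical hard assignment)."""
    counts = [0] * num_prototypes
    for i in indices:
        counts[i] += 1
    total = len(indices)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

A hard histogram like this is not differentiable, so a trainable loss would presumably need soft assignments; the sketch only shows what the divergence measures: it is zero when the restored and target images use the prototypes in the same proportions and grows as those proportions diverge.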

Abstract:
Unsupervised domain adaptation enables the transfer of knowledge from a labeled source domain to an unlabeled target domain, and its application in crowd counting is gaining momentum. Current methods typically align distributions across domains to address inter-domain disparities at a global level. However, these methods often struggle with significant intra-domain gaps caused by domain-agnostic factors such as density, surveillance angles, and scale, leading to inaccurate alignment and unnecessary computational burdens, especially in large-scale training scenarios. To address these challenges, we propose the Multi-Granularity Optimal Transport (MGOT) distribution alignment framework, which aligns domain-agnostic factors across domains at different granularities. The motivation behind multi-granularity is to capture fine-grained domain-agnostic variations within domains. Our method proceeds in three phases: first, clustering coarse-grained features based on intra-domain similarity; second, aligning the granular clusters using an optimal transport framework and constructing a mapping from cluster centers to finer patch levels between domains; and third, re-weighting the aligned distribution for model refinement in domain adaptation. Extensive experiments across twelve cross-domain benchmarks show that our method outperforms existing state-of-the-art methods in adaptive crowd counting. The code will be available at https://github.com/HopooLinZ/MGOT

Abstract:
Class incremental learning (CIL) endeavors to acquire new knowledge continuously from an unending data stream while retaining previously acquired knowledge. Since the amount of new data is significantly smaller than that of old data, existing methods struggle to strike a balance between acquiring new knowledge and retaining previously learned knowledge, leading to substantial performance degradation. To tackle such a dilemma, in this paper, we propose the Contrastive Complementary Augmentation Learning (CoLA) method, which mitigates the aliasing of distributions in incremental tasks. Specifically, we introduce a novel yet effective supervised contrastive learning module with instance- and class-level augmentation during base training. For the instance-level augmentation method, we spatially segment the image at different scales, creating spatial pyramid contrastive pairs to obtain more robust feature representations. Meanwhile, the class-level augmentation method randomly mixes images within the mini-batch, facilitating the learning of compact and more easily adaptable decision boundaries. In this way, we only need to train the classifier to maintain competitive performance during the incremental phases. Furthermore, we also propose CoLA+ to further enhance the proposed method with relaxed limitations on data storage. Extensive experiments demonstrate that our method achieves state-of-the-art performance on different benchmarks.

Abstract:
Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design. Based on these insights, we propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity, aiming to combine the strengths of both data-driven and model-driven approaches. Our model operates progressively following the unfolding paradigm. Each iteration consists of multiple Mixed-Scale Gating Modules (MSGM) and an Efficient Sparse Attention Module (ESAM). The former implements comprehensive constraints on features, including a structural similarity constraint, while the latter aims to achieve sparse activation. In addition, we design a Mixture-of-Experts-based Feature Selector (MoE-FS) that fully utilizes multi-level feature information by combining features from different steps. Extensive experiments validate the efficacy and efficiency of our unfolding-inspired network. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption. Our code will be available at: https://github.com/eezkni/SSIU

Abstract:
All-in-one image restoration, which seeks to handle multiple types of degradation within a unified model, has become a prominent research topic in computer vision. While existing deep learning models have achieved remarkable success in specific restoration tasks, extending these models to heterogeneous degradations presents significant challenges. Current all-in-one methods predominantly concentrate on extracting degradation priors, often employing learned and fixed task prompts to guide the restoration process. However, these static prompts tend to capture only the average distribution characteristics of degradations and cannot accurately depict the unique attributes of a given input, consequently yielding suboptimal restoration results. To tackle these challenges, we propose a novel dynamic prompt approach called Degradation Prototype Assignment and Prompt Distribution Learning (DPPD). Our approach decouples the degradation prior extraction into two novel components: Degradation Prototype Assignment (DPA) and Prompt Distribution Learning (PDL). DPA anchors the degradation representations to predefined prototypes, providing discriminative and scalable representations. In addition, PDL models prompts as distributions rather than fixed parameters, facilitating dynamic and adaptive prompt sampling. Extensive experiments demonstrate that our DPPD framework can achieve significant performance improvement on different image restoration tasks. Code is available at our project page https://github.com/Aitical/DPPD

Abstract:
Audio-visual Segmentation (AVS) is conceptualized as a conditional generation task, where audio is considered as the conditional variable for segmenting the sound producer(s). In this case, audio should be extensively explored to maximize its contribution for the final segmentation task. We propose a contrastive conditional latent diffusion model for audio-visual segmentation (AVS) to thoroughly investigate the impact of audio, where the correlation between audio and the final segmentation map is modeled to guarantee the strong correlation between them. To achieve semantic-correlated representation learning, our framework incorporates a latent diffusion model. The diffusion model learns the conditional generation process of the ground-truth segmentation map, resulting in ground-truth aware inference during the denoising process at the test stage. As our model is conditional, it is vital to ensure that the conditional variable contributes to the model output. We thus extensively model the contribution of the audio signal by minimizing the density ratio between the conditional probability of the multimodal data, e.g. conditioned on the audio-visual data, and that of the unimodal data, e.g. conditioned on the audio data only. In this way, our latent diffusion model via density ratio optimization explicitly maximizes the contribution of audio for AVS, which can then be achieved with contrastive learning as a constraint, where the diffusion part serves as the main objective to achieve maximum likelihood estimation, and the density ratio optimization part imposes the constraint. By adopting this latent diffusion model via contrastive learning, we effectively enhance the contribution of audio for AVS. The effectiveness of our solution is validated through experimental results on the benchmark dataset. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS

Abstract:
Multi-view clustering (MVC) has attracted increasing attention with the emergence of various data collected from multiple sources. In real-world dynamic environments, instances are continually gathered, and the number of views expands as new data sources become available. Learning under such a simultaneous increment of instances and views, particularly in unsupervised scenarios, is crucial yet underexplored. In this paper, we address this problem by proposing a novel MVC method with Incremental Instances and Views, MVC-IIV for short. MVC-IIV contains two stages, an initial stage and an incremental stage. In the initial stage, a basic latent multi-view subspace clustering model is constructed to handle existing data, which can be viewed as traditional static MVC. In the incremental stage, the previously trained model is reused to guide learning for newly arriving instances with new views, transferring historical knowledge while avoiding redundant computations. Specifically, we design and reuse two modules, i.e., a multi-view embedding module for low-dimensional representation learning and a consensus centroids module for cluster probability learning. By adding consistency regularization on the two modules, the knowledge acquired from previous data is used, which not only enhances the exploration within the current data batch but also extracts the between-batch data correlations. The proposed model can be efficiently solved with linear space and time complexity. Extensive experiments demonstrate the effectiveness and efficiency of our method compared with state-of-the-art approaches.

Abstract:
Multi-modal feature fusion, a core component of RGBT tracking, has been the subject of numerous studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, which struggle to handle the various challenges in dynamic scenarios. To address this problem, this work presents a novel Attention-based Fusion router called AFTER, which optimizes the fusion structure to adapt to dynamic challenging scenarios for robust RGBT tracking. In particular, we design a fusion structure space based on the hierarchical attention network, in which each attention-based fusion unit corresponds to a fusion operation and a combination of these attention units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex search over different structures in neural architecture search algorithms, we develop a dynamic routing algorithm, which equips each attention-based fusion unit with a router, to predict the combination weights for efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFTER against state-of-the-art RGBT trackers. We release the code at https://github.com/Alexadlu/AFter

Abstract:
The alpha-tree, also known as the quasi-flat zone hierarchy, is a widely used representation of images in Mathematical Morphology. This structure organizes the regions into a tree according to a similarity criterion, which eases the multiscale analysis of images. Many alpha-tree algorithms exist, and computing this structure efficiently is still an active field of research. Indeed, the alpha-tree is commonly used in remote sensing, where there is a pressing need for fast processing of terabyte-scale images. In this paper, we propose the first massively parallel alpha-tree algorithm that leverages concurrent union-find data structures to exploit the SIMT (Single Instruction Multiple Threads) programming model of GPUs. Our algorithm outperforms state-of-the-art parallel CPU algorithms by a factor of 10 on average on desktop computers and servers. It also opens new perspectives for using Mathematical Morphology methods on GPU pipelines.
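
To make the role of union-find concrete, the following sequential sketch labels the quasi-flat zones of a small grayscale image at a single threshold `alpha`: 4-neighbouring pixels whose absolute gray-level difference is at most `alpha` are merged into one zone. This is not the parallel GPU algorithm of the paper; the function names, the 4-connectivity, and the absolute-difference dissimilarity are assumptions chosen for illustration.

```python
def find(parent, x):
    """Return the representative of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def quasi_flat_zones(image, alpha):
    """Label the alpha-connected components of a 2D image (list of rows)."""
    h, w = len(image), len(image[0])
    parent = list(range(h * w))
    # Edges between 4-neighbours, weighted by absolute gray-level difference.
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((abs(image[y][x] - image[y][x + 1]),
                              y * w + x, y * w + x + 1))
            if y + 1 < h:
                edges.append((abs(image[y][x] - image[y + 1][x]),
                              y * w + x, (y + 1) * w + x))
    # Visit edges by increasing weight and stop once they exceed alpha.
    for weight, a, b in sorted(edges):
        if weight > alpha:
            break
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    return [[find(parent, y * w + x) for x in range(w)] for y in range(h)]
```

Running the same sorted-edge pass without the early stop and recording each merge together with its weight would yield the nested partitions that the alpha-tree represents.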

Abstract:
Learning intrinsic bias from limited data has been considered the main reason why deepfake detectors fail to generalize. Apart from the previously discovered content and specific-forgery biases, we reveal a novel spatial bias, in which detectors inertly anticipate structural forgery clues appearing at the image center, that can also lead to the poor generalization of existing methods. We present ED4, a simple and effective strategy, to address the aforementioned biases explicitly at the data level in a unified framework rather than through implicit disentanglement via network design. In particular, we develop ClockMix to produce facial-structure-preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted features to be consistent. As a model-agnostic debiasing strategy, ED4 is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches. Code is available at https://github.com/beautyremain/ED4.

Abstract:
Edge sensor devices generate vast amounts of user data, but centralized processing poses privacy risks. Federated Learning addresses this by decentralizing training. However, applying Federated Learning directly to skeleton videos fails to preserve motion dynamics and suffers from client heterogeneity bias. To address these limitations, we propose CSAR—a Client-Unbiased Skeletal Action Recognizer for Federated Learning—which tackles two core challenges: motion dynamics preservation and classifier bias mitigation. Specifically, CSAR employs a Model Calibration Loss during client training to align client-server representations and reduce drift. On the server, it generates class-balanced spatiotemporal federated features through Prototypical Gaussian Sampling, subsequently refined via a Motion-aware Differential Loss to capture kinematic properties. These features enable retraining of a globally debiased recognizer that achieves accuracy comparable to real-data-trained models. Further stabilization is achieved through Knowledge Matching, which enhances global understanding. Experiments under natural and label heterogeneity confirm that CSAR outperforms state-of-the-art methods.

Abstract:
In hashing-based long-tailed image retrieval, the dominance of data-rich head classes often hinders the learning of effective hash codes for data-poor tail classes due to inherent long-tailed bias. Interestingly, this bias also contains valuable prior knowledge by revealing inter-class dependencies, which can be beneficial for hash learning. However, previous methods have not thoroughly analyzed these tangled negative and positive effects of long-tailed bias from a causal inference perspective. In this paper, we propose a novel hash framework that employs causal inference to disentangle detrimental bias effects from beneficial ones. To capture good bias in long-tailed datasets, we construct hash mediators that conserve valuable prior knowledge from class centers. Furthermore, we propose a de-biased hash loss to enhance the beneficial bias effects while mitigating adverse ones, leading to more discriminative hash codes. Specifically, this loss function leverages the beneficial bias captured by hash mediators to support accurate class label prediction, while mitigating harmful bias by blocking its causal path to the hash codes and refining predictions through backdoor adjustment. Extensive experimental results on four widely used datasets demonstrate that the proposed method improves retrieval performance over state-of-the-art methods by large margins. The source code is available at https://github.com/IMAG-LuJin/CIH

Abstract:
Few-shot fine-tuning of pre-trained vision-language models (VLMs) for downstream tasks has gained widespread attention for reducing data annotation efforts while maintaining high performance. However, we observe that VLMs excel at excluding most incorrect classes in fine-grained recognition tasks but struggle with a small set of confusing categories, which are typically highly similar subspecies. Existing few-shot fine-tuning methods attempt to directly recognize the correct category among all predefined classes, limiting their ability to capture discriminative features for those confusing categories. This raises an intriguing question: Can we specifically extract useful information from confusing classes to enhance fine-grained recognition performance? Based on this insight, we propose a hierarchical few-shot fine-tuning framework to address the severe confusion problem while ensuring interpretability, namely Attribute-Decoupled Discriminator (AttrDD). Instead of thinking once among all classes, AttrDD employs a two-stage recognition, “think through” then “think smart”. Specifically, in the first phase, a representative VLM, CLIP, is fine-tuned to select the Top-K confusing classes. In the second phase, we leverage the knowledge of large language models (LLMs) to generate fixed-format descriptions of attribute differences between these confusing classes via in-context learning. Attribute-decoupled classifications are then conducted to capture fine-grained discriminative features. To achieve parameter-efficient fine-tuning, we introduce a lightweight attention adapter for each phase to align image features with task-specific textual features and LLM-generated textual features. Extensive experiments on 9 fine-grained recognition benchmarks demonstrate that AttrDD consistently outperforms existing baselines by wide margins.
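
The two-stage control flow described above can be sketched in a few lines: a first stage keeps only the Top-K highest-scoring classes, and a second stage re-scores just those candidates. The function name `two_stage_predict`, the dictionary-based scores, and the `fine_scorer` callback are hypothetical stand-ins; in the paper the two stages are a fine-tuned CLIP and attribute-decoupled classification, neither of which is reproduced here.

```python
def two_stage_predict(coarse_scores, fine_scorer, k=3):
    """Select Top-K candidates by coarse score, then decide among them.

    coarse_scores: dict mapping class name -> stage-one score.
    fine_scorer: callable that takes the candidate list and returns a
        dict mapping each candidate -> stage-two score (hypothetical).
    Returns the predicted class and the candidate list.
    """
    # Stage one: keep the k most likely (and hence most confusable) classes.
    candidates = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:k]
    # Stage two: re-score only the candidates with the finer discriminator.
    fine_scores = fine_scorer(candidates)
    prediction = max(candidates, key=lambda c: fine_scores[c])
    return prediction, candidates
```

Restricting the second stage to the candidates is what lets the finer discriminator focus on differences between similar classes rather than on the full label set.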

Abstract:
Referring camouflaged object detection (Ref-COD) is a recently proposed task, aiming to segment specified camouflaged objects by leveraging visual reference, i.e., a small set of referring images with salient target objects. Ref-COD poses a considerable challenge due to the difficulty of discerning camouflaged objects from their highly similar backgrounds, as well as the significant feature differences between the camouflaged objects and the provided visual reference. To tackle the above dilemma, we propose a novel uncertainty-aware transformer for the Ref-COD task, termed UAT. UAT first utilizes a cross-attention mechanism to align and integrate visual reference to guide camouflaged feature learning, and then models dependencies between patches in a probabilistic manner to learn predictive uncertainty and excavate discriminative camouflaged features. Specifically, we first design a referring feature aggregation (RFA) module to align and incorporate referring features with camouflaged features, guiding targeted specific feature learning within the feature space of camouflaged images. Then, to enhance multi-level feature extraction, we develop a cross-attention encoder (CAE) to integrate global information and multi-scale semantics between adjacent layers to excavate critical camouflage cues. More importantly, we propose a transformer probabilistic decoder (TPD) to model the dependencies between patches as Gaussian random variables to capture uncertainty-aware camouflaged features. Extensive experiments on the golden Ref-COD benchmark demonstrate the superiority of UAT over existing state-of-the-art competitors. The proposed UAT also achieves competitive performance on several conventional COD datasets, further demonstrating its scalability. The source code is available at https://github.com/CVL-hub/UAT

Abstract:
Learned Image Compression (LIC) has experienced rapid growth with the emergence of diverse frameworks. However, the variability in model design and training datasets poses a challenge for the universal application of a single coding model. To address this problem, this paper introduces a pioneering multi-model image coding framework that integrates various image codecs to overcome these limitations. By dynamically allocating codecs to different image regions, our framework optimizes reconstruction quality within the constraints of limited bitrate and decoding time, offering a high-performance, ubiquitous solution for the rate-distortion-complexity trade-off. Our framework features a detailed codec assignment algorithm based on the Simulated Annealing (SA) method, selected for its proven efficacy in managing the discrete and intricate nature of codec assignment optimization. We have implemented a coarse-to-fine strategy, which significantly enhances efficiency. Notably, our framework maintains compatibility with all standard image codecs without necessitating structural modifications. Empirical results indicate that our framework establishes a new standard in LIC, advancing the Pareto frontier for performance-complexity trade-offs. It achieves a significant 70% reduction in decoding time compared to current state-of-the-art methods, without compromising reconstruction quality. Furthermore, under comparable conditions, our approach not only outperforms but significantly eclipses existing Rate-Distortion-Complexity (RDC) optimized codecs, with decoding speeds up to 30 times faster.
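
As a minimal sketch of how simulated annealing can drive a discrete codec assignment, the code below assigns one codec to each image region so as to minimise total distortion under a decoding-time budget. The cost tables, the penalty weight, the linear cooling schedule, and the single-region move are all assumptions for illustration; the abstract does not give the paper's objective or its coarse-to-fine strategy.

```python
import math
import random


def anneal_assignment(distortion, time_cost, budget, steps=2000, t0=1.0, seed=0):
    """Assign one codec per region by simulated annealing.

    distortion[r][c], time_cost[r][c]: hypothetical costs of codec c on region r.
    budget: maximum total decoding time; violations are penalised.
    Returns the best assignment found and its objective value.
    """
    rng = random.Random(seed)
    n_regions, n_codecs = len(distortion), len(distortion[0])

    def objective(assign):
        d = sum(distortion[r][c] for r, c in enumerate(assign))
        t = sum(time_cost[r][c] for r, c in enumerate(assign))
        return d + 1000.0 * max(0.0, t - budget)  # large over-budget penalty

    # Start from the fastest codec in every region.
    current = [min(range(n_codecs), key=lambda c: time_cost[r][c])
               for r in range(n_regions)]
    cur_val = objective(current)
    best, best_val = current[:], cur_val
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        cand = current[:]
        cand[rng.randrange(n_regions)] = rng.randrange(n_codecs)  # local move
        val = objective(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if val <= cur_val or rng.random() < math.exp((cur_val - val) / temp):
            current, cur_val = cand, val
            if cur_val < best_val:
                best, best_val = current[:], cur_val
    return best, best_val
```

Folding the time budget into the objective as a penalty keeps every move valid, which is one simple way to handle constraints in annealing; a hard feasibility check would be an equally plausible alternative.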

Abstract:
Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving images of different modalities, including MRI, CT, ultrasound, and colonoscopy. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code and segmentation maps will be released at https://github.com/taozh2017/UCSeg

Abstract:
Camouflaged object detection (COD) aims to discover objects that are seamlessly embedded in the environment. Existing COD methods have made significant progress by typically representing features in a discrete way with arrays of pixels. However, limited by discrete representation, these methods need to align features of different scales during decoding, which causes some subtle discriminative clues to become blurred. This is a huge blow to the task of identifying camouflaged objects from clear subtle clues. To address this issue, we propose a novel continuous feature representation network (CFRN), which aims to represent features of different scales as a continuous function for COD. Specifically, a Swin transformer encoder is first exploited to explore the global context between camouflaged objects and the background. Then, an object-focusing module (OFM) deployed layer by layer is designed to deeply mine subtle discriminative clues, thereby highlighting the body of camouflaged objects and suppressing other distracting objects at different scales. Finally, a novel frequency-based implicit feature decoder (FIFD) is proposed, which directly decodes the predictions at arbitrary coordinates in the continuous function with implicit neural representations, thus propagating clearer discriminative clues. Extensive experiments on four challenging COD benchmarks demonstrate that our method significantly outperforms state-of-the-art methods. The source code will be available at https://github.com/SongZeHNU/CFRN

Abstract:
Human-Object Interaction (HOI) detection, as a foundational task in human-centric understanding, aims to detect interactive triplets in real-world scenarios. To better distinguish diverse HOIs within an open-world context, current HOI detectors utilize pre-trained Visual-Language Models (VLMs) to extract prior knowledge through textual prompts (i.e., descriptive texts for each HOI instance). However, relying on predetermined descriptive texts, such approaches only acquire a fixed set of textual knowledge for HOI prediction, consequently resulting in inferior performance and limited generalization. To remedy this, we propose a novel VLM-based method, which jointly performs prompting learning from both visual and textual perspectives and synergizes visual-textual prompting for HOI detection. Initially, we design a hierarchical adaptation architecture to perform progressive prompting: visual prompting is facilitated through gradual token migration from VLM’s image encoder, while textual prompting is initialized with progressively leveled interaction descriptions. In addition, to synergize the visual-textual prompting learning, a text-supervising and image-tuning loop is introduced, in which the text-supervising stage guides visual prompting learning through contrastive learning and the image-tuning stage refines textual prompting by modal matching. Finally, we employ an interaction-aware knowledge merging mechanism to effectively transfer visual-textual knowledge encapsulated within synergistic prompting for HOI detection. Extensive experiments on two benchmarks demonstrate that our proposed method outperforms the state-of-the-art ones, under both supervised and zero-shot settings.

Abstract:
Efficiently compressing HD/UHD content has long been challenging due to high bitrate costs. Instance-adaptive enhancement methods try to tackle this issue by compressing a video at reduced resolution and enhancing it using a neural model specifically overfitted for this video. However, existing methods focus solely on spatial super-resolution (SR) and under-utilize the videos’ temporal redundancy. Their limited management of the model’s updated parameters also causes excessive overfitting overheads. Therefore, this paper introduces IASTE, the first instance-adaptive enhancement method based on spatial-temporal enhancement (STE), and incorporates low-rank adaptation (LoRA) for efficient model overfitting. Specifically, we downscale videos spatially and temporally to reduce the data volume and achieve efficient video compression. Then, we overfit a specific STE model for each video and use it to enhance the decoded video’s spatiotemporal resolution. Leveraging the video swin transformer’s strong capability in capturing spatiotemporal correlations, we design a lightweight and efficient model to implement video STE. The model is overfitted for each video using LoRA. By freezing the pre-trained model and selectively updating a few low-rank matrices, the bitrate overhead for model storage can be mitigated. Experiments prove that compared to directly compressing high-frame-rate (HFR), high-resolution (HR) videos, our method achieves around 30% BD-Rate gains on the CTC and UVG datasets, about 15% gains on the YoutubeUGC dataset, and about 10% gains on the ultra-long videos in the Xiph dataset.
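
The storage argument behind low-rank adaptation can be illustrated with a small sketch: the pre-trained weight `W` stays frozen and only two thin matrices `A` and `B` are trained, so the per-video overhead scales with the rank `r` rather than with the full weight size. This is generic LoRA on a single linear layer written in plain Python; the function names and the `scale` parameter are illustrative assumptions, and nothing here reflects the specific layers adapted in IASTE.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W: d_out x d_in frozen pre-trained weight.
    A: r x d_in and B: d_out x r are the only trainable matrices.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def trainable_params(d_out, d_in, r):
    """Number of parameters that must be stored per adapted layer."""
    return r * (d_in + d_out)
```

For a 1024 x 1024 layer with rank 8, `trainable_params` gives 16,384 values to store instead of the 1,048,576 entries of the full weight. Initialising `B` to zeros makes the adapted layer start out identical to the frozen one.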

Abstract:
When dealing with low-quality source images, existing image fusion methods either fail to handle degradations or are restricted to specific degradations. This study proposes an unsupervised unified degradation-robust image fusion network, termed URFusion, in which various types of degradations can be uniformly eliminated during the fusion process, leading to high-quality fused images. URFusion is composed of three core modules: intrinsic content extraction, intrinsic content fusion, and appearance representation learning and assignment. It first extracts degradation-free intrinsic content features from images affected by various degradations. These content features then provide feature-level rather than image-level fusion constraints for optimizing the fusion network, effectively eliminating degradation residues and reliance on ground truth. Finally, URFusion learns the appearance representation of images and assigns the statistical appearance representation of high-quality images to the content-fused result, producing the final high-quality fused image. Extensive experiments on multi-exposure image fusion and multi-modal image fusion tasks demonstrate the advantages of URFusion in fusion performance and suppression of multiple types of degradations. The code is available at https://github.com/hanna-xu/URFusion

Abstract:
Image Super-Resolution (SR) has seen remarkable progress with the emergence of transformer-based architectures. However, due to the high computational cost, many existing transformer-based SR methods limit their attention to local windows, which hinders their ability to model long-range dependencies and global structures. To address these challenges, we propose a novel SR framework named Semantic-Driven Global-Local Fusion Transformer (SGLFT). The proposed model enhances the receptive field by combining a Hybrid Window Transformer (HWT) and a Scalable Transformer Module (STM) to jointly capture local textures and global context. To further strengthen the semantic consistency of reconstruction, we introduce a Semantic Extraction Module (SEM) that distills high-level semantic priors from the input. These semantic cues are adaptively integrated with visual features through an Adaptive Feature Fusion Semantic Integration Module (AFFSIM). Extensive experiments on standard benchmarks demonstrate the effectiveness of SGLFT in producing visually faithful and structurally consistent SR results. The code will be available at https://github.com/kbzhang0505/SGLFT.

Abstract:
This paper presents a new domain-specific representation learning method, exponential dissimilarity-dispersion family (EDDF), a novel distribution family that includes a dissimilarity function and a global dispersion parameter. In generative models, variational autoencoders (VAEs) have a solid theoretical foundation based on variational inference in visual representation learning and are also used as one of the core components of other generative models. This paper addresses the issue where conventional VAEs, with the commonly adopted Gaussian settings, tend to experience performance degradation in generative modeling for high-dimensional data. This degradation is often caused by their excessively limited model family. To tackle this problem, we propose EDDF, a new domain-specific method introducing a novel distribution family with a dissimilarity function and a global dispersion parameter. A decoder using this family employs dissimilarity functions for the evidence lower bound (ELBO) reconstruction loss, leveraging domain-specific knowledge to enhance high-dimensional data modeling. We also propose an ELBO optimization method for VAEs with EDDF decoders that implicitly approximates the stochastic gradient of the normalizing constant using log-expected dissimilarity. Empirical evaluations of the generative performance show the effectiveness of our model family and proposed method. Our framework can be integrated into any VAE-based generative model in representation learning. The code and model are available at https://github.com/ganmodokix/eddf-vae

Abstract:
In recent years, deep learning-based methods have made significant progress on the image quality assessment problem; however, challenges remain arising from the lack of annotated, real-world training data and consequent poor generalization ability. Towards addressing these challenges, we propose a no-reference image quality assessment (NR-IQA) method based on generative AI (GenAI) images. Specifically, we use GenAI images as reference images, employing a cold diffusion model to generate distorted images of four different distortion types, and we label these distorted images using a full-reference model, thereby making it possible to construct a large-scale pre-training dataset. We use this resource generation method to facilitate NR-IQA model building. We deploy a Multi-scale Cross Attention Block (MCAB) and a Scale Simple Attention Module (SSAM) to enhance feature representation by extracting multi-scale feature information from both the channel and spatial dimensions that are predictive of image quality. Extensive experiments on eight public databases demonstrate that the proposed method achieves state-of-the-art (SOTA) performance. A public release of all the codes associated with this work will be made available on GitHub.

Abstract:
Light field (LF) imaging, which captures both intensity and directional information of light rays, extends the capabilities of traditional imaging techniques. In this paper, we introduce a task in the field of LF imaging, sparse-to-dense inbetweening, which focuses on generating dense novel views from sparse multi-view LFs. By synthesizing intermediate views from sparse inputs, this task enhances LF view synthesis in two ways: it fills in inter-perspective gaps within an expanded field of view, and it increases data robustness by leveraging complementary information between light rays from different perspectives. Existing approaches are limited in both respects by non-robust single-view synthesis and an inability to handle sparse inputs effectively. To address these challenges, we construct a high-quality multi-view LF dataset, consisting of 60 indoor scenes and 59 outdoor scenes. Building upon this dataset, we propose a baseline method. Specifically, we introduce an adaptive alignment module to dynamically align information by capturing relative displacements. Next, we explore angular consistency and hierarchical information using a multi-level feature decoupling module. Finally, a multi-level feature refinement module is applied to enhance features and facilitate reconstruction. Additionally, we introduce a universally applicable artifact-aware loss function to effectively suppress visual artifacts. Experimental results demonstrate that our method outperforms existing approaches, establishing a benchmark for sparse-to-dense inbetweening. The code is available at https://github.com/Starmao1/MutiLF

Abstract:
Hashing is an effective technique for large-scale image retrieval. However, traditional hashing models typically follow a closed-set assumption, which fails to satisfy the practicality of real-world tasks. In this paper, we explore a meaningful yet overlooked question: is there a hashing paradigm that not only supports rehearsal-free online incremental coding for single-pass data streams but also adapts to potentially expanding concept spaces in open environments? Instead of presetting fixed bit lengths, we suggest adjusting the bit length dynamically based on the number of encountered categories, meanwhile enabling bit extension of existing hash codes to match the adaptive code lengths without knowledge forgetting. Therefore, we propose a Bit-extendable IncremenTal haShing (BITS) method for image retrieval in open environments. Specifically, we identify a blurry incremental setup to better simulate realistic scenarios, revisiting the widely-used data-incremental and class-incremental settings. With this challenging setup, a three-phase framework is designed to efficiently perform incremental hashing, which jointly solves online continual coding and bit extension with adaptive code lengths. Through the well-designed hashing paradigm, BITS achieves comparable performance to offline hashing methods while significantly saving computational resources. Comprehensive experiments on six benchmarks demonstrate the superiority of our BITS in dynamic scenarios. The source code is available at https://github.com/yxinwang/BITS

Abstract:
Maintaining stylistic consistency is crucial for the cohesion and aesthetic appeal of images, a fundamental requirement in effective image editing and inpainting. However, existing methods primarily focus on the semantic control of generated content, often neglecting the critical task of preserving this consistency. In this work, we introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions while ensuring both semantic alignment with user intent and stylistic consistency with the surrounding environment. NSD leverages an advanced diffusion model, incorporating two parallel cross-attention mechanisms that separately process text and style information to achieve the dual objectives of semantic control and style consistency. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module. This module is predicated on the intuitive premise that different regions within a single image share a consistent style, whereas regions from different images exhibit distinct styles. The PSRL module employs a style contrastive loss that encourages high similarity between representations from the same image while enforcing dissimilarity between those from different images. Furthermore, to address the lack of standardized evaluation protocols for this task, we establish a comprehensive benchmark. This benchmark includes competing algorithms, dedicated style-related metrics, and diverse datasets and settings to facilitate fair comparisons. Extensive experiments conducted on our benchmark demonstrate the effectiveness of the proposed framework.
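
The style contrastive objective described above can be illustrated with a small sketch. This is not the authors' PSRL implementation: it assumes an InfoNCE-style formulation over region embeddings, and the names `style_contrastive_loss`, `image_ids`, and `temperature` are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def style_contrastive_loss(embeddings, image_ids, temperature=0.1):
    """Assumed InfoNCE-style loss: regions cropped from the same image are
    positives, regions from different images are negatives."""
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        # Exponentiated similarity of region i to every other region.
        sims = {
            j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            for j in range(n) if j != i
        }
        denom = sum(sims.values())
        for j, s in sims.items():
            if image_ids[j] == image_ids[i]:  # positive pair
                total += -math.log(s / denom)
                count += 1
    return total / count if count else 0.0
```

The loss is low when regions that share an image id have similar embeddings and high when they do not, which matches the premise that style is consistent within an image and distinct across images.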

Abstract:
Deep subspace clustering uses latent features instead of raw images to construct the self-expression coefficient matrix. Existing methods primarily focus on optimizing the self-expression coefficient matrix, often neglecting the impact of latent features. However, better latent features are more in line with the self-representation assumption and result in a better self-expression coefficient matrix, which forms a chain relationship. Based on this chain relationship, this paper proposes a Class Relation Constraint (CRC) induced Deep Subspace Clustering (DSC) method to improve the representation ability of latent features. First, an intra- and inter-class weighted constraint is proposed to enhance latent data separability in subspaces. Then, to further remove negative samples inside a subspace, a contrastive loss function is introduced within the diagonal blocks of the self-expression coefficient matrix, i.e., the same subspace, under the guidance of spectral clustering results. Along with the enhanced representation ability of latent features and the corresponding diagonal blocks, the self-expression coefficient matrix can provide more accurate data relationships for spectral clustering. Experimental results on multiple benchmark datasets have validated the effectiveness of the proposed DSCCRC method, particularly in handling small samples and complex datasets.

Abstract:
Despite the great success of large-scale text-to-image diffusion models in image generation and image editing, existing methods still struggle with editing the layout of real-world images. Although a few works have been developed to address this issue, they either fail to adjust the image layout effectively or encounter challenges in preserving the visual appearance of objects after layout adjustment. To bridge this gap, this paper proposes a novel image layout editing method that not only re-arranges a real-world image to a specified layout, but also ensures that the visual appearance of the objects remains consistent with their original state prior to editing. Concretely, a Multi-Concept Learning scheme is developed to learn the concepts of different objects from a single image, which can be seen as a novel inversion scheme tailored for image layout editing. Then, we leverage the semantic consistency within intermediate features of diffusion models to project the appearance information of objects to the target regions to improve the fidelity of objects after editing. Additionally, a novel initialization noise design is adopted to facilitate the convergence and success rate of re-arranging the layout. The phenomenon of concept entanglement is also analyzed, and resolved by a novel asynchronous editing strategy. Extensive experimental results demonstrate that the proposed method outperforms existing methods in both layout alignment and visual consistency for the task of image layout editing.

Abstract:
Dataset distillation (DD) aims to accelerate the training speed of neural networks (NNs) by synthesizing a reduced dataset. NNs trained on the smaller dataset are expected to obtain almost the same test set accuracy as they do on the larger one. Previous DD research treated the obtained distilled dataset as a regular dataset for training, neglecting the overfitting issue caused by the limited number of original distilled images. In this paper, we propose a new DD paradigm. Specifically, in the deployment stage, distilled images are augmented by amplifying their local information since the teacher network can produce diverse supervision signals when receiving inputs from different regions. Efficient and diverse augmentation methods for each distilled image are devised, while ensuring the authenticity of augmented samples. Additionally, to alleviate the increased training cost caused by data augmentation, we design a bi-directional dynamic dataset pruning technique to prune the original distilled dataset and augmented distilled dataset. A new pruning strategy and scheduling are proposed based on experimental findings. Experiments on 9 benchmark datasets (CIFAR10, CIFAR100, ImageWoof, ImageCat, ImageFruit, ImageNette, ImageNet10, ImageNet100 and ImageNet1K) demonstrate the effectiveness of our approach. For instance, on the ImageNet1K dataset with a ResNet18 architecture and 50 distilled images per class, our algorithm surpasses the second-ranked MiniMax algorithm by 7.6%, achieving a distilled accuracy of 66.2%.

Abstract:
Radiology report generation, which automatically generates diagnostic textual reports from medical images, plays a crucial role in improving clinical efficiency and diagnostic accuracy. However, existing radiology report generation models face numerous challenges, such as lack of interpretability as well as description inaccuracy. To address these issues, we propose an integrated framework that enhances radiology report generation by combining target detection with contextual alignment of relevant region descriptions. Target detection focuses on clinically significant areas within medical images, while contextual alignment ensures that the generated text is directly linked to visual findings. Additionally, we introduce a full-spectrum feature fusion method that combines both high- and low-frequency features from the images. This approach captures details and broader structures, allowing the model to gain a more comprehensive and hierarchical understanding of the images. We validated the effectiveness of our method on the public dataset MIMIC-CXR. The results indicate that our method outperforms previous approaches on multiple evaluation metrics. Notably, in terms of the average of the six traditional metrics, our method (VTAG) achieved a significant improvement of 14.3%, compared to the state-of-the-art model MLRG.

Abstract:
Multi-view clustering typically leverages the consistency and complementarity among views to partition different samples. However, existing deep learning-based methods often face a dilemma between selecting complementary information and capturing essential details: 1) Capturing complementary semantics among views may introduce label-irrelevant redundant information. 2) Extracting only consistent semantic information causes information loss, hindering clarity in downstream tasks. To address these issues, we propose a novel method from the perspective of meta-learning to learn clustering-friendly representations with minimal redundancy. Specifically, we train an information compressor to guide the model in describing the original samples as compactly as possible with minimal information, thus learning the key semantics with minimized redundancy. Meta-learning bi-level optimization promotes the nested optimization of the feature embedding and the information compressor. Meanwhile, a semantic puzzle mechanism complements the semantic fragments by exploiting the relationships between low-level features, resulting in a consensus representation with strong discriminative power. We conducted extensive experiments on datasets of various sizes to validate the effectiveness of our model, demonstrating significant performance improvements over several state-of-the-art methods.

Abstract:
Multi-Modal Image Fusion (MMIF) aims to integrate complementary image information from different modalities to produce informative images. Previous deep learning-based MMIF methods generally adopt Convolutional Neural Networks (CNNs) or Transformers for feature extraction. However, these methods deliver unsatisfactory performances due to the limited receptive field of CNNs and the high computational cost of Transformers. Recently, Mamba has demonstrated a powerful potential for modeling long-range dependencies with linear complexity, providing a promising solution to MMIF. Unfortunately, Mamba lacks full spatial and frequency perceptions, which are very important for MMIF. Moreover, employing Image Reconstruction (IR) as an auxiliary task has been proven beneficial for MMIF. However, a primary challenge is how to leverage IR efficiently and effectively. To address the above issues, we propose a novel framework named Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) for MMIF. More specifically, we first propose a three-branch structure to couple MMIF and IR, which can retain complete contents from source images. Then, we propose the Spatial-Frequency Enhanced Mamba Block (SFMB), which can enhance Mamba in both spatial and frequency domains for comprehensive feature extraction. Finally, we propose the Dynamic Fusion Mamba Block (DFMB), which can be deployed across different branches for dynamic feature fusion. Extensive experiments show that our method achieves better results than most state-of-the-art methods on six MMIF datasets. The source code is available at https://github.com/SunHui1216/SFMFusion

Abstract:
Single spectral image demosaicing for multispectral filter array (MSFA) is an essential task in spectral imaging, aiming to recover a mosaic-free spectral image from its mosaic raw counterpart. Existing deep learning-based methods typically improve the reconstruction performance by indiscriminately stacking CNN-based blocks, failing to effectively handle the intertwined spatio-spectral correlations caused by spatial sub-sampling and spectral aliasing. In this paper, we propose Mosaic Pattern Excavation Transformer (MPEFormer) to achieve better reconstruction by effectively modelling the intertwined spatio-spectral correlations. Specifically, the proposed three-branch model integrates low-frequency information, edge information, and fine high-frequency details essential for spectral image reconstruction, with the third branch serving as the core component. In this branch, we design the Dual Fusion Self-attention Block (DFSAB) and the Mosaic Pattern-guided Spectral Modulation Module (MPSM). DFSAB incorporates the Mosaic Pattern Excavation Self-attention (MPESA) mechanism, which effectively captures non-local spatio-spectral correlations induced by the MSFA pattern distributed across the whole image, thereby enhancing the expressive capability of the model. By dynamically integrating various MSFA pattern-related dependencies, MPSM enables adaptive recalibration of spectral information. Extensive experimental results demonstrate the effectiveness of our MPEFormer, highlighting its greater potential over the state-of-the-art MSFA demosaicing methods. The code will be uploaded at https://github.com/Matsuri247/MPEFormer

Abstract:
Transformer-based RGBT tracking has attracted much attention due to the strong modeling capacity of self attention and cross attention mechanisms. These attention mechanisms utilize the correlations among tokens to construct powerful feature representations, but are easily affected by low-quality tokens. To address this issue, we propose a novel Quality-aware Spatio-temporal Transformer Network (QSTNet), which calculates the quality weights of tokens in search regions based on the correlation with multimodal template tokens to suppress the negative effects of low-quality tokens in spatio-temporal feature representations, for robust RGBT tracking. In particular, we argue that the correlation between search tokens of one modality and multimodal template tokens could reflect the quality of these search tokens, and thus design the Quality-aware Token Weighting Module (QTWM) based on the correlation matrix of search and template tokens to suppress the negative effects of low-quality tokens. Specifically, we calculate the difference matrix derived from the attention matrices of the search tokens from both modalities and the multimodal template tokens, and then assign the quality weight for each search token based on the difference matrix, which reflects the relative correlation of search tokens from different modalities to multimodal template tokens. In addition, we propose the Prompt-based Spatio-temporal Encoder Module (PSEM) to utilize spatio-temporal multimodal information while alleviating the impact of low-quality spatio-temporal features. Extensive experiments on four RGBT benchmark datasets demonstrate that the proposed QSTNet exhibits superior performance compared to other state-of-the-art tracking methods. Our code and supplementary video are now available: https://zhaodongah.github.io/QSTNet

Abstract:
We identify two major limitations in the existing studies on retinal vessel segmentation: 1) Most existing works are restricted to one modality, i.e., the Color Fundus (CF). However, multi-modality retinal images are used every day in the study of the retina and diagnosis of retinal diseases, and the study of vessel segmentation on other modalities is scarce; 2) Even though a few works extended their experiments to new modalities such as the Multi-Color Scanning Laser Ophthalmoscopy (MC), these works still require fine-tuning a separate model for the new modality. The fine-tuning will require extra training data, which is difficult to acquire. In this work, we present a novel universal vessel segmentation model (URVSM) for multi-modality retinal images. In addition to performing the study on a much wider range of image modalities, we also propose a universal model to segment the vessels in all these commonly used modalities. While being much more versatile compared with existing methods, our universal model also demonstrates comparable performance to the state-of-the-art fine-tuned methods. To the best of our knowledge, this is the first work that achieves modality-agnostic retinal vessel segmentation and the first to study retinal vessel segmentation in several novel modalities (Code, model and 3 new retinal vessel segmentation datasets are available at https://github.com/JRC-VPLab/URVSM).

Abstract:
In passive polarization imaging, the degree and the angle of linear polarization images are representations of the polarization content in the scene that can be used to detect small polarized objects in a largely randomly polarized surrounding. The polarized signal is often near the noise limit of a photon detector (as in CCD and CMOS cameras) and sensitivity to polarization deteriorates further when the source imagery is under-exposed. This work aims to increase the robustness to sensor noise by estimating the Cartesian coordinates of the degree and angle of linear polarization—a notion we refer to as “Stokes simplex.” The proposed Stokes Simplex Polarimetric Image Denoising (SSPID) algorithm is the minimum mean squared error estimation of the noise-free Stokes simplex vectors in the wavelet domain from the Poisson corrupted analyzer images. Benchmarking against the state-of-the-art polarization image denoising methods on a newly acquired division-of-time (DoT) polarimetric data shows superior performance.
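
As a worked illustration of the quantities named above, the sketch below computes the linear Stokes parameters, the degree and angle of linear polarization, and their Cartesian form. It does not reproduce SSPID or any denoising step. It assumes four analyzer images at 0°, 45°, 90°, and 135° (the abstract does not state the analyzer angles), and it reads "Cartesian coordinates" as the normalized pair (S1/S0, S2/S0); the function names are hypothetical.

```python
import math

def stokes_from_analyzers(i0, i45, i90, i135):
    """Linear Stokes parameters from four analyzer intensities
    (assumed analyzer angles: 0, 45, 90, 135 degrees)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp

def cartesian_coordinates(s0, s1, s2):
    """Polar pair (DoLP, 2*AoLP) expressed in Cartesian form."""
    return s1 / s0, s2 / s0
```

Working in the Cartesian form avoids the square root and arctangent, which are the steps where noise is amplified when the polarized signal is weak; this is a general property of the parameterization, not a claim about the specific estimator in the paper.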

Abstract:
Hyperspectral unmixing aims to decompose mixed pixels into pure spectra and calculate their corresponding fractional abundances. It holds a critical position in hyperspectral image processing. Traditional model-based unmixing methods use convex optimization to iteratively solve the unmixing problem with hand-crafted regularizers. However, their performance is limited by these manually designed constraints, which may not fully capture the structural information of the data. Recently, deep learning-based unmixing methods have shown remarkable capability for this task, but they have limited generalizability and lack interpretability. In this paper, we propose a novel hyperspectral unmixing method regularized by a diffusion model (URDM) to overcome these shortcomings. Our method leverages the advantages of both conventional optimization algorithms and deep generative models. Specifically, we formulate the unmixing objective function from a variational perspective and integrate it into a diffusion sampling process to introduce generative priors from a denoising diffusion probabilistic model (DDPM). Since the original objective function is challenging to optimize, we introduce a splitting-based strategy to decouple it into simpler subproblems. Extensive experimental results on both synthetic and real datasets demonstrate the efficiency and superior performance of our proposed method.
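
To make the terms "pure spectra" and "fractional abundances" concrete, the following sketch shows the forward mixing step that unmixing seeks to invert. It assumes the standard linear mixing model with non-negative abundances that sum to one; the abstract does not state which mixing model URDM uses, and `mix_pixel` is a hypothetical helper.

```python
def mix_pixel(endmembers, abundances, tol=1e-9):
    """Forward linear mixing model (assumed): a pixel spectrum is the
    abundance-weighted sum of the endmember (pure) spectra.

    endmembers: list of K spectra, each a list of B band values
    abundances: list of K fractions, non-negative and summing to one
    """
    if any(a < 0 for a in abundances):
        raise ValueError("abundances must be non-negative")
    if abs(sum(abundances) - 1.0) > tol:
        raise ValueError("abundances must sum to one")
    bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(bands)
    ]
```

Unmixing runs in the opposite direction: given only mixed pixels, it estimates both the endmember spectra and the abundances, which is why regularizers or learned priors are needed.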

Abstract:
Convolutional neural networks (CNNs) can automatically learn data patterns to express face images for facial expression recognition (FER). However, they may ignore the effect of facial segmentation on FER. In this paper, we propose a perception CNN for FER, termed PCNN. First, PCNN uses five parallel networks to simultaneously learn local facial features based on the eyes, cheeks, and mouth, so as to sensitively capture subtle changes for FER. Second, we utilize a multi-domain interaction mechanism to register and fuse local sense organ features and global facial structural features to better express face images for FER. Finally, we design a two-phase loss function to constrain the accuracy of the obtained sense information and the reconstructed face images, guaranteeing the performance of the obtained PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB and Occlusion and Pose Variant Dataset. Its code is available at https://github.com/hellloxiaotian/PCNN

Affiliations: Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia; Bionic Vision System Laboratory and the State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China; PCA Laboratory, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Research Center for Industries of the Future and the School of Engineering, Westlake University, Hangzhou, China

Abstract:
The Vision Transformer (ViT) has achieved remarkable success in computer vision due to its powerful token mixer, which effectively captures global dependencies among all tokens. However, the quadratic complexity of standard self-attention with respect to the number of tokens severely hampers its computational efficiency in practical deployment. Although recent hybrid approaches have sought to combine the strengths of convolutions and self-attention to improve the performance–efficiency trade-off, the costly pairwise token interactions and heavy matrix operations in conventional self-attention remain a critical bottleneck. To overcome this limitation, we introduce S2AFormer, an efficient Vision Transformer architecture built around a novel Strip Self-Attention (SSA) mechanism. Our design incorporates lightweight yet effective Hybrid Perception Blocks (HPBs) that seamlessly fuse the local inductive biases of CNNs with the global modeling capability of Transformer-style attention. The core innovation of SSA lies in simultaneously reducing the spatial resolution of the key ( K ) and value ( V ) tensors while compressing the channel dimension of the query ( Q ) and key ( K ) tensors. This joint spatial-and-channel compression dramatically lowers computational cost without sacrificing representational power, achieving an excellent balance between accuracy and efficiency. We extensively evaluate S2AFormer on a wide range of vision tasks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection/instance segmentation (COCO). Experimental results consistently show that S2AFormer delivers substantial accuracy improvements together with superior inference speed and throughput across both GPU and non-GPU platforms, establishing it as a highly competitive solution in the landscape of efficient Vision Transformers.
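
The effect of the joint spatial-and-channel compression can be estimated with simple arithmetic. The sketch below counts only the multiply-accumulate operations of the two attention matrix products and ignores the projection layers and the softmax. The reduction factors and the function name are hypothetical; the abstract does not give the actual compression ratios used in SSA.

```python
def attention_macs(n_tokens, channels, kv_spatial_reduction=1, qk_channel_reduction=1):
    """Multiply-accumulate count for the two matrix products in attention.

    kv_spatial_reduction: factor by which the number of K/V tokens shrinks
    qk_channel_reduction: factor by which the Q/K channel width shrinks
    Both factors are illustrative assumptions, not values from the paper.
    """
    n_kv = n_tokens // kv_spatial_reduction
    d_qk = channels // qk_channel_reduction
    score_macs = n_tokens * n_kv * d_qk      # Q @ K^T
    value_macs = n_tokens * n_kv * channels  # attention weights @ V
    return score_macs + value_macs

full = attention_macs(196, 64)
compressed = attention_macs(196, 64, kv_spatial_reduction=4, qk_channel_reduction=4)
print(full, compressed, full / compressed)
```

With 196 tokens and 64 channels, reducing the K/V tokens fourfold and the Q/K channels fourfold lowers the count from 4,917,248 to 768,320 under these assumptions. Spatial reduction of K and V shrinks both products, while channel compression of Q and K shrinks only the score computation.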

Abstract:
Diffusion probabilistic models (DPMs) have recently achieved remarkable results in computer vision. Inspired by the success of DPMs, we present TrajDiff, a model based on conditional diffusion probabilistic models for agent future trajectory prediction, which infers the agent's future states through a series of stochastic iterative denoising processes. Specifically, we map the trajectory prediction task into the latent heatmap space, translating hard keypoint prediction into soft cluster center learning. The core architecture is a U-shaped encoder-decoder network (U-Net) that is trained with a denoising objective. During inference, conditioned on the observed past trajectory heatmaps, pure Gaussian noise is initialized to drive the reverse sampling process. The U-Net iteratively removes various levels of Gaussian noise from the initialized images, resembling Langevin dynamics, and generates multi-modal predicted future trajectory heatmaps. Furthermore, we introduce a novel residual block with a mutual attention mechanism that considers the interactions between the agent and the surrounding environment at multiple scales, assisting in generating physically and socially acceptable trajectories. We verify TrajDiff on the Stanford Drone Dataset and the ETH and UCY Datasets. The experimental results show that TrajDiff outperforms previous state-of-the-art methods with considerable accuracy gains, while significantly reducing computational requirements.

Affiliations: School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, China; College of Computer Science and Technology, Qingdao University, Qingdao, China; School of Computer Science, Wuhan University, Wuhan, China; Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; School of Computing and Communications, Lancaster University, Lancaster, U.K.; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China

Abstract:
The CLIP visual feature-based image captioning models have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision–language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision–language semantic alignment. We show that by collaborating RVE and RL via the novel RDT—and by gradually adding and removing noise in the diffusion process—more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that our proposed RDT surpasses existing state-of-the-art image captioning models in the literature. The code is publicly available at: https://github.com/junwan2014/RDT

Abstract:
Recent advancements in deep learning, particularly through Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have led to significant progress in nighttime image deraining. However, current architectures still struggle to strike an optimal balance between computational efficiency and restoration performance. Moreover, existing methods often fail to fully exploit the intrinsic characteristics of low-light conditions and inadequately model the interaction between rain and illumination. To overcome these challenges, we propose NDMamba, a dual-prior-guided state-space model that addresses nighttime deraining by incorporating degradation cues related to both lighting and rain distribution. Inspired by the Retinex theory, which suggests that rain streak distribution is influenced by the reflectance component of a scene, we propose a Prior Extraction Module (PEM) to jointly model lighting conditions and rain degradation. Furthermore, we design a Prior-Guided Mamba Block (PGMB), which comprises a Lighting-Adaptive Vision State-Space Module (LVSSM) that incorporates illumination priors, and a Rain Distribution Guidance Module (RDGM) to enhance local features in a more refined manner. Extensive experiments demonstrate that NDMamba outperforms state-of-the-art methods on both synthetic and real-world benchmark datasets. Our code is publicly available at https://github.com/tandaily/NDMamba

Abstract:
Although Transformers have achieved significant success in low-level vision tasks, they are constrained by computing self-attention with quadratic complexity and limited-size windows. This limitation results in the lack of a global receptive field across the entire image. Recently, State Space Models (SSMs) have gained widespread attention due to their global receptive field and linear complexity with respect to input length. However, integrating SSMs into low-level vision tasks presents two major challenges: 1) Relationship degradation between long-range tokens, i.e., a long-range forgetting problem, caused by encoding high-resolution images pixel by pixel. 2) Significant redundancy in the existing multi-direction scanning strategy. To address these challenges, we propose Hi-Mamba for image super-resolution (SR), which unfolds the image with only a single scan. Specifically, the Global Hierarchical Mamba Block (GHMB) enables token interactions across the entire image, providing a global receptive field while leveraging a multi-scale structure to facilitate long-range dependency learning. Additionally, the Direction Alternation Module (DAM) adjusts the scanning patterns of GHMB across different layers to enhance spatial relationship modeling. Extensive experiments demonstrate that our Hi-Mamba achieves 0.2–0.27 dB PSNR gains on the Urban100 dataset across different scaling factors compared to the state-of-the-art MambaIRv2 for SR. Moreover, our lightweight Hi-Mamba also outperforms the lightweight SRFormer by 0.39 dB PSNR for ×2 SR.

Abstract:
With the development of face forgery technology, fake faces are rampant, threatening security and authenticity in many fields. Therefore, it is of great significance to study face forgery detection. Existing detection methods are deficient in the comprehensiveness of feature extraction and in model adaptability, and they struggle to handle complex and changeable forgery scenarios accurately. The rise of multimodal models provides new insights for forgery detection. However, most such methods use relatively simple text prompts to describe the difference between real and fake faces, ignoring that the CLIP model itself does not have the relevant knowledge of forgery detection. Therefore, this paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and the forgery model is fused. On the other hand, through alignment distillation, the student model can learn the visual abnormal patterns and semantic features of the forged samples captured by the teacher model. Specifically, we extract the features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and use forgery detection datasets to pretrain and fine-tune the Deepfake-V2-Model to enhance its detection ability; together, these are regarded as the teacher model. At the same time, the visual and language patterns of the teacher model are aligned with the visual patterns of the pretrained student model, and the aligned representations are distilled into the student model. This not only combines the rich representation of the CLIP image encoder and the excellent generalization ability of text embedding, but also enables the original model to effectively acquire relevant knowledge for forgery detection. Experiments show that our method effectively improves the performance on face forgery detection.

Abstract:
Portrait shadow removal is a challenging task due to the complex surface of the face. Although existing work in this field makes substantial progress, these methods tend to overlook information in the background areas. However, this background information not only contains some important illumination cues but also plays a pivotal role in achieving lighting harmony between the face and the background after shadow elimination. In this paper, we propose a Context-aware Illumination Restoration Network (CIRNet) for portrait shadow removal. Our CIRNet consists of three stages. First, the Coarse Shadow Removal Network (CSRNet) mitigates the illumination discrepancies between shadow and non-shadow areas. Next, the Area-aware Shadow Restoration Network (ASRNet) predicts the illumination characteristics of shadowed areas by utilizing background context and non-shadow portrait context as references. Lastly, we introduce a Global Fusion Network to adaptively merge contextual information from different areas and generate the final shadow removal result. This approach leverages the illumination information from the background region while ensuring a more consistent overall illumination in the generated images. Our approach can also be extended to high-resolution portrait shadow removal and portrait specular highlight removal. Besides, we construct the first real facial shadow dataset for portrait shadow removal, consisting of 6200 pairs of facial images. Qualitative and quantitative comparisons demonstrate the advantages of our proposed dataset as well as our method.

Abstract:
Cross-view geo-localization aims to match the same geographic location across images from different views, e.g., drone-view images and geo-referenced satellite-view images. Due to UAV cameras’ different shooting angles and heights, the scale of the same captured target building in the drone-view images varies greatly. Meanwhile, different geographic locations in the real world, such as towers and stadiums, differ in size and floor area, which also leads to scale variations of geographic targets in the images. However, existing methods mainly focus on extracting the fine-grained information of the geographic targets or the contextual information of the surrounding area, overlooking features that are robust to scale changes and the importance of feature alignment. In this study, we argue that the key underpinning of this task is to train a network to mine a discriminative representation that is robust to scale variations. To this end, we design an effective and novel end-to-end network called Self-Adaptive Feature Extraction Network (Safe-Net) to extract powerful scale-invariant features in a self-adaptive manner. Safe-Net includes a global representation-guided feature alignment module and a saliency-guided feature partition module. The former applies an affine transformation guided by the global feature for adaptive feature alignment. Without extra region annotations, the latter computes the saliency distribution for different regions of the image and uses the saliency information to guide a self-adaptive feature partition on the feature map, learning a visual representation that is robust to scale variations. Experiments on two prevailing large-scale aerial-view geo-localization benchmarks, i.e., University-1652 and SUES-200, show that the proposed method achieves state-of-the-art results. In addition, our proposed Safe-Net has a significant scale-adaptive capability and can extract robust feature representations for query images with small target buildings. The source code of this study is available at: https://github.com/AggMan96/Safe-Net.

Abstract:
Enlarging input images is a straightforward and effective approach to improve small object detection. However, simple image enlargement is expensive in both computation and GPU memory. In fact, small objects are usually sparsely distributed and locally clustered. Therefore, massive feature extraction computations are wasted on the non-target background area of images. Recent works have tried to pick out target-containing regions using an extra network and perform conventional object detection, but the newly introduced computation limits their final performance. In this paper, we propose to reuse the detector’s backbone to conduct feature-level object-seeking and patch-slicing, which can avoid redundant feature extraction and reduce the computation cost. Combined with a sparse detection head, we are able to detect small objects on high-resolution inputs (e.g., 1080P or larger) for superior performance. The resulting Efficient Small Object Detection (ESOD) approach is a generic framework, which can be applied to both CNN- and ViT-based detectors to reduce computation and GPU memory costs. Extensive experiments demonstrate the efficacy and efficiency of our method. In particular, our method consistently surpasses the SOTA detectors by a large margin (e.g., 8% gains on AP) on the representative VisDrone, UAVDT, and TinyPerson datasets. Code will be made public soon.

Abstract:
In the field of semi-supervised skeleton action recognition, existing work primarily follows the paradigm of self-supervised training followed by supervised fine-tuning. However, self-supervised learning focuses on exploring data representation rather than label classification. Inspired by Mean Teacher, we explore a novel pseudo-label-based model called SkeleMoCLR. Specifically, we use MoCo v2 as the foundation and extend it into a teacher-student network through a momentum encoder. The generation of high-confidence pseudo-labels requires a well-pretrained model as a prerequisite. In cases where large-scale skeleton data is lacking, we propose leveraging contrastive learning to transfer discriminative action features from large vision-text models to the skeleton encoder. Following the contrastive pre-training, the key encoder branch from MoCo v2 serves as the teacher to generate pseudo-labels for training the query encoder branch. Furthermore, we introduce pseudo-labels into the memory queues, sampling negative samples from different pseudo-label classes to maximize the representation differentiation between different categories. We jointly optimize the classification loss for both labeled and pseudo-labeled data and the contrastive loss for unlabeled data to update model parameters, fully harnessing the potential of pseudo-label semi-supervised learning and self-supervised learning. Extensive experiments conducted on the NTU-60, NTU-120, PKU-MMD, and NW-UCLA datasets demonstrate that our SkeleMoCLR outperforms existing competitive methods in the semi-supervised skeleton action recognition task.
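
The teacher-student mechanics described above can be illustrated with a minimal sketch. It is not the SkeleMoCLR implementation: weights are plain lists of floats, and the momentum value (0.999) and confidence threshold (0.9) are assumptions chosen for illustration, since the abstract gives no values. It shows only the two generic ingredients, a momentum (exponential moving average) update of the teacher and confidence-thresholded pseudo-labels.

```python
import math


def ema_update(teacher, student, momentum=0.999):
    """Move teacher weights toward the student by an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits, threshold=0.9):
    """Return the teacher's predicted class if it is confident enough, else None."""
    probs = softmax(teacher_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None
```

In this sketch, unlabeled samples for which `pseudo_label` returns `None` would contribute only to the contrastive loss, while confident ones would also contribute to the classification loss.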

Abstract:
Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input using a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.

Abstract:
Existing fine-grained visual categorization (FGVC) methods assume that the fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view object-centric images, but can face great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoints (e.g., object re-identification, remote sensing). In such scenarios, mis- or over-activation of features is likely to confuse the part selection and degrade the fine-grained representation. In this paper, we are motivated to design a universal FGVC framework for real-world scenarios. More precisely, we propose concept guided learning (CGL), which models the concepts of a certain fine-grained category as a combination of inherited concepts from its subordinate coarse-grained category and discriminative concepts of its own. The discriminative concepts are utilized to guide the fine-grained representation learning. Specifically, three key steps are designed, namely, concept mining, concept fusion, and concept constraint. In addition, to bridge the FGVC dataset gap under scene-centric and adverse viewpoint scenarios, a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples is proposed. Extensive experiments show that the proposed CGL: 1) has a competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes and scene-centric street scenes; 3) generalizes well to object re-identification and fine-grained aerial object detection. The dataset and source code will be available at https://github.com/BiQiWHU/CGL.

Abstract:
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects’ attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model’s success or failure.

Abstract:
Reconstructing visual stimuli from functional Magnetic Resonance Imaging (fMRI) enables fine-grained retrieval of brain activity. However, the accurate reconstruction of diverse details, including structure, background, texture, color, and more, remains challenging. Stable diffusion models inevitably introduce variability in the reconstructed images, even under identical conditions. To address this challenge, we first uncover the neuroscientific perspective of diffusion methods, which primarily involve top-down creation using pre-trained knowledge from extensive image datasets, but tend to lack detail-driven bottom-up perception, leading to a loss of faithful details. In this paper, we propose NeuralDiffuser, which incorporates primary visual feature guidance to provide detailed cues in the form of gradients. This extension of the bottom-up process for diffusion models achieves both semantic coherence and detail fidelity when reconstructing visual stimuli. Furthermore, we have developed a novel guidance strategy for reconstruction tasks that ensures the consistency of repeated outputs with original images rather than with various outputs. Extensive experimental results on the Natural Scenes Dataset (NSD) qualitatively and quantitatively demonstrate the advancement of NeuralDiffuser by comparing it against baseline and state-of-the-art methods horizontally, as well as conducting longitudinal ablation studies. Code is available at https://github.com/HaoyyLi/NeuralDiffuser.

Abstract:
Multi-head attention (MA), which allows the model to jointly attend to crucial information from diverse representation subspaces through its heads, has achieved remarkable results in image captioning. However, there is no explicit mechanism to ensure that MA attends to appropriate positions in diverse subspaces, resulting in overfocused attention for each head and redundancy between heads. In this paper, we propose a novel Intra- and Inter-Head Orthogonal Attention (I2OA) to efficiently improve MA in image captioning by introducing a concise orthogonal regularization to heads. Specifically, Intra-Head Orthogonal Attention enhances the attention learning of MA by introducing an orthogonal constraint to each head, which decentralizes the object-centric attention to more comprehensive content-aware attention. Inter-Head Orthogonal Attention reduces redundancy between heads by applying an orthogonal constraint between heads, which enlarges the diversity of representation subspaces and improves the representation ability of MA. Moreover, the proposed I2OA can be flexibly combined with various multi-head attention based image captioning methods and improves their performance without increasing model complexity or parameters. Experiments on the MS COCO dataset demonstrate the effectiveness of the proposed model.
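
To make the idea of an orthogonal constraint between heads concrete, here is a minimal sketch of one common form of such a penalty: the sum of squared cosine similarities between the attention vectors of distinct heads. The abstract does not give the exact I2OA formulation, so the function name, the choice of cosine similarity, and the representation of each head as a flat list of attention weights are all assumptions. Only the inter-head part is sketched.

```python
import math


def inter_head_orthogonal_penalty(heads):
    """Sum of squared cosine similarities over all pairs of distinct heads.

    `heads` is a list of attention vectors, one per head (hypothetical layout).
    The penalty is 0 when all heads are mutually orthogonal.
    """
    unit = []
    for h in heads:
        norm = math.sqrt(sum(v * v for v in h)) or 1.0  # guard against all-zero heads
        unit.append([v / norm for v in h])
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty
```

Two heads attending to disjoint positions incur no penalty, while two identical heads incur the maximum penalty of 1 for that pair, which is the redundancy such a regularizer is meant to discourage.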

Abstract:
Sonar imagery is substantially degraded by speckle noise, making the task of despeckling crucial for improving image quality. Self-supervised despeckling methods, represented by blind-spot networks (BSNs), have shown promise in this regard. However, these methods consistently face significant challenges due to the spatial correlation of speckle noise and the inherent information loss within BSNs. In this paper, we introduce SEGSID, a BSN-based, semantic-guided sonar despeckling framework designed to address these challenges. Specifically, the SEGSID framework primarily comprises a Receptive Field Augmentation (RFA) module and a Global Semantic Enhancement (GSE) module. To address the noise spatial correlation, the RFA module is crafted to strategically extract valuable local information while avoiding the exploitation of noise-correlated pixels. Concurrently, the GSE module extracts the global semantic information from entire images and injects it into the extracted local features. This enhances BSNs’ ability to harness more comprehensive image information and compensates for their inherent information loss. Furthermore, to bolster efficiency, we employ knowledge distillation techniques to transfer the expertise from the trained SEGSID into a more streamlined network suitable for broader practical applications. Extensive experiments on three distinct sonar datasets demonstrate that SEGSID outperforms both traditional despeckling methods and state-of-the-art self-supervised despeckling techniques. The implementation is publicly accessible at https://github.com/deng-ai-lab/SEGSID.

Abstract:
Text-guided style transfer aims to repaint a content image with the target style described by a text prompt, offering greater flexibility and creativity compared to traditional image-guided style transfer. Despite the potential, existing text-guided style transfer methods often suffer from many issues, including insufficient visual quality, poor generalization ability, or a reliance on large amounts of paired training data. To address these limitations, we leverage the inherent strengths of transformers in handling multimodal data and propose a novel transformer-based framework called TRTST that not only achieves unpaired arbitrary text-guided style transfer but also significantly improves the visual quality. Specifically, TRTST explores combining a text transformer encoder with an image transformer encoder to project the input text prompt and content image into a joint embedding space and extract the desired style and content features. These features are then input into a multimodal co-attention module to stylize the image sequence based on the text sequence. We also propose a new adaptive parametric positional encoding (APPE) scheme which can adaptively produce different positional encodings to optimally match different inputs with a position encoder. In addition, to further improve content preservation, we introduce a text-guided identity loss to our model. Extensive results and comparisons are conducted to demonstrate the effectiveness and superiority of our method.

Abstract:
Cloth-changing person re-identification is a subject closer to the real world, which focuses on solving the problem of person re-identification after pedestrians change clothes. The primary challenge in this field is to overcome the complex interplay between intra-class and inter-class variations and to identify features that remain unaffected by changes in appearance. Sufficient data collection for model training would significantly aid in addressing this problem. However, it is challenging to gather diverse datasets in practice. Current methods focus on implicitly learning identity information from the original image or introducing additional auxiliary models, which are largely limited by the quality of the image and the performance of the additional model. To address these issues, inspired by prompt learning, we propose a novel multiple information prompt learning (MIPL) scheme for cloth-changing person ReID, which learns identity robust features through the common prompt guidance of multiple messages. Specifically, the clothing information stripping (CIS) module is designed to decouple the clothing information from the original RGB image features to counteract the influence of clothing appearance. The bio-guided attention (BGA) module is proposed to increase the learning intensity of the model for key information. A dual-length hybrid patch (DHP) module is employed to make the features have diverse coverage to minimize the impact of feature bias. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods on the LTCC, Celeb-reID, Celeb-reID-light, and CSCC datasets, achieving rank-1 scores of 74.8%, 73.3%, 66.0%, and 88.1%, respectively. When compared to AIM (CVPR23), ACID (TIP23), and SCNet (MM23), MIPL achieves rank-1 improvements of 11.3%, 13.8%, and 7.9%, respectively, on the PRCC dataset.

Abstract:
Active Domain Adaptation (ADA) improves knowledge transfer efficiency from the labeled source domain to the unlabeled target domain by selecting a few target samples to label. However, most existing active sampling methods ignore the local uncertainty of neighbors in the target domain, making it easier to pick out anomalous samples that are detrimental to the model. To address this problem, we present a new approach to active domain adaptation called Local Uncertainty Energy Transfer (LUET), which integrates active learning of local uncertainty confusion and energy transfer alignment constraints into a unified framework. First, in the active learning module, uncertain, difficult, and representative samples from the target domain are selected through local uncertainty energy selection and entropy-weighted class confusion selection. The active learning strategy based on local uncertainty energy avoids selecting anomalous samples in the target domain. Second, for the discrimination issue caused by domain shift, we use a global and local energy-transfer alignment constraint module to eliminate the domain gap and improve accuracy. Finally, we use the negative log-likelihood loss for supervised learning on the source domain and the query samples. With the introduction of sample-based energy metrics, the active learning strategy is more closely coupled with domain alignment. Experiments on multiple domain adaptation datasets demonstrate that our LUET achieves outstanding results and outperforms existing state-of-the-art approaches.
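
As a rough illustration of an energy-based uncertainty score, the sketch below computes the standard free energy of a classifier's logits and queries the samples with the highest energy. This is a generic energy score, not the LUET selection rule: the neighborhood-based "local" term and the entropy-weighted class confusion term from the abstract are omitted, and the function names, the temperature parameter, and the top-k selection are assumptions.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free energy -T * log(sum(exp(z / T))); lower values mean a more confident prediction."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def select_queries(batch_logits, budget):
    """Indices of the `budget` samples with the highest free energy (most uncertain)."""
    order = sorted(
        range(len(batch_logits)),
        key=lambda i: free_energy(batch_logits[i]),
        reverse=True,
    )
    return order[:budget]
```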

Abstract:
In hyperspectral images (HSIs), different land cover (LC) classes have distinct reflective characteristics at various wavelengths. Therefore, relying on only a few bands to distinguish all LC classes often leads to information loss, resulting in poor average accuracy. To address this problem, we propose a method called Cascaded Spatial Cross-Attention Network (CSCANet) for HSI classification. We design a cascaded spatial cross-attention module, which first performs cross-attention on local and global features in the spatial context, then uses a group cascade structure to sequentially propagate important spatial regions within the different channels, and finally obtains joint attention features to improve the robustness of the network. Moreover, we also design a two-branch feature separation structure based on spatial-spectral features to separate different LC Tokens as much as possible, thereby improving the distinguishability of different LC classes. Extensive experiments demonstrate that our method achieves excellent performance in enhancing classification accuracy and robustness. The source code can be obtained from https://github.com/WUTCM-Lab/CSCANet.

Affiliations: Research Institute of Trustworthy Autonomous Systems and the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Institute of High-Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Fusionopolis Wy, Singapore; Research Institute of Trustworthy Autonomous Systems, the Department of Computer Science and Engineering, and Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Southern University of Science and Technology, Shenzhen, China

Abstract:
The morphologies of vessel-like structures, such as blood vessels and nerve fibres, play significant roles in disease diagnosis, e.g., Parkinson’s disease. Although deep network-based refinement segmentation and topology-preserving segmentation methods recently have achieved promising results in segmenting vessel-like structures, they still face two challenges: 1) existing methods often have limitations in rehabilitating subsection ruptures in segmented vessel-like structures; 2) they are typically overconfident in predicted segmentation results. To tackle these two challenges, this paper attempts to leverage the potential of spatial interconnection relationships among subsection ruptures from the structure rehabilitation perspective. Based on this perspective, we propose a novel Vessel-like Structure Rehabilitation Network (VSR-Net) to both rehabilitate subsection ruptures and improve the model calibration based on coarse vessel-like structure segmentation results. VSR-Net first constructs subsection rupture clusters via a Curvilinear Clustering Module (CCM). Then, the well-designed Curvilinear Merging Module (CMM) is applied to rehabilitate the subsection ruptures to obtain the refined vessel-like structures. Extensive experiments on six 2D/3D medical image datasets show that VSR-Net significantly outperforms state-of-the-art (SOTA) refinement segmentation methods with lower calibration errors. Additionally, we provide quantitative analysis to explain the morphological difference between the VSR-Net’s rehabilitation results and ground truth (GT), which are smaller compared to those between SOTA methods and GT, demonstrating that our method more effectively rehabilitates vessel-like structures.

Abstract:
Principal Component Analysis (PCA) is one of the most important unsupervised dimensionality reduction algorithms, but its use of the squared $\ell_2$-norm makes it very sensitive to outliers. Improved versions based on the $\ell_1$-norm alleviate this problem, but they have other shortcomings, such as optimization difficulties or a lack of rotational invariance. Besides, existing methods only vaguely divide normal samples and outliers to improve robustness, but they ignore the fact that normal samples can be more specifically divided into positive samples and hard samples, which should have different contributions to the model because positive samples are more conducive to learning the projection matrix. In this paper, we propose a novel Data Subdivision Based Dual-Weighted Robust Principal Component Analysis, namely DRPCA, which first designs a mark vector to distinguish normal samples from outliers, and directly removes outliers according to mark weights. Moreover, we further divide normal samples into positive samples and hard samples by self-constrained weights, and place them in relative positions, so that the weight of positive samples is larger than that of hard samples, which makes the projection matrix more accurate. Additionally, the optimal mean is employed to obtain a more accurate data center. To solve this problem, we carefully design an effective iterative algorithm and analyze its convergence. Experiments on real-world and RGB large-scale datasets demonstrate the superiority of our method in dimensionality reduction and anomaly detection.
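
The effect of per-sample weights on PCA can be seen in a small sketch. It is not the DRPCA algorithm: the weights here are fixed by hand rather than learned, there is no positive/hard subdivision, and the function name and the use of power iteration are assumptions made to keep the example dependency-free. It computes the first principal direction from a weighted mean and a weighted covariance.

```python
def weighted_first_component(points, weights, iters=200):
    """First principal direction under per-sample weights, found by power iteration."""
    total = sum(weights)
    dim = len(points[0])
    # Weighted mean: samples with weight 0 do not shift the data center.
    mean = [sum(w * p[d] for p, w in zip(points, weights)) / total for d in range(dim)]
    # Weighted covariance matrix.
    cov = [
        [
            sum(w * (p[a] - mean[a]) * (p[b] - mean[b]) for p, w in zip(points, weights)) / total
            for b in range(dim)
        ]
        for a in range(dim)
    ]
    v = [1.0] * dim
    for _ in range(iters):
        nv = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in nv) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in nv]
    return v


# Five inliers on the x-axis plus one outlier far along the y-axis.
data = [(-2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 100.0)]
plain = weighted_first_component(data, [1, 1, 1, 1, 1, 1])   # dominated by the outlier
robust = weighted_first_component(data, [1, 1, 1, 1, 1, 0])  # outlier weight set to 0
```

With uniform weights the single outlier pulls the leading direction toward the y-axis; assigning it weight 0, as a binary mark vector would, recovers the x-axis direction of the inliers.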

Abstract:
Feature matching is a fundamental step widely employed in computer vision applications. This paper introduces a novel and efficacious method named Grid-guided Sparse Laplacian Consensus, rooted in the concept of smooth constraints. To address challenging scenes such as severe deformation and independent motions, we devise grid-based adaptive matching guidance to construct multiple transformations based on motion coherence. Specifically, we obtain a set of precise yet sparse seed correspondences through motion statistics, facilitating the generation of an adaptive number of candidate correspondence sets. In addition, we propose an innovative formulation grounded in the graph Laplacian for correspondence pruning, wherein mapping function estimation is formulated as a Bayesian model. We solve this model using the EM algorithm, with the seed correspondences as initialization for optimal convergence. Sparse approximation is leveraged to reduce the time and space burden. A comprehensive set of experiments is conducted to demonstrate the superiority of our method over other state-of-the-art methods in robustness to serious deformations, generalizability to various descriptors, and generalizability to multiple motions. Additionally, experiments in geometric estimation, image registration, loop closure detection, and visual localization highlight the significance of our method across diverse scenes for high-level tasks.

Abstract:
Point cloud primitive instance segmentation is critical for understanding the geometric shapes of man-made objects. Existing learning-based methods mainly focus on learning high-dimensional feature representations of points and further perform clustering or region growing to obtain corresponding primitive instances. However, these features generally cannot accurately represent the discriminability between instances, especially near the boundaries or in regions with small differences in geometric properties. This limitation often leads to over- or under-segmentation of geometric primitives. On the other hand, the boundaries of different primitives are the direct features that distinguish them and thus utilizing boundary information to guide feature learning and clustering is crucial for this task. In this paper, we propose a novel framework BGPSeg for point cloud primitive instance segmentation that utilizes boundary-guided feature extraction and clustering. Specifically, we first introduce a boundary-guided feature extractor with the additional input of a boundary probability map, which utilizes boundary-guided sampling and a boundary transformer to enhance feature discrimination among points crossing geometric boundaries. Furthermore, we propose a boundary-guided primitive clustering module, which combines boundary clues and geometric feature discrimination for clustering to further improve the segmentation performance. Finally, we demonstrate the effectiveness of our BGPSeg with a series of comparison and ablation experiments while achieving the state-of-the-art primitive instance segmentation. Our code is available at https://github.com/fz-20/BGPSeg.

Abstract:
Zero-shot learning (ZSL) focuses on recognizing unseen categories by aligning visual features with semantic information. Recent advancements have shown that aligning each attribute with its corresponding visual region significantly improves zero-shot learning performance. However, the crude semantic proxies used in these methods fail to capture the varied appearances of each attribute, and are also easily confused by the presence of semantically redundant backgrounds, leading to suboptimal alignment. To combat these issues, we introduce a novel Alignment-Enhanced Network (AENet), designed to denoise the visual features and dynamically perceive semantic information, thus enhancing visual-semantic alignment. Our approach comprises two key innovations. (1) A visual denoising encoder, employing a class-agnostic mask to filter out semantically redundant visual information, thus producing refined visual features adaptable to unseen classes. (2) A dynamic semantic generator that crafts content-aware semantic proxies adaptively, steered by visual features, enabling AENet to discriminate fine-grained variations in visual contents. Additionally, we integrate a cross-fusion module to ensure comprehensive interaction between the denoised visual features and the generated dynamic semantic proxies, further facilitating visual-semantic alignment. Through extensive experiments across three datasets, the proposed method demonstrates that it narrows down the visual-semantic gap and sets a new benchmark in this setting.

Abstract:
The geometric alterations in the iris’s appearance are intricately linked to the gaze direction. However, current deep appearance-based gaze estimation methods mainly rely on latent feature sharing to leverage iris features for improving deep representation learning, often neglecting the explicit modeling of their geometric relationships. To address this issue, this paper revisits the physiological structure of the eyeball and introduces a set of geometric assumptions, such as “the normal vector of the iris center approximates the gaze direction”. Building on these assumptions, we propose an Iris Geometric Transformation Guided Gaze estimation (IGTG-Gaze) module, which establishes an explicit geometric parameter sharing mechanism to link gaze direction and sparse iris landmark coordinates directly. Extensive experimental results demonstrate that IGTG-Gaze seamlessly integrates into various deep neural networks, flexibly extends from sparse iris landmarks to dense eye mesh, and consistently achieves leading performance in both within- and cross-dataset evaluations, all while maintaining end-to-end optimization. These advantages highlight IGTG-Gaze as a practical and effective approach for enhancing deep gaze representation from appearance.

Affiliations: Yangtze Delta Region Academy, Beijing Institute of Technology (Jiaxing), Jiaxing, China; State Key Laboratory of CNS/ATM and the MIIT Key Laboratory of Complex-Field Intelligent Sensing, Beijing Institute of Technology, Beijing, China; School of Materials Science and Engineering, Beijing Institute of Technology, Beijing, China; Department of Stomatology, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China; Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China

Abstract:
Accurate oral disease segmentation is a challenging task, for three major reasons: 1) The same type of oral disease varies in size, color and texture; 2) The boundary between oral lesions and their surrounding mucosa is not sharp; 3) There is a lack of public large-scale oral disease segmentation datasets. To address these issues, we first report an oral disease segmentation network termed Oralformer, which can handle multiple oral diseases. Specifically, we use a parallel design to combine local-window self-attention (LWSA) with channel-wise convolution (CWC), modeling cross-window connections to enlarge the receptive fields while maintaining linear complexity. Meanwhile, we connect these two branches with bi-directional interactions to form a basic parallel Transformer block, namely the LC-block. We insert the LC-block as the main building block in a U-shape encoder-decoder architecture to form Oralformer. Second, we introduce an uncertainty-driven self-adaptive loss function which can reinforce the network’s attention on the lesion’s edge regions that are easily confused, thus improving the segmentation accuracy of these regions. Third, we construct a large-scale oral disease segmentation (ODS) dataset containing 2602 image pairs. It covers three common oral diseases (dental plaque, calculus and caries) and all age groups, which we hope will advance the field. Extensive experiments on six challenging datasets show that our Oralformer achieves state-of-the-art segmentation accuracy, and presents advantages in terms of generalizability and real-time segmentation efficiency (35 fps). The code and ODS dataset will be publicly available at https://github.com/LintaoPeng/Oralformer.

Abstract:
As a common model compression technique, network pruning is widely used to reduce storage and computational cost of deep models in the resource-constrained regime. However, most current pruning methods are designed for high-level vision tasks, with few developed for low-level vision tasks. We observed that the norm-based pruning criterion, originally designed for high-level vision tasks, is highly unsuitable for low-level image denoising networks. This difference arises because image denoising networks pursue distinct feature granularities and goals compared to typical high-level vision tasks. To address this issue, we propose a novel filter evaluation method, termed High-Frequency Components Pruning (HFCP), specifically tailored for image denoising network pruning. HFCP assesses filter importance based on high-frequency components. To the best of our knowledge, this is the first pruning method designed specifically for image denoising tasks, straightforward and applicable to various types of noise. Furthermore, HFCP enhances the pruned model’s high-frequency information content with high reliability and interpretability. This facilitates the network’s ability to distinguish high-frequency signals from noise. We comprehensively analyzed multiple image denoising networks and validated HFCP’s effectiveness across four mainstream networks.
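
One simple way to quantify the high-frequency content of a filter's output is the share of spectral energy above a frequency cutoff. The sketch below is an illustrative proxy only, not the HFCP criterion: the abstract does not specify how high-frequency components are measured or how the score maps to a pruning decision, so the function name, the cutoff value, and the naive 2-D DFT are all assumptions. It is intended for small feature maps.

```python
import cmath


def high_frequency_ratio(fmap, cutoff=0.25):
    """Fraction of spectral energy at normalized frequencies above `cutoff`.

    `fmap` is a 2-D list of floats (one filter's feature map). Frequencies are
    expressed in cycles per sample, so the valid range is 0 to 0.5.
    """
    h, w = len(fmap), len(fmap[0])
    high = total = 0.0
    for u in range(h):
        for v in range(w):
            # Naive 2-D DFT coefficient at frequency index (u, v).
            coeff = sum(
                fmap[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            energy = abs(coeff) ** 2
            # Fold indices so that each frequency magnitude lies in [0, 0.5].
            fu = min(u, h - u) / h
            fv = min(v, w - v) / w
            total += energy
            if max(fu, fv) > cutoff:
                high += energy
    return high / total if total else 0.0


flat = [[1.0] * 4 for _ in range(4)]                                       # only a DC component
checker = [[(-1.0) ** (x + y) for x in range(4)] for y in range(4)]        # highest frequency
```

A constant map scores close to 0 and a checkerboard close to 1. Filters could then be ranked by this score, with the ranking direction and pruning ratio left as design choices that the abstract does not state.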

Abstract:
Existing pose-invariant face recognition mainly focuses on frontal or profile faces, whereas high-pitch angle face recognition, prevalent in surveillance videos, has yet to be investigated. More importantly, tilted faces significantly differ from frontal or profile faces in the latent feature space due to self-occlusion, thus seriously affecting key feature extraction for face recognition. In this paper, we asymptotically reshape challenging high-pitch angle faces into a series of small-angle approximate frontal faces and exploit a statistical approach to learn texture features to ensure accurate facial component generation. In particular, we design a statistical texture-guided GAN for tilted face frontalization (STG-GAN) consisting of three main components. First, the face encoder extracts shallow features, followed by the face statistical texture modeling module that learns multi-scale face texture features based on the statistical distributions of the shallow features. Then, the face decoder performs feature deformation guided by the face statistical texture features while highlighting the pose-invariant face discriminative information. In addition to the multi-scale content loss, identity loss and adversarial loss, we further develop a pose contrastive loss on latent space features to constrain pose consistency and make the face frontalization process more reliable. On this basis, we propose a divide-and-conquer strategy, using STG-GAN to progressively synthesize faces with small pitch angles in multiple stages to achieve frontalization gradually. A unified end-to-end training across multiple stages facilitates the generation of numerous intermediate results to achieve a reasonable approximation of the ground truth. Extensive qualitative and quantitative experiments on multiple face datasets demonstrate the superiority of our approach.

Abstract:
Existing localization methods commonly employ vision to perceive the scene and achieve localization in GNSS-denied areas, yet they often struggle in environments with complex lighting conditions, dynamic objects or privacy-preserving areas. Humans possess the ability to describe various scenes using natural language, effectively inferring their location by leveraging the rich semantic information in these descriptions. Harnessing language presents a potential solution for robust localization. Thus, this study introduces a new task, Language-driven Localization, and proposes a novel localization framework, LangLoc, which determines the user’s position and orientation through textual descriptions. Given the diversity of natural language descriptions, we first design a Spatial Description Generator (SDG), foundational to LangLoc, which extracts and combines the position and attribute information of objects within a scene to generate uniformly formatted textual descriptions. SDG eliminates the ambiguity of language, detailing the spatial layout and object relations of the scene, providing a reliable basis for localization. With generated descriptions, LangLoc effortlessly achieves language-only localization using a text encoder and a pose regressor. Furthermore, LangLoc can add one image to the text input, achieving mutual optimization and feature adaptive fusion across modalities through two modality-specific encoders, cross-modal fusion, and multimodal joint learning strategies. This enhances the framework’s capability to handle complex scenes, achieving more accurate localization. Extensive experiments on the Oxford RobotCar, 4-Seasons, and Virtual Gallery datasets demonstrate LangLoc’s effectiveness in both language-only and visual-language localization across various outdoor and indoor scenarios. Notably, LangLoc achieves noticeable performance gains when using both text and image inputs in challenging conditions such as overexposure, low lighting, and occlusions, showcasing its superior robustness.

Abstract:
Current event-based video reconstruction methods, limited to the spatial domain, face challenges in decoupling brightness and structural information, leading to exposure distortion, and in efficiently acquiring non-local information without relying on computationally expensive Transformer models. To address these issues, we propose the Deep Spatial-Frequency Unfolding Reconstruction Network (DSFURNet), which explores and utilizes knowledge in the frequency domain for event-based video reconstruction. Specifically, we construct a variational model and propose three regularization terms: a brightness regularization term approximated by Fourier amplitudes, a structural regularization term approximated by Fourier phases, and an initialization regularization term that converts event representations into initial video frames. Then, we design corresponding spatial-frequency domain approximation operators for each regularization term. Benefiting from the global nature of computations in the frequency domain, the designed approximation operators can integrate local spatial and global frequency information at a lower computational cost. Furthermore, we combine the learned knowledge of the three regularization terms and unfold the optimization algorithm into an iterative deep network. Through this approach, the pixel-level initialization regularization constraint and the frequency domain brightness and structural regularization constraints can continuously play a role during the testing process, achieving a gradual improvement in the quality of the reconstructed video frames. Compared to existing methods, our network significantly reduces the number of network parameters while improving evaluation metrics.
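The pairing of brightness with Fourier amplitude and structure with Fourier phase can be illustrated on a toy signal. The sketch below is not DSFURNet's operator; it assumes a hypothetical 1-D pixel row and a naive DFT, and only shows that a global brightness gain rescales the amplitudes while leaving the phases unchanged.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def amplitude_and_phase(signal):
    """Split the spectrum into per-frequency amplitude and phase lists."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

# Hypothetical pixel row and a uniformly brighter copy (gain of 2).
row = [0.1, 0.4, 0.9, 0.3, 0.2, 0.7, 0.5, 0.1]
brighter = [2.0 * v for v in row]

amp, pha = amplitude_and_phase(row)
amp_b, pha_b = amplitude_and_phase(brighter)

# The gain shows up only in the amplitudes; the phases stay the same.
amplitude_ratio = [b / a for a, b in zip(amp, amp_b)]
phase_shift = [abs(b - a) for a, b in zip(pha, pha_b)]
```

Here `amplitude_ratio` is 2 at every frequency and `phase_shift` is zero, which is the separation that amplitude-based and phase-based regularization terms rely on.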

Abstract:
Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.

Abstract:
Nonnegative CANDECOMP/PARAFAC (CP) factorization of incomplete tensors is a powerful technique for finding meaningful and physically interpretable latent factor matrices to achieve nonnegative tensor completion. However, most existing nonnegative CP models rely on manually predefined tensor ranks, which introduces uncertainty and leads the models to overfit or underfit. Although CP models within the probabilistic framework can estimate rank better, they lack the ability to learn nonnegative factors from incomplete data. In addition, existing approaches tend to focus on point estimation and ignore uncertainty estimation. To address these issues within a unified framework, we propose a fully Bayesian treatment of nonnegative tensor completion with automatic rank determination. Benefitting from the Bayesian framework and the hierarchical sparsity-inducing priors, the model can provide uncertainty estimates of nonnegative latent factors and effectively obtain low-rank structures from incomplete tensors. Additionally, the proposed model can mitigate problems of parameter selection and overfitting. For model learning, we develop two fully Bayesian inference methods for posterior estimation and propose a hybrid computing strategy that significantly reduces the time overhead for large-scale data. Extensive simulations on synthetic data demonstrate that our model can recover missing data with high precision and automatically estimate CP rank from incomplete tensors. Moreover, results from real-world applications demonstrate that our model is superior to state-of-the-art methods in image and video inpainting. The code is available at https://github.com/zecanyang/BNTC.

Abstract:
Dataset distillation techniques have revolutionized the way of utilizing large datasets by compressing them into smaller, yet highly effective subsets that preserve the original datasets’ accuracy. However, while these methods have proven effective in reducing data size and training times, the robustness of these distilled datasets against adversarial attacks remains underexplored. This vulnerability poses significant risks, particularly in security-sensitive applications. To address this critical gap, we introduce DD-RobustBench, a novel and comprehensive benchmark specifically designed to evaluate the adversarial robustness of distilled datasets. Our benchmark is the most extensive of its kind and integrates a variety of dataset distillation techniques, including recent advancements such as TESLA, DREAM, SRe2L, and D4M, which have shown promise in enhancing model performance. DD-RobustBench also rigorously tests these datasets against a diverse array of adversarial attack methods to ensure broad applicability. Our evaluations cover a wide spectrum of datasets, including but not limited to, the widely used ImageNet-1K. This allows us to assess the robustness of distilled datasets in scenarios mirroring real-world applications. Furthermore, our detailed quantitative analysis investigates how different components involved in the distillation process, such as data augmentation, downsampling, and clustering, affect dataset robustness. Our findings provide critical insights into which techniques enhance or weaken the resilience of distilled datasets against adversarial threats, offering valuable guidelines for developing more robust distillation methods in the future. Through DD-RobustBench, we aim not only to benchmark but also to push the boundaries of dataset distillation research by highlighting areas for improvement and suggesting pathways for future innovations in creating datasets that are not only compact and efficient but also secure and resilient to adversarial challenges. The implementation details and essential instructions are available on DD-RobustBench.

Abstract:
In-Context Learning (ICL) aims to understand a new task via a few demonstrations (a.k.a. prompts) and to predict new inputs without tuning the model. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for a query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models. However, altering the prompt fusion method significantly impacts its performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, to improve visual in-context learning. Specifically, we first use a pixel-level retrieval method to select a suitable prompt, then use different prompt fusion methods to activate diverse knowledge stored in the large-scale vision model, and finally ensemble the prediction results obtained from the different prompt fusion methods to obtain the final prediction. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF has outperformed the meta-learning-based method OSLSM in 1-shot segmentation for the first time. This indicates the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
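The final ensembling step can be pictured with a small sketch. The abstract does not state how the predictions are combined, so the pixel-wise majority vote below is an assumption, and the three binary masks stand in for hypothetical outputs obtained with three different prompt fusion arrangements.

```python
def ensemble_masks(masks):
    """Pixel-wise majority vote over equally sized binary masks (assumed rule)."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in masks) > n else 0 for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical predictions from three prompt fusion arrangements.
pred_a = [[1, 0, 1], [0, 0, 1]]
pred_b = [[1, 1, 0], [0, 0, 1]]
pred_c = [[0, 1, 1], [0, 1, 1]]

final_mask = ensemble_masks([pred_a, pred_b, pred_c])
```

A pixel is kept in `final_mask` only when more than half of the individual predictions agree on it, so an error made under a single fusion arrangement is outvoted.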

Abstract:
Recent years have witnessed the remarkable success of the vision-language model in various computer vision tasks. However, how to exploit the semantic language knowledge of the vision-language model to advance real-world stereoscopic image super-resolution remains a challenging problem. This paper proposes a vision-language model-based stereoscopic image super-resolution (VLM-SSR) method, in which the semantic language knowledge in CLIP is exploited to facilitate stereoscopic image SR in a training-free manner. Specifically, by designing visual prompts for CLIP to infer the region similarity, a prompt-guided information aggregation mechanism is presented to capture inter-view information among relevant regions between the left and right views. Besides, driven by the prior knowledge of CLIP, a cognition prior-driven iterative enhancing mechanism is presented to optimize fuzzy regions adaptively. Experimental results on four datasets verify the effectiveness of the proposed method.

Abstract:
Multimodal medical applications have garnered considerable attention due to their potential to offer comprehensive and robust support for medical assistance. Specifically, within this domain, difference-aware medical Visual Question Answering (VQA) has emerged as a topic of increasing interest that enables the recognition of changes in physical conditions over time when compared to previous states and provides customized suggestions accordingly. However, it is challenging because samples usually exhibit characteristics of complexity, diversity, and inherent noise. Besides, there is a need for multimodal knowledge understanding of the medical domain. The difference-aware setting requiring image comparison further intensifies these situations. To this end, we propose a cross-Modal knowlEdge diffusioN-baseD gEneration netwoRk (MENDER), where the diffusion mechanism with multi-step denoising and knowledge injection from global to local level are employed to tackle the aforementioned challenges, respectively. The diffusion process gradually generates answers from the sequence input of questions, random noise for the answer masks and virtual vision prompts of images. The strategy of answer noising and knowledge cascading is specifically tailored for this task and is implemented during forward and reverse diffusion processes. Moreover, visual and structure knowledge injections are proposed to learn virtual vision prompts to guide the diffusion process, where the former is realized using a pre-trained medical image-text network and the latter is modeled with spatial and semantic graph structures processed by heterogeneous graph Transformer models. Experimental results demonstrate the effectiveness of MENDER for difference-aware medical VQA. Furthermore, it also exhibits notable performance in the low-resource setting and conventional medical VQA tasks.

Abstract:
A critical challenge for multi-modal Object Re-Identification (ReID) is the effective aggregation of complementary information to mitigate illumination issues. State-of-the-art methods typically employ complex and highly-coupled architectures, which unavoidably result in heavy computational costs. Moreover, the significant distribution gap among different image spectra hinders the joint representation of multi-modal features. In this paper, we propose a framework named PromptMA to establish effective communication channels between different modality paths, thereby aggregating modal complementary information and bridging the distribution gap. Specifically, we inject a series of learnable multi-modal prompts into the Image Encoder and introduce a prompt exchange mechanism to enable the prompts to alternately interact with different modal token embeddings, thus capturing and distributing multi-modal features effectively. Building on top of the multi-modal prompts, we further propose Prompt-based Token Selection (PBTS) and Prompt-based Modality Fusion (PBMF) modules to achieve effective multi-modal feature fusion while minimizing background interference. Additionally, due to the flexibility of our prompt exchange mechanism, our method is well-suited to handle scenarios with missing modalities. Extensive evaluations are conducted on four widely used benchmark datasets and the experimental results demonstrate that our method achieves state-of-the-art performance, surpassing the current benchmarks by over 15% on the challenging MSVR310 dataset and by 6% on RGBNT201. The code is available at https://github.com/FHR-L/PromptMA

Abstract:
Existing unsupervised salient object detection (USOD) methods usually rely on low-level saliency priors, such as center and background priors, to detect salient objects, resulting in insufficient high-level semantic understanding. These low-level priors can be fragile and lead to failure when the natural images do not satisfy the prior assumptions, e.g., these methods may fail to detect those off-center salient objects causing fragmented objects in the segmentation. To address these problems, we propose to eliminate the dependency on flimsy low-level priors, and extract high-level saliency from natural images through a contrastive learning framework. To this end, we propose a Contrastive Saliency Network (CSNet), which is a prior-free and label-free saliency detector, with two novel modules: 1) a Contrastive Saliency Extraction (CSE) module to extract high-level saliency cues, by mimicking the human attention mechanism within an instance discriminative task through a contrastive learning framework, and 2) a Feature Re-Coordinate (FRC) module to recover spatial details, by calibrating high-level features with low-level features in an unsupervised fashion. In addition, we introduce a novel local appearance triplet (LAT) loss to assist the training process by encouraging similar saliency scores for regions with homogeneous appearances. Extensive experiments show that our approach is effective and outperforms state-of-the-art methods on popular SOD benchmarks.

Abstract:
Snapshot Spectral Imaging (SSI) techniques, with the ability to capture both spectral and spatial information in a single exposure, have been found useful in a wide range of applications. SSI systems generally operate within the ‘encoding-decoding’ framework, leveraging the synergism of optical hardware and reconstruction algorithms. Typically, reconstructing desired spectral images from SSI measurements is an ill-posed and challenging problem. Existing studies utilize either model-based or deep learning-based methods, but both have their drawbacks. Model-based algorithms suffer from high computational costs, while supervised learning-based methods rely on large paired training data. In this paper, we propose a novel Unsupervised range-Nullspace learning (UnNull) prior for spectral image reconstruction. UnNull explicitly models the data via subspace decomposition, offering enhanced interpretability and generalization ability. Specifically, UnNull considers that the spectral images can be decomposed into the range and null subspaces. The features projected onto the range subspace are mainly low-frequency information, while features in the nullspace represent high-frequency information. Comprehensive multispectral demosaicing and reconstruction experiments demonstrate the superior performance of our proposed algorithm.
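The range-nullspace decomposition mentioned above can be written as x = A⁺A x + (I − A⁺A) x for a sensing operator A with pseudo-inverse A⁺. The sketch below is only an illustration of that identity, not the UnNull prior itself: it assumes a hypothetical operator A that averages adjacent sample pairs, for which A⁺ simply copies each measurement back to both positions.

```python
def measure(x):
    """Hypothetical sensing operator A: average each pair of adjacent samples."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]

def pseudo_inverse(y):
    """Pseudo-inverse A+ of the pair-averaging operator: duplicate each value."""
    out = []
    for v in y:
        out.extend([v, v])
    return out

def range_null_split(x):
    """Return (A+ A x, (I - A+ A) x) for the operator defined above."""
    range_part = pseudo_inverse(measure(x))
    null_part = [a - b for a, b in zip(x, range_part)]
    return range_part, null_part

# Toy 1-D "spectral" signal with an even number of samples.
x = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0]
range_part, null_part = range_null_split(x)

# The measurement sees only the range part; the null part is invisible to A.
seen_from_range = measure(range_part)
seen_from_null = measure(null_part)
```

In this toy case the range part is the smooth, pairwise-averaged signal and the null part holds the within-pair detail, which matches the low-frequency versus high-frequency reading in the abstract. Since `measure(null_part)` is all zeros, that detail cannot be read from the measurement and has to come from a prior.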

Abstract:
Deep multi-modal clustering (DMC) aims to improve clustering performance by exploiting abundant information available from multiple modalities. However, different modalities usually have heterogeneous distributions with uneven quality. This may lead to limited performance, especially for contrastive multi-modal clustering, which inevitably performs contrastive learning between high-quality and low-quality modalities. To tackle this challenge, we propose a novel framework named parameter-free deep multi-modal clustering with reliable contrastive learning (PDMC-RCL). Specifically, the reliable contrastive learning quantifies the relationship between contrastive modality pairs with weight values that promote discriminative feature learning from useful modality pairs and slow down or even prevent learning from unreliable modality pairs. Moreover, the reliable contrastive learning is imposed simultaneously at both the feature level and the cluster level in this framework so that the feature representation learning can benefit from multi-level contrastive learning. It is worth noting that our PDMC-RCL method is parameter-free and can achieve promising performance without additional hyperparameter tuning. Experimental results on various datasets show the effectiveness of our method over typical state-of-the-art DMC methods. The source code is available at https://github.com/ShizheHu

Abstract:
Few-shot class-incremental learning (FSCIL) aims to learn from a sequence of incremental data sessions with a limited number of samples in each class. The main issues it encounters are the risk of forgetting previously learned data when introducing new data classes, as well as not being able to adapt the old model to new data due to limited training samples. Existing state-of-the-art solutions normally utilize pre-trained models with fixed backbone parameters to avoid forgetting old knowledge. While this strategy preserves previously learned features, the fixed nature of the backbone limits the model’s ability to learn optimal representations for unseen classes, which compromises performance on new class increments. In this paper, we propose a novel SEssion-Guided Attention framework (SEGA) to tackle this challenge. SEGA exploits the class relationships within each incremental session by assessing how test samples relate to class prototypes. This allows accurate incremental session identification for test data, leading to more precise classifications. In addition, an attention module is introduced for each incremental session to further utilize the feature from the fixed backbone. As the session of the testing image is determined, we can fine-tune the feature with the corresponding attention module to better cluster the sample within the selected session. Our approach adopts the fixed backbone strategy to avoid forgetting the old knowledge while achieving novel data adaptation. Experimental results on three FSCIL datasets consistently demonstrate the superior adaptability of the proposed SEGA framework in FSCIL tasks. The code is available at: https://github.com/zichengpan/SEGA.
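Session identification from class prototypes can be sketched as a nearest-prototype lookup. The abstract does not specify how the relation between a test sample and the prototypes is scored, so the cosine similarity, the function name `identify_session`, and the toy prototypes below are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify_session(feature, prototypes_by_session):
    """Return (session_id, class_name) of the most similar class prototype."""
    best = None
    for session_id, prototypes in prototypes_by_session.items():
        for class_name, proto in prototypes.items():
            score = cosine(feature, proto)
            if best is None or score > best[0]:
                best = (score, session_id, class_name)
    return best[1], best[2]

# Hypothetical prototypes: session 0 is the base session, 1 and 2 are incremental.
prototypes_by_session = {
    0: {"cat": [1.0, 0.0, 0.0], "dog": [0.8, 0.6, 0.0]},
    1: {"owl": [0.0, 1.0, 0.0]},
    2: {"eel": [0.0, 0.0, 1.0]},
}

session, label = identify_session([0.1, 0.9, 0.1], prototypes_by_session)
```

Once `session` is known, a per-session module (the attention module in the abstract) could be selected by that index to refine the backbone feature before the final classification.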

Abstract:
Traditional crowd-counting networks suffer from information loss when feature maps are reduced by pooling layers, leading to inaccuracies in counting crowds at a distance. Existing methods often assume correct annotations during training, disregarding the impact of noisy annotations, especially in crowded scenes. Furthermore, using a fixed Gaussian density model does not account for the pixel distribution that varies with camera distance. To overcome these challenges, we propose a Scale-Aware Crowd Counting Network (SACC-Net) that introduces a scale-aware loss function with error-compensation capabilities for noisy annotations. For the first time, we simultaneously model labeling errors (mean) and scale variations (variance) by spatially varying Gaussian distributions to produce fine-grained density maps for crowd counting. Furthermore, the proposed scale-aware Gaussian density model can be dynamically approximated with a low-rank approximation, leading to improved convergence efficiency with comparable accuracy. To create a smoother scale-aware feature space, this paper proposes a novel Synthetic Fusion Module (SFM) and an Intra-block Fusion Module (IFM) to generate fine-grained heat maps for better crowd counting. The lightweight version of our model, named SACC-LW, enhances the computational efficiency while retaining accuracy. The superiority and generalization properties of the scale-aware loss function are extensively evaluated for different backbone architectures and performance metrics on six public datasets: UCF-QNRF, UCF_CC_50, NWPU, ShanghaiTech A, ShanghaiTech B, and JHU. Experimental results also demonstrate that SACC-Net outperforms all state-of-the-art methods, validating its effectiveness in achieving superior crowd-counting accuracy. The source code is available at https://github.com/Naughty725.
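A density map built from spatially varying Gaussians can be sketched as follows. This is a toy construction rather than the SACC-Net loss: each annotated head is assumed to come with its own centre (which may be shifted to model a labeling error) and its own standard deviation (to model scale), and every Gaussian is normalized to unit mass so that the map still sums to the head count.

```python
import math

def density_map(height, width, heads):
    """Sum of unit-mass Gaussians; heads is a list of (row, col, sigma)."""
    grid = [[0.0] * width for _ in range(height)]
    for r0, c0, sigma in heads:
        kernel = [
            [math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        mass = sum(map(sum, kernel))  # normalize so each head integrates to 1
        for r in range(height):
            for c in range(width):
                grid[r][c] += kernel[r][c] / mass
    return grid

# Hypothetical annotations: a distant head (small sigma), a near head (large
# sigma), and a head whose centre is shifted off the pixel grid.
heads = [(4, 4, 0.8), (12, 10, 2.5), (8.5, 15.2, 1.5)]
dmap = density_map(20, 20, heads)
count = sum(map(sum, dmap))
```

The total `count` equals the number of heads regardless of the individual variances, while a small sigma gives a sharp peak and a large sigma spreads the same mass over more pixels.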

Abstract:
In this paper, we propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals. The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression/head-pose animation. In particular, we propose the Internal Dimension Increase (IDI) based representation, greatly enhancing the fidelity and flexibility in rendering the appearance while maintaining reasonable representation cost. By leveraging strong statistical regularities, the visual signals can be effectively projected into controllable semantics in the three-dimensional space (e.g., mouth motion, eye blinking, head rotation, head translation and head location), which are compressed and transmitted. The editable bitstream, which naturally supports interactivity at the semantic level, can synthesize the face frames via the strong inference ability of the deep generative model. Experimental results have demonstrated the performance superiority and application prospects of our proposed IFVC scheme. In particular, the proposed scheme not only outperforms the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes in terms of rate-distortion performance for face videos, but also enables interactive coding without introducing additional manipulation processes. Furthermore, the proposed framework is expected to shed light on the future design of digital human communication in the metaverse. The project page can be found at https://github.com/Berlin0610/Interactive_Face_Video_Coding

Affiliations: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China; School of Medicine, Yale University, New Haven, CT, USA; Faculty of Information Technology, Monash University, Clayton, VIC, Australia; School of Artificial Intelligence, the National Engineering Research Center for Multimedia Software, the School of Computer Science, and the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, China

Abstract:
High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent investigations find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-trained class-agnostic representation is built from every patch-wise embedding, while fine-grained categories are only determined by several key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, the multi-instance knowledge distillation is implemented on both the region/image crop pairs from the teacher and student net, and the region-image crops inside the teacher / student net, which we term as intra-level multi-instance distillation and inter-level multi-instance distillation. Extensive experiments on several commonly used datasets, including CUB-200-2011, Stanford Cars and FGVC Aircraft, demonstrate that the proposed method outperforms the contemporary methods by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% on both top-1 accuracy and Rank-1 retrieval metric. Source code is available at https://github.com/BiQiWHU/CMD

Abstract:
Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance, which can relieve the heavy burden of radiologists by automatically generating the corresponding radiology reports according to the given radiology image. However, generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases and inherent limitations of radiological imaging, such as low resolution and noise interference. To address these issues, we propose a two-stage framework named Cross-Modal Causal Representation Learning (CMCRL), consisting of the Radiological Cross-modal Alignment and Reconstruction Enhanced (RadCARE) pre-training and the Visual-Linguistic Causal Intervention (VLCI) fine-tuning. In the pre-training stage, RadCARE introduces a degradation-aware masked image restoration strategy tailored for radiological images, which reconstructs high-resolution patches from low-resolution inputs to mitigate noise and detail loss. Combined with a multiway architecture and four adaptive training strategies (e.g., text postfix generation with degraded images and text prefixes), RadCARE establishes robust cross-modal correlations even with incomplete data. In the VLCI phase, we deploy causal front-door intervention through two modules: the Visual Deconfounding Module (VDM) disentangles local-global features without fine-grained annotations, while the Linguistic Deconfounding Module (LDM) eliminates context bias without external terminology databases. Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods, with ablation studies confirming the necessity of both stages. Code and models are available at https://github.com/WissingChen/CMCRL.

Abstract:
Learned image compression has attracted considerable interest in recent years. An analysis transform and a synthesis transform, which can be regarded as coupled transforms, are used to encode an image to a latent feature and decode the feature after quantization to reconstruct the image. Inspired by the success of invertible neural networks in generative modeling, invertible modules can be used to construct the coupled analysis and synthesis transforms. Considering that the noise introduced in the feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization in lossy image compression when using INN with quantization, which differs from using INN for generative modeling. Generally speaking, A-INN can be used as the theoretical foundation for any INN based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce the quantization noise in the decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones to further reduce the noise in feature channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) is developed by explicitly enhancing the high-frequency components in an image to address the loss of high-frequency information inherent in neural network based image compression, thereby enhancing the reconstructed image quality. Extensive experiments demonstrate that the proposed A-INN framework achieves compression efficiency better than or comparable to that of the conventional image compression approach and state-of-the-art learned image compression methods.
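The observation that quantization noise invalidates the invertible process can be reproduced with a toy example. The affine coupling below, with hypothetical `scale` and `shift` constants, is only a stand-in for an invertible module and is not the A-INN architecture; rounding plays the role of feature quantization.

```python
def forward(x1, x2, scale=1.7, shift=0.3):
    """Toy affine coupling: x1 passes through, x2 is transformed using x1."""
    return x1, x2 * scale + shift * x1

def inverse(y1, y2, scale=1.7, shift=0.3):
    """Exact inverse of `forward` when the latent is not quantized."""
    return y1, (y2 - shift * y1) / scale

x = (2.0, 3.0)
latent = forward(*x)

# Without quantization the input is recovered up to floating-point error.
exact = inverse(*latent)

# With rounding (a stand-in for quantization) the inverse no longer matches x.
quantized = tuple(float(round(v)) for v in latent)
approx = inverse(*quantized)
reconstruction_error = abs(approx[1] - x[1])
```

The nonzero `reconstruction_error` is the kind of gap that a decoder-side denoising step has to close, since the exact inverse is applied to a latent that has been perturbed.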

Abstract:
Weakly supervised semantic segmentation (WSSS) is a challenging yet important research field in the vision community. In WSSS, the key problem is to generate high-quality pseudo segmentation masks (PSMs). Existing approaches mainly depend on the discriminative object part to generate PSMs, which inevitably miss object parts or include surrounding image background, as the learning process is unaware of the full object structure. In fact, both the discriminative object part and the full object structure are critical for deriving high-quality PSMs. To fully exploit these two information cues, we build a novel end-to-end learning framework, alternate self-dual teaching (ASDT), based on a dual-teacher single-student network architecture. The information interaction among different network branches is formulated in the form of knowledge distillation (KD). Unlike in conventional KD, the knowledge of the two teacher models is inevitably noisy under weak supervision. Inspired by Pulse Width (PW) modulation, we introduce a PW wave-like selection signal to alleviate the influence of the imperfect knowledge from either teacher model on the KD process. Comprehensive experiments on the PASCAL VOC 2012 and COCO-Stuff 10K datasets demonstrate the effectiveness of the proposed ASDT framework, and new state-of-the-art results are achieved.

Abstract:
Occupancy prediction reconstructs 3D structures of surrounding environments. It provides detailed information for autonomous driving planning and navigation. However, most existing methods rely heavily on LiDAR point clouds to generate occupancy ground truth, which is not available in vision-based systems. In this paper, we propose OccNeRF, a method for training occupancy networks without 3D ground truth. Unlike previous works, which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras’ infinite perceptive range. Neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments for both self-supervised depth estimation and 3D occupancy prediction tasks on the nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method. The code is available at https://github.com/LinShan-Bin/OccNeRF

Abstract:
Weakly-supervised Temporal Action Localization (WTAL) aims to localize action instances with only video-level labels during training, where two primary issues are localization incompleteness and background interference. To relieve these two issues, recent methods adopt an attention mechanism to activate action instances and simultaneously suppress background ones, which has achieved remarkable progress. Nevertheless, we argue that these two issues have not been well resolved yet. On the one hand, the attention mechanism adopts fixed weights for different videos, which cannot handle the diversity of different videos and is thus deficient in addressing the problem of localization incompleteness. On the other hand, previous methods only focus on learning the foreground attention, and the attention weights usually suffer from ambiguity, making it difficult to suppress background interference. To deal with the above issues, in this paper we propose an Adaptive Prototype Learning (APL) method for WTAL, which includes two key designs: 1) an Adaptive Transformer Network (ATN) to explicitly model background and learn video-adaptive prototypes for each specific video; 2) an OT-based Collaborative (OTC) training strategy to guide the learning of prototypes and remove the ambiguity of the foreground-background separation by introducing an Optimal Transport (OT) algorithm into the collaborative training scheme between RGB and FLOW streams. These two key designs work together to learn video-adaptive prototypes and solve the above two issues, achieving robust localization. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that our proposed APL performs favorably against state-of-the-art methods.

Abstract:
Spectral variations pose a common challenge in analyzing hyperspectral images (HSI). To address this, low-rank tensor representation has emerged as a robust strategy, leveraging inherent correlations within HSI data. However, the spatial distribution of ground objects in HSIs is inherently irregular, existing naturally in tensor format, with numerous class-specific regions manifesting as irregular tensors. Current low-rank representation techniques are designed for regular tensor structures and overlook this fundamental irregularity in real-world HSIs, leading to performance limitations. To tackle this issue, we propose a novel model for irregular tensor low-rank representation tailored to efficiently model irregular 3D cubes. By incorporating a non-convex nuclear norm to promote low-rankness and integrating a global negative low-rank term to enhance the discriminative ability, our proposed model is formulated as a constrained optimization problem and solved using an alternating augmented Lagrangian method. Experimental validation conducted on four public datasets demonstrates the superior performance of our method compared to existing state-of-the-art approaches. The code is publicly available at https://github.com/hb-studying/ITLRR

Abstract:
Previous methods utilize the Neural Radiance Field (NeRF) for panoptic lifting, but their training and rendering speeds are unsatisfactory. In contrast, 3D Gaussian Splatting (3DGS) has emerged as a prominent technique due to its rapid training and rendering speed. However, unlike NeRF, the conventional 3DGS may not satisfy the basic smoothness assumption, as it does not rely on any parameterized structures (e.g., MLPs) to render. Consequently, the conventional 3DGS is inherently more susceptible to noisy 2D mask supervision. In this paper, we propose a new method called PLGS that enables 3DGS to generate consistent panoptic segmentation masks from noisy 2D segmentation masks while maintaining superior efficiency compared to NeRF-based methods. Specifically, we build a panoptic-aware structured 3D Gaussian model to introduce smoothness and design effective noise reduction strategies. For the semantic field, instead of initializing with structure from motion, we construct reliable semantic anchor points to initialize the 3D Gaussians. We then use these anchor points as smooth regularization during training. Additionally, we present a self-training approach using pseudo labels generated by merging the rendered masks with the noisy masks to enhance the robustness of PLGS. For the instance field, we project the 2D instance masks into 3D space and match them with oriented bounding boxes to generate cross-view consistent instance masks for supervision. Experiments on various benchmarks demonstrate that our method outperforms previous state-of-the-art methods in terms of both segmentation quality and speed.

Abstract:
Semantic Point Cloud Upsampling (SPU) aims to reconstruct a high-resolution (dense) 3D point cloud from a low-resolution (sparse) one, ensuring that the upsampled point cloud is easily recognizable by downstream tasks. Conventional upsampling architectures typically represent point clouds using high-dimensional feature vectors. However, we observe a dimensional bottleneck, where simply increasing the feature dimensionality does not necessarily improve performance on semantic tasks. This insight motivates us to explore more effective feature representations within upsampling networks. In this paper, we propose a novel SPU method called SPU+, which introduces dimension folding as an alternative strategy for handling high-dimensional features. Specifically, SPU+ decomposes each high-dimensional feature into several g-dimensional packages, allowing interactions among packages within the feature space. Guided by the principle of maximizing feature diversity, we determine that setting the package dimension to 3 yields optimal performance. To enable convolutional operations over these 3D packages, we present a 3D Residual Graph Convolution Block (3D-RGCB) that achieves high computational efficiency. Based on 3D-RGCBs, we design an upsampling network that incorporates three structural modes: pre-mode, middle-mode, and end-mode. Additionally, for large-scale upsampling, we develop a scaling-and-shuffling strategy that adaptively adjusts the spatial size of each 3D package. Finally, we analyze the covering number of the 3D package representation and compare it to traditional high-dimensional feature representations. Experiments on publicly available datasets demonstrate not only the effectiveness of dimension folding but also the state-of-the-art performance achieved by SPU+. Code is available at: https://github.com/lizhuangzi/SPU_plus
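
As an illustrative aside, the dimension-folding idea in the abstract above (splitting one high-dimensional feature into several g-dimensional packages, with g = 3) can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the function name `fold_feature` is hypothetical, and grouping consecutive channels is our assumption, since the abstract does not say how channels are assigned to packages.

```python
from typing import List, Sequence


def fold_feature(feature: Sequence[float], package_dim: int = 3) -> List[List[float]]:
    """Split one feature vector into consecutive packages of size package_dim.

    Assumption: channels are grouped in order; the abstract does not
    specify the actual grouping used by SPU+.
    """
    if package_dim <= 0:
        raise ValueError("package_dim must be positive")
    if len(feature) % package_dim != 0:
        raise ValueError("feature length must be a multiple of package_dim")
    return [list(feature[i:i + package_dim])
            for i in range(0, len(feature), package_dim)]


# A 6-dimensional feature becomes two 3D packages.
packages = fold_feature([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Each resulting package can then be treated like a point in 3D space, which is what makes graph convolutions over packages plausible in the first place.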

Abstract:
Face recognition has achieved remarkable progress and is widely deployed in real-world scenarios. Recently, increasing attention has been given to individual privacy protection, due to the unauthorized leakage of sensitive images by malicious attackers. Multi-modality face images captured by diverse sensors, also called heterogeneous faces, bring more challenges to face privacy protection, yet related research is lacking. In this paper, we propose a novel visual Privacy preserving method for Heterogeneous Face Recognition (Privacy-HFR) to protect perceptual visual information and maintain essential identity information in multi-modality face analysis scenarios. Frequency domain analysis is a vital strategy to bridge the inevitable modality gap for heterogeneous face images. Meanwhile, recent theoretical insights also inspire us to design a suitable frequency component adjustment to balance human visual sensitivity and identity discriminative information. In addition, the ability to defend against recovery attacks has emerged as an essential criterion for privacy-preserving face recognition. However, there appears to be a dilemma: reducing the information accessible to the attack model also affects the identity information extracted for recognition. This is because these two kinds of information are mutually blended in the frequency domain, which makes it a challenge to simultaneously maintain visual privacy and identity distinguishability. Thus, we provide a novel perspective to leverage the randomly optimal solutions and design specific adversarial perturbations against the recovery attack. Experiments on several large-scale heterogeneous face datasets (CASIA NIR-VIS 2.0, LAMP-HQ, Tufts Face and CUFSF) show that the proposed method outperforms existing privacy-preserving face recognition methods in terms of recognition accuracy and privacy protection capability. The code is available at https://github.com/xiyin11/Privacy-HFR

Abstract:
In this paper, we propose a simple yet effective contrastive knowledge distillation framework that achieves sample-wise logit alignment while preserving semantic consistency. Conventional knowledge distillation approaches over-rely on per-sample feature similarity, which risks overfitting, while contrastive approaches focus on inter-class discrimination at the expense of intra-sample semantic relationships. Our approach transfers “dark knowledge” through teacher-student contrastive alignment at the sample level. Specifically, our method first enforces intra-sample alignment by directly minimizing teacher-student logit discrepancies within individual samples. Then, we utilize inter-sample contrasts to preserve semantic dissimilarities across samples. By redefining positive pairs as aligned teacher-student logits from identical samples and negative pairs as cross-sample logit combinations, we reformulate these dual constraints into an InfoNCE loss framework, reducing the computational complexity to below the square of the number of samples while eliminating dependencies on temperature parameters and large batch sizes. We conduct comprehensive experiments on three benchmark datasets, CIFAR-100, ImageNet-1K, and MS COCO, and the experimental results clearly confirm the effectiveness of the proposed method on image classification, object detection, and instance segmentation tasks.
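
For readers unfamiliar with the pairing scheme described above, the following is a minimal sketch of a generic InfoNCE objective in which the positive pair is the teacher and student logits of the same sample and the negatives are cross-sample combinations. It is not the paper's exact loss: the cosine-similarity scoring, the absence of a temperature, and the function name `info_nce_logit_alignment` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def info_nce_logit_alignment(teacher_logits: Sequence[Sequence[float]],
                             student_logits: Sequence[Sequence[float]]) -> float:
    """Mean InfoNCE loss over a batch.

    Positive pair: (student_i, teacher_i). Negatives: (student_i, teacher_j), j != i.
    Similarity is cosine similarity without a temperature (an assumption).
    """
    def normalize(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    teachers = [normalize(v) for v in teacher_logits]
    students = [normalize(v) for v in student_logits]

    losses = []
    for i, s_i in enumerate(students):
        sims = [sum(a * b for a, b in zip(s_i, t_j)) for t_j in teachers]
        # Numerically stable log-sum-exp over all teacher candidates.
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        losses.append(log_denominator - sims[i])
    return sum(losses) / len(losses)


teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
aligned_loss = info_nce_logit_alignment(teacher, teacher)
shuffled_loss = info_nce_logit_alignment(teacher, teacher[1:] + teacher[:1])
```

In this toy batch the loss is lower when each student row matches its own teacher row than when the rows are shuffled, which is the behaviour the positive/negative definition is meant to produce.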

Affiliations: Department of Computer Science, Huaqiao University, Xiamen, China; Department of Computer Science and Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Xiamen, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China; Department of Artificial Intelligence, Huaqiao University, Xiamen, China; Center for Future Multimedia and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an, China

Abstract:
The manual annotation of perfectly aligned labels for cross-modal retrieval (CMR) is incredibly labor-intensive. As an alternative, collecting co-occurring data pairs from the Internet is remarkably cost-effective, but it inevitably introduces Partially Mismatched Pairs (PMPs) and therefore significantly degrades retrieval performance without particular treatment. Previous efforts often utilize pair-wise similarity to filter out the mismatched pairs; such an operation is highly sensitive to mismatched or ambiguous data and thus leads to sub-optimal performance. To alleviate these concerns, we propose an efficient approach, termed UCPM, i.e., Uncertainty-guided Cross-modal retrieval with Partially Mismatched pairs, which can significantly reduce the adverse impact of mismatched data pairs. Specifically, a novel Uncertainty Guided Division (UGD) strategy is designed to divide the corrupted training data into confidently matched (clean), easily identifiable mismatched (noisy), and hard-to-determine (hard) subsets, and the derived uncertainty can simultaneously guide the informative pair learning while reducing the negative impact of potential mismatched pairs. Meanwhile, an effective Uncertainty Self-Correction (USC) mechanism is presented to accurately identify and rectify fluctuating uncertainty during the training process, which further improves the stability and reliability of the estimated uncertainty. In addition, a Trusted Margin Loss (TML) is designed to enhance the discriminability between hard pairs by dynamically adjusting their soft margins to amplify the positive contributions of matched pairs while suppressing the negative impacts of mismatched pairs. Extensive experiments on three widely used benchmark datasets verify the effectiveness and reliability of UCPM compared with existing SOTA approaches and show significantly improved robustness to both synthetic and real-world PMPs. The code is available at: https://github.com/qxzha/UCPM

Abstract:
Multi-view clustering (MVC) aims to exploit the latent relationships between heterogeneous samples in an unsupervised manner, which has served as a fundamental task in the unsupervised learning community and has drawn widespread attention. In this work, we propose a new deep multi-view contrastive clustering method via graph structure awareness (DMvCGSA) by conducting both instance-level and cluster-level contrastive learning to exploit the collaborative representations of multi-view samples. Unlike most existing deep multi-view clustering methods, which usually extract only the attribute features for multi-view representation, we first exploit the view-specific features while preserving the latent structural information between multi-view data via a GCN-embedded autoencoder, and further develop a similarity-guided instance-level contrastive learning scheme to make the view-specific features discriminative. Moreover, unlike existing methods that separately explore common information, which may not contribute to the clustering task, we employ cluster-level contrastive learning to explore the clustering-beneficial consistency information directly, resulting in improved and reliable performance for the final multi-view clustering task. Extensive experimental results on twelve benchmark datasets clearly demonstrate the encouraging effectiveness of the proposed method compared with the state-of-the-art models.

Abstract:
Point cloud attribute compression is a challenging issue in efficiently compressing large volumes of attributes. Despite notable advancements in lossy point cloud compression using deep learning, progress in lossless compression remains limited. Some methods have employed octree- or voxel-based partitioning techniques derived from geometric compression, achieving success on dense point clouds. However, these voxel-based approaches struggle with sparse or unevenly distributed point clouds, leading to performance degradation. In this work, we introduce a novel framework for learning-based lossless point cloud attribute compression, named LOD-PCAC, which leverages a Level-of-Detail (LOD) structure to ensure density-robust compression. Specifically, the input point cloud is divided into multiple detail levels, and vertices from these levels are selected to construct a Reference Set as context, which effectively captures multi-level information. Then we propose the Bit-level Residual Coder for efficient attribute compression. Instead of directly compressing attributes, our method first predicts attribute values and organizes the residual bits into a Bit Matrix as another context, simplifying predictions and fully exploiting channel correlations. Finally, a neural network with specialized encoders processes the context to estimate the probability of each residual bit. Experimental results demonstrate that the proposed method outperforms both traditional and learning-based approaches across various point clouds, exhibiting strong generalization across datasets and robustness to varying densities.
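
To make the "residual bits into a Bit Matrix" step above more concrete, here is a minimal sketch of one way to turn prediction residuals into rows of bits and back again without loss. Several details are assumptions rather than the paper's design: the zigzag mapping for signed residuals, the 9-bit width (enough for residuals of 8-bit attributes), and the function names `residual_bit_matrix` and `attributes_from_bit_matrix`.

```python
from typing import List, Sequence


def residual_bit_matrix(attributes: Sequence[int], predictions: Sequence[int],
                        num_bits: int = 9) -> List[List[int]]:
    """Return one row of bits (most significant first) per attribute value."""
    rows = []
    for actual, predicted in zip(attributes, predictions):
        residual = actual - predicted
        # Zigzag mapping (an assumption): 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
        mapped = 2 * residual if residual >= 0 else -2 * residual - 1
        if mapped >= 1 << num_bits:
            raise ValueError("residual does not fit in num_bits")
        rows.append([(mapped >> shift) & 1 for shift in range(num_bits - 1, -1, -1)])
    return rows


def attributes_from_bit_matrix(rows: Sequence[Sequence[int]],
                               predictions: Sequence[int]) -> List[int]:
    """Invert residual_bit_matrix, recovering the original attributes exactly."""
    attributes = []
    for bits, predicted in zip(rows, predictions):
        mapped = 0
        for bit in bits:
            mapped = (mapped << 1) | bit
        residual = mapped // 2 if mapped % 2 == 0 else -((mapped + 1) // 2)
        attributes.append(predicted + residual)
    return attributes


colors = [200, 13, 255]          # e.g. one RGB attribute, values are illustrative
predicted = [190, 20, 0]         # hypothetical predictor output
matrix = residual_bit_matrix(colors, predicted)
restored = attributes_from_bit_matrix(matrix, predicted)
```

In a learned codec, an entropy coder would then compress each bit using a probability estimated from context; that part is outside the scope of this sketch.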

Abstract:
Currently, two main research lines in efficient context modeling for image dehazing are tailoring effective feature modulation mechanisms and utilizing the Fourier transform more precisely. The former is usually based on self-scale features that ignore complementary cross-scale/level features, and the latter tends to overlook regions with pronounced haze degradation and intricate structures. This paper introduces a novel spatial and frequency modulation perspective to synergistically investigate contextual feature modeling for efficient image dehazing. Specifically, we develop a Spatial Frequency Modulator (SFM) equipped with a Cross-Scale Modulator (CSM) and a Frequency Modulator (FM) to implement intra-block feature modulation. The CSM progressively aggregates hierarchical features across different scales, employing them for spatial self-modulation, and the FM subsequently adopts a dual-branch design to focus more on the crucial areas with severe haze and complex structures for reconstruction. Further, we propose a Cross-Level Modulator (CLM) to facilitate inter-block feature mutual modulation, enhancing seamless interaction between features at different depths and layers. Integrating the above modules into the U-Net architecture, we construct a two-stage spatial frequency modulation network (SFMN). Extensive quantitative and qualitative evaluations showcase the superior performance and efficiency of the proposed SFMN over recent state-of-the-art image dehazing methods. The source code can be found at https://github.com/it-hao/SFMN.

Abstract:
Generalization under distribution shifts has been a great challenge in computer vision. The prevailing practice of directly employing one-hot labels as the training targets in domain generalization (DG) can lead to gradient conflicts, making it insufficient for capturing the intrinsic class characteristics and making it hard to increase the intra-class variation. Besides, existing methods in DG mostly overlook the distinct contributions of source (seen) domains, resulting in uneven learning from these domains. To address these issues, we first present a theoretical and empirical analysis of the existence of gradient conflicts in DG, unveiling the previously unexplored relationship between distribution shifts and gradient conflicts during the optimization process. We then present a novel perspective on DG from the empirical source domain’s risk, and propose a new paradigm for DG called Diverse Target and Contribution Scheduling (DTCS). DTCS comprises two innovative modules: Diverse Target Supervision (DTS) and Diverse Contribution Balance (DCB), with the aim of addressing the limitations associated with the common use of one-hot labels and equal contributions for source domains in DG. Specifically, DTS employs distinct soft labels as training targets to account for various feature distributions across domains and thereby mitigates the gradient conflicts, and DCB dynamically balances the contributions of source domains by ensuring a fair decline in the losses of different source domains. Extensive experiments with analysis on four benchmark datasets show that the proposed method achieves competitive performance in comparison with state-of-the-art approaches, demonstrating the effectiveness and advantages of the proposed DTCS. The source code will be available at https://github.com/longshaocong/DTCS

Abstract:
Video watermarking embeds a message into a cover video in an imperceptible manner, which can be retrieved even if the video undergoes certain modifications or distortions. Traditional watermarking methods are often manually designed for particular types of distortions and thus cannot simultaneously handle a broad spectrum of distortions. To this end, we propose a robust deep learning-based solution for video watermarking that is end-to-end trainable. Our model consists of a novel multiscale design where the watermarks are distributed across multiple spatial-temporal scales. Extensive evaluations on a wide variety of distortions show that our method outperforms traditional video watermarking methods as well as deep image watermarking models by a large margin. We further demonstrate the practicality of our method on a realistic video-editing application.

Abstract:
In Few-Shot Learning (FSL), the objective is to correctly recognize new samples from novel classes with only a few available samples per class. Existing methods in FSL primarily focus on learning transferable knowledge from base classes by maximizing the information between feature representations and their corresponding labels. However, this approach may suffer from the “supervision collapse” issue, which arises due to a bias towards the base classes. In this paper, we propose a solution to address this issue by preserving the intrinsic structure of the data and enabling the learning of a generalized model for the novel classes. Following the InfoMax principle, our approach maximizes two types of mutual information (MI): between the samples and their feature representations, and between the feature representations and their class labels. This allows us to strike a balance between discrimination (capturing class-specific information) and generalization (capturing common characteristics across different classes) in the feature representations. To achieve this, we adopt a unified framework that perturbs the feature embedding space using two low-bias estimators. The first estimator maximizes the MI between a pair of intra-class samples, while the second estimator maximizes the MI between a sample and its augmented views. This framework effectively combines knowledge distillation between class-wise pairs and enlarges the diversity in feature representations. Extensive experiments on popular FSL benchmarks show that our approach achieves performance comparable to that of state-of-the-art competitors. For example, we achieve an accuracy of 69.53% on the miniImageNet dataset and 77.06% on the CIFAR-FS dataset for the 5-way 1-shot task.

Abstract:
In this paper, we propose a novel Transformer-based approach, namely Cross-modal Contrastive Masked AutoEncoder (C2MAE), to Self-Supervised Learning (SSL) on compressed videos. A unified Transformer encoder is employed to discover relationships among visual tokens from RGB frames, motion vectors, and residuals. A hybrid SSL framework is proposed, which combines the complementary advantages of Masked Image Modeling (MIM) and Contrastive Learning (CL) pretext tasks for powerful representation learning. The MIM branch extends VideoMAE with a new Fine-Grained Motion-aware Masking (FGMM) strategy and a modified Multi-modal Reconstruction (MR) task. FGMM computes motion saliency maps as motion priors to guide the masks so that they fit the data properties of the compressed domain, and the MR task emphasizes reconstructing raw videos from the joint representations of the corresponding compressed videos, in addition to reconstruction within each single modality. The CL branch introduces the Contrastive Cross-modal Learning (CCL) module, in which the features from a compressed video clip are compared with those from its raw video counterpart instead of with the widely used augmented data. Owing to these designs, C2MAE significantly enhances interactions across modalities to compensate for the sparsity of I-frames and the coarse and noisy nature of P-frames, thus delivering much stronger pre-trained models. Extensive experiments are conducted on the UCF-101, HMDB-51 and Kinetics-400 benchmarks with state-of-the-art results reported, demonstrating its effectiveness.

Abstract:
Class incremental semantic segmentation (CISS) aims to progressively segment newly introduced classes while preserving the memory of previously learned ones. Traditional CISS methods directly employ advanced semantic segmentation models (e.g., Deeplab-v3) as continual learners. However, these methods require substantial computational and memory resources, limiting their deployment on edge devices. In this paper, we propose a Lightweight Class Incremental Semantic Segmentation (LISS) model tailored for resource-constrained scenarios. Specifically, we design an automatic knowledge-preservation pruning strategy based on the Hilbert-Schmidt Independence Criterion (HSIC) Lasso, which automatically compresses the CISS model by searching for global penalty coefficients. Nonetheless, reducing model parameters exacerbates catastrophic forgetting during incremental learning. To mitigate this challenge, we develop a clustering-based pseudo labels generator to obtain high-quality pseudo labels by considering the feature space structure of old classes. It adjusts predicted probabilities from the old model according to the feature proximity to nearest sub-cluster centers for each class. Additionally, we introduce a customized soft labels module that distills the semantic relationships between classes separately. It decomposes soft labels into target probabilities, background probabilities, and other probabilities, thereby maintaining knowledge of previously learned classes in a fine-grained manner. Extensive experiments on two benchmark datasets demonstrate that our LISS model outperforms state-of-the-art approaches in both effectiveness and efficiency.

Abstract:
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models’ generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.

Abstract:
In the field of image quality assessment (IQA), researchers have been studying the mean opinion score (MOS) of image quality for decades. They focus on developing IQA methods with the help of MOS without using the potential of the distribution of opinion scores (DOS). We find that the Gaussian mixture distribution (GMD) can more accurately describe the DOS of image quality on SJTU IQSD and KonIQ-10K databases compared to some traditional distributions. Therefore, this paper proposes a blind IQA method that predicts the MOS of image quality by learning the GMD-based image quality. The proposed method consists of a visual feature learning module and a GMD learning module. The visual feature learning module uses a multi-stage Swin Transformer model and a CLIP feature extractor to extract visual features from an image. The GMD learning module then maps the extracted visual features to the GMD-based image quality using a mixture density network, where the mean of the GMD represents the MOS of image quality. We not only use the MOS of image quality to train the proposed method, but also employ the DOS of image quality for auxiliary training to improve the prediction performance of the proposed method. To address the lack of DOS in some existing IQA databases, we introduce a pseudo DOS generation strategy to generate the DOS of image quality for training, which significantly improves the applicability of the proposed method. Numerous analyses show that the proposed method is superior to most state-of-the-art IQA methods in predicting both the MOS and the DOS, thus facilitating a deeper investigation into the DOS of image quality in IQA.
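
The statement above that "the mean of the GMD represents the MOS" follows from the standard moments of a Gaussian mixture, which the sketch below computes. The component weights, means, and standard deviations here are made-up illustrative numbers, and the function name `gmd_mean_and_std` is hypothetical; in the described method these parameters would come from a mixture density network.

```python
import math
from typing import Sequence, Tuple


def gmd_mean_and_std(weights: Sequence[float], means: Sequence[float],
                     stds: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of a 1D Gaussian mixture.

    mean     = sum_k w_k * mu_k
    variance = sum_k w_k * (sigma_k^2 + mu_k^2) - mean^2
    Weights are renormalised so that they sum to one.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mean = sum(wk * mk for wk, mk in zip(w, means))
    second_moment = sum(wk * (sk * sk + mk * mk) for wk, mk, sk in zip(w, means, stds))
    variance = max(second_moment - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Two equally weighted opinion clusters around scores 2 and 4 (illustrative values).
predicted_mos, opinion_spread = gmd_mean_and_std([0.5, 0.5], [2.0, 4.0], [1.0, 1.0])
```

The spread term shows why a mixture carries more information than the MOS alone: two images with the same mean score can differ in how strongly raters disagree.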

Abstract:
Variational approaches to disparity estimation typically use a linearised brightness constancy constraint, which only applies in smooth regions and over small distances. Accordingly, current variational approaches rely on a schedule to progressively include image data. This paper proposes the use of Gradient Consistency information to assess the validity of the linearisation; this information is used to determine the weights applied to the data term as part of an analytically inspired Gradient Consistency Model. The Gradient Consistency Model penalises the data term for view pairs that have a mismatch between the spatial gradients in the source view and the spatial gradients in the target view. Instead of relying on a tuned or learned schedule, the Gradient Consistency Model is self-scheduling, since the weights evolve as the algorithm progresses. We show that the Gradient Consistency Model outperforms standard coarse-to-fine schemes and the recently proposed progressive inclusion of views approach in both rate of convergence and accuracy.
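
For context, the linearised brightness constancy constraint mentioned above can be written in its generic textbook form as follows. The notation is an assumption and not taken from the paper: $I_s$ and $I_t$ are the source and target views, $d = d_0 + \delta d$ is the horizontal disparity split into a current estimate $d_0$ and an update $\delta d$, and $r$ is the data-term residual.

```latex
\begin{align}
  I_s(x, y) &= I_t(x + d, y)
    && \text{(brightness constancy)} \\
  I_t(x + d_0 + \delta d, y) &\approx I_t(x + d_0, y)
    + \delta d \, \partial_x I_t(x + d_0, y)
    && \text{(first-order Taylor expansion)} \\
  r(\delta d) &= I_t(x + d_0, y) - I_s(x, y)
    + \delta d \, \partial_x I_t(x + d_0, y)
    && \text{(linearised residual)}
\end{align}
```

The Taylor step is accurate only when $\delta d$ is small and the image is locally smooth, which is why the linearisation applies only in smooth regions and over small distances.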

Abstract:
Incomplete multi-view clustering has gained significant attention due to the prevalence of incomplete multi-view data in real-world scenarios. However, existing methods often overlook the critical role of inter-view relationships. In unsupervised settings, selectively leveraging cross-view topological relationships can effectively guide view completion and representation learning. To address this challenge, we propose a novel framework called Selective Cross-View Topology Incomplete Multi-View Clustering (SCVT). Our approach constructs a view topology graph using the Optimal Transport (OT) distance between views. This graph helps identify neighboring views for those with missing data, enabling the inference of topological relationships and accurate completion of missing samples. Additionally, we introduce the Max View Graph Contrastive Alignment module to facilitate information transfer and alignment across neighboring views. Furthermore, we propose the View Graph Weighted Intra-View Contrastive Learning module, which enhances representation learning by pulling representations of samples within the same cluster closer, while applying varying degrees of enhancement across different views based on the view graph. Our method achieves state-of-the-art performance on seven benchmark datasets, significantly outperforming existing methods for incomplete multi-view clustering and demonstrating its effectiveness.

Abstract:
Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g., motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at https://github.com/balabooooo/AED

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by leveraging prior knowledge of known primitives. However, real-world visual features of attributes and objects are often entangled, causing distribution shifts between seen and unseen combinations. Existing methods often ignore intrinsic variations and interactions among primitives, leading to poor feature discrimination and biased predictions. To address these challenges, we propose Multi-level Contextual Prototype Modulation (MCPM), a transformer-based framework with a hierarchical structure that effectively integrates attributes and objects to generate richer visual embeddings. At the feature level, we apply contrastive learning to improve discriminability across compositional tasks. At the prototype level, a subclass-driven modulator captures fine-grained attribute-object interactions, enabling better adaptation to long-tail distributions. Additionally, we introduce a Minority Attribute Enhancement (MAE) strategy that synthesizes virtual samples by mixing attribute classes, further mitigating data imbalance. Experiments on four benchmark datasets (MIT-States, C-GQA, UT-Zappos, and VAW-CZSL) show that MCPM brings significant performance improvements, verifying its effectiveness in complex composition scenes.

Abstract:
Palmprint recognition has recently garnered attention due to its high accuracy, strong robustness, and high security. Existing deep learning-based palmprint recognition methods usually require large amounts of data for centralized training, facing the challenge of privacy disclosure. In addition, the non-independent and identically distributed (non-IID) issue in the multi-spectral palmprint images generally leads to the degradation of recognition performance. To tackle these problems, this paper proposes a dynamic personalized federated learning model for cross-spectral palmprint recognition, called DPFed-Palm. Specifically, for each client’s local training, we present a new combination of loss functions to enforce the constraints of local models and effectively enhance the feature representation capability of models. Subsequently, DPFed-Palm aggregates the above-trained local models by combining the aggregation strategies of Federated Averaging (FedAvg) and Personalized Federated Learning (PFL) to obtain the best personalized global model of each client. For the selection of the best personalized global model, we develop a dynamic weight selection strategy to obtain the optimal weights of the local and global models by cross-spectral (cross-client) testing. Extensive experimental results on three public datasets (PolyU multispectral, IITD, and CASIA) show that the proposed method outperforms the existing techniques in both privacy preservation and recognition performance.

Abstract:
Generative models represented by diffusion models have recently shown great potential in image generation. They usually use a reverse iteration process to map noise into the data. However, for many real-world applications such as image restoration and translation, the model input comes from a distribution that is not random noise, making it difficult for these models to adapt directly to these tasks. In this paper, we introduce Image-to-Image Bayesian Flow Networks (I2I-BFNs), a novel framework for general-purpose image-to-image translation (I2I) that operates within the parameter space of distributions. This method upholds Gaussian distributions over pixel intensities, refining distribution parameters through closed-form Bayesian inference, steered by the network’s predictions for the target image. An essential aspect of our approach is the utilization of the conditional image as a robust prior parameter, initializing the translation process from a deterministic, clean image to reduce variance and produce interpretable generation. Additionally, we introduce a skip sampling technique that enhances the efficiency of I2I-BFNs, facilitating rapid translation in diverse image restoration and general I2I tasks. Our experimental evaluations showcase the model’s competitive edge in various settings, underscoring its efficacy and adaptability. This work contributes new insights and opportunities for the large-scale development of efficient conditional generation systems.
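
The idea of refining Gaussian parameters in closed form, starting from the conditional image as the prior, can be sketched with the textbook conjugate update for a Gaussian with known precision. This is only an illustration: the function name, the pixel values, and the precision values are hypothetical, and the actual update schedule used by I2I-BFNs is not stated in the abstract.

```python
import numpy as np

def gaussian_update(prior_mean, prior_precision, obs, obs_precision):
    """Conjugate update of a Gaussian belief given a Gaussian observation of known precision."""
    prior_mean = np.asarray(prior_mean, dtype=float)
    obs = np.asarray(obs, dtype=float)
    post_precision = prior_precision + obs_precision  # precisions add
    # Posterior mean is the precision-weighted average of prior mean and observation.
    post_mean = (prior_precision * prior_mean + obs_precision * obs) / post_precision
    return post_mean, post_precision

conditional_image = np.array([0.2, 0.5])  # hypothetical input pixels used as the prior mean
noisy_target_obs = np.array([0.8, 0.5])   # hypothetical noisy observation of the target pixels
mean, precision = gaussian_update(conditional_image, 1.0, noisy_target_obs, 3.0)
```

Each such step moves the per-pixel mean toward the observation and increases the precision, so the belief sharpens as more evidence accumulates.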

Abstract:
Partial domain adaptation (PDA) is a challenging task in real-world machine learning scenarios. It aims to transfer knowledge from a labeled source domain to a related unlabeled target domain, where the support set of the source label distribution subsumes the target one. Previous PDA works managed to correct the label distribution shift by weighting samples in the source domain. However, the simple reweighting technique cannot explore the latent structure or sufficiently use the labeled data, so models are prone to over-fitting on the source domain. In this work, we propose a novel importance sampling-based shift correction (IS2C) method, where new labeled data are sampled from a built sampling domain, whose label distribution is supposed to be the same as the target domain, to characterize the latent structure and enhance the generalization ability of the model. We provide theoretical guarantees for IS2C by proving that the generalization error can be sufficiently dominated by IS2C. In particular, by implementing sampling with the mixture distribution, the extent of shift between source and sampling domains can be connected to generalization error, which provides an interpretable way to build IS2C. To improve knowledge transfer, an optimal transport-based independence criterion is proposed for conditional distribution alignment, where the computation of the criterion can be adjusted to reduce the complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ in realistic PDA scenarios. Extensive experiments on PDA benchmarks validate the theoretical results and demonstrate the effectiveness of our IS2C over existing methods.
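
The general idea of drawing source samples so that their label distribution matches a target one can be sketched as follows. This is a generic importance-resampling illustration, not the IS2C algorithm itself: the function name and the toy label counts are hypothetical, and the target label distribution is assumed to be known, whereas in practice it would have to be estimated.

```python
import numpy as np

def resample_to_target_labels(labels, target_dist, n_samples, rng):
    """Draw source indices so that sampled labels follow target_dist (a dict: class -> probability)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    source_dist = counts / counts.sum()
    # Importance weight per class: target probability divided by source probability.
    target = np.array([target_dist.get(int(c), 0.0) for c in classes])
    class_weight = target / source_dist
    # Look up each sample's class weight, then normalise into sampling probabilities.
    sample_weight = class_weight[np.searchsorted(classes, labels)]
    probs = sample_weight / sample_weight.sum()
    return rng.choice(len(labels), size=n_samples, replace=True, p=probs)

# Toy source domain with three classes; class 2 is absent from the target (the partial setting).
source_labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
idx = resample_to_target_labels(source_labels, {0: 0.5, 1: 0.5}, 1000, np.random.default_rng(0))
sampled_labels = source_labels[idx]
```

Source-only classes receive zero weight and are never drawn, which is the effect that reweighting aims for in the partial setting.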

Abstract:
Image restoration aims to recover the latent clean image from a degraded counterpart. In general, the prevailing state-of-the-art image restoration methods concentrate on solving only a specific degradation type according to the task, e.g., deblurring or deraining. However, if the corresponding well-trained frameworks confront other real-world image corruptions, i.e., corruptions not covered in the training phase, state-of-the-art restoration models will suffer from a lack of generalization ability. We have observed that an image restoration model can be easily confused by noise corruption. Towards improving the robustness of image restoration networks, in this paper, we focus on alleviating the corruption of noise in various image restoration tasks, which is almost inevitable in real-world scenes. To this end, we devise a novel Cascade Augmentation strategy against Noise (CAN) to enhance the robustness of specific image restoration. Specifically, the given degraded images are sequentially augmented from different perspectives, i.e., noise-aware augmentation and model-aware augmentation. The noise-aware augmentation is proposed to enrich the samples by introducing various noise operations. Moreover, to adapt to more unknown corruptions, we propose a novel model-aware augmentation mechanism, which enhances the scalability by exploring useful spatial and frequency clues with the help of model randomness. It is worth noting that the proposed augmentation scheme is model-agnostic, and it can be plugged into arbitrary state-of-the-art image restoration architectures. In addition, we construct noise corruption benchmark datasets, derived from the validation set of standard image restoration datasets, to assist us in evaluating the robustness of restoration networks. Extensive quantitative and qualitative evaluations demonstrate that the proposed method has strong generalization capability, which can enhance the robustness of various image restoration frameworks when facing diverse noises.

Abstract:
Due to its distinctive texture and intricate details, palmprint has emerged as a critical modality in biometric identity recognition. The absence of large-scale public palmprint datasets has substantially impeded the advancement of palmprint research, resulting in inadequate accuracy in commercial palmprint recognition systems. However, existing generative methods exhibit insufficient generalization, as the images they generate differ in specific ways from the conditional images. This paper proposes a method for generating palmprint images using a controllable diffusion model (PalmDiff), which addresses the issue of insufficient datasets by generating palmprint data, improving the accuracy of palmprint recognition. We introduce a diffusion process that effectively tackles the problems of excessive noise and loss of texture details commonly encountered in diffusion models. A linear attention mechanism is employed to enhance the backbone’s expressive capacity and reduce the computational complexity. In addition, we propose an ID loss function that enables the diffusion model to consistently generate palmprint images within the same identity space. PalmDiff is compared with other generation methods in terms of both image quality and the enhancement of palmprint recognition performance. Experiments show that PalmDiff performs well in image generation, with an FID score of 13.311 on MPD and 18.434 on Tongji. Besides, PalmDiff has significantly improved various backbones for palmprint recognition compared to other generation methods.

Abstract:
The adage “Beautiful Outside But Ugly Inside” resonates with the security and explainability challenges encountered in image aesthetics assessment (IAA). Although deep neural networks (DNNs) have demonstrated remarkable performance in various IAA tasks, how to probe, explain, and enhance aesthetics-oriented “black-box” models has not yet been investigated to our knowledge. This lack of investigation has significantly impeded the commercial application of IAA. In this paper, we investigate the susceptibility of current IAA models to adversarial attacks and aim to elucidate the underlying mechanisms that contribute to their vulnerabilities. To address this, we propose a novel diffusion-based framework as an attacker (DA3Attacker), capable of generating adversarial examples (AEs) to deceive diverse black-box IAA models. DA3Attacker employs a dedicated Attack Diffusion Transformer, equipped with modular aesthetics-oriented filters. By undergoing two unsupervised training stages, it constructs a latent space to generate AEs and facilitates two distinct yet controllable attack modes: restricted and unrestricted. Extensive experiments on 26 baseline models demonstrate that our method effectively explores the vulnerabilities of these IAA models, while also providing multi-attribute explanations for their feature dependencies. To facilitate further research, we contribute the evaluation tools and four metrics for measuring adversarial robustness, as well as a dataset of 60,000 re-labeled AEs for fine-tuning IAA models. The resources are available here.

Abstract:
Infrared images exhibit a significantly different appearance compared to visible counterparts. Existing infrared and visible image fusion (IVF) methods fuse features from both infrared and visible images, producing a new “image” appearance not inherently captured by any existing device. From an appearance perspective, infrared, visible, and fused images belong to different data domains. This difference makes it challenging to apply fused images because their domain-specific appearance may be difficult for downstream systems, e.g., pre-trained segmentation models, to handle. For the same reason, accurately assessing the quality of the fused image is challenging. To address these problems, we propose a novel IVF method, FusionINV, which produces fused images with an appearance similar to visible images. FusionINV employs the pre-trained Stable Diffusion (SD) model to invert infrared images into the noise feature space. To inject visible-style appearance information into the infrared features, we leverage the inverted features from visible images to guide this inversion process. In this way, we can embed all the information of infrared and visible images in the noise feature space, and then use the prior of the pre-trained SD model to generate visually friendly images that align more closely with the RGB distribution. Specifically, to generate the fused image, we design a tailored fusion rule within the denoising process that iteratively fuses visible-style infrared and visible features. As a result, the fused image falls into the visible domain and can be directly applied to existing downstream machine systems. Thanks to advancements in image inversion, FusionINV can directly produce fused images in a training-free manner. Extensive experiments demonstrate that FusionINV achieves outstanding performance in both human visual evaluation and machine perception tasks. The code is available at https://github.com/erfect2020/FusionINV

Abstract:
Practical deployments, especially on resource-limited edge devices, necessitate high speed for visual object trackers. To meet this demand, we introduce a new efficient tracker with a Two-Stream architecture, named ToS. While the recent one-stream tracking framework, employing a unified backbone for simultaneous processing of both the template and search region, has demonstrated exceptional efficacy, we find the conventional two-stream tracking framework, which employs two separate backbones for the template and search region, offers inherent advantages. The two-stream tracking framework is more compatible with advanced lightweight backbones and can efficiently utilize benefits from large templates. We demonstrate that the two-stream setup can exceed the one-stream tracking model in both speed and accuracy through strategic designs. Our methodology rejuvenates the two-stream tracking paradigm with lightweight pre-trained backbones and the proposed three efficient strategies: 1) A feature-aggregation module that improves the representation capability of the backbone, 2) A channel-wise approach for feature fusion, presenting a more effective and lighter alternative to spatial concatenation techniques, and 3) An expanded template strategy to boost tracking accuracy with negligible additional computational cost. Extensive evaluations across multiple tracking benchmarks demonstrate that the proposed method sets a new state-of-the-art performance in efficient visual tracking.

Abstract:
Contrastive Language-Image Pre-training (CLIP) has achieved remarkable results in the field of person re-identification (ReID) due to its excellent cross-modal understanding ability and high scalability. However, the text encoder of CLIP mainly focuses on easy-to-describe attributes such as clothing, and clothing is the main interference factor that reduces the recognition accuracy in cloth-changing person ReID (CC ReID). Consequently, CLIP applied directly to cloth-changing scenarios may fail to adapt to such dynamic feature changes, reducing identification precision. To solve this challenge, we propose a CLIP-based multi-modal feature learning framework (CMFF) for CC ReID. Specifically, we first design a pose-aware identity enhancement module (PIE) to enhance the model’s perception of identity-intrinsic information. In this branch, to weaken the interference of clothing information, we apply a ranking loss to minimize the difference between appearance and pose in the feature space. Secondly, we propose a global-local hybrid attention module (GLHA), which fuses head and global features through a cross-attention mechanism, enhancing the global recognition ability of key head information. Finally, considering that existing CLIP-based methods often ignore the potential importance of shallow features, we propose a graph-based multi-layer interactive enhancement module (GMIE), which groups and integrates multi-layer features of the image encoder, aiming to enhance the contextual awareness of multi-scale features. Extensive experiments on multiple popular pedestrian datasets validate the outstanding performance of our proposed CMFF.

Abstract:
Image enhancement methods have been widely studied to improve the visual quality of diverse images, implicitly assuming that all human observers have normal vision. However, a large population around the world suffers from Color Vision Deficiency (CVD). Enhancing images to compensate for their perceptions remains a challenging issue. Existing CVD compensation methods have two drawbacks: first, the available datasets and validations have not been rigorously tested by CVD individuals; second, these methods struggle to strike an optimal balance between contrast enhancement and naturalness preservation, which often results in suboptimal outcomes for individuals with CVD. To address these issues, we develop the first large-scale, CVD-individual-labeled dataset called FZU-CVDSet and a CVD-friendly recoloring algorithm called ColorAssist. In particular, we design a perception-guided feature extraction module and a perception-guided diffusion transformer module that jointly achieve efficient image recoloring for individuals with CVD. Comprehensive experiments on both FZU-CVDSet and subjective tests in hospitals demonstrate that the proposed ColorAssist closely aligns with the visual perceptions of individuals with CVD, achieving superior performance compared with state-of-the-art methods. The source code is available at https://github.com/xsx-fzu/ColorAssist.

Abstract:
With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG

Abstract:
The development of facial editing, virtual makeup, AR/VR technologies, and 3D game applications underscores the need for advanced 3D facial attractiveness research. However, due to the lack of 3D beauty face data and the complexity of handling 3D face data, 3D facial aesthetics research remains largely unexplored. To fill this gap, we propose 3DFACENet, an innovative system designed for the computation and enhancement of 3D facial attractiveness. Our approach employs a 3D facial reconstruction encoder to generate encoded vectors from images and a render module to obtain 3D face models. To minimize computational load, we propose an attractiveness computation module that leverages 3D shape and texture coefficients rather than 3D mesh models to assess facial attractiveness, achieving state-of-the-art results. To balance aesthetic enhancement and identity preservation, we design a controllable beautification decoder. For the first time, we introduce the concept of “attractive centers”, demonstrating that an individual’s distance to these centers is significantly negatively correlated with their beauty scores. Our beautification decoder edits 3D facial coefficients towards these centers, achieving a significant and controllable enhancement in facial attractiveness. Extensive experiments on the SCUT-FBP5500 and MEBeauty datasets validate the effectiveness and feasibility of 3DFACENet.

Abstract:
In recent years, there has been a rapid growth in applications that rely on point clouds to represent the 3D world, driven by the increasing demand for immersive and other related scenarios. However, compressing large, high-precision point cloud data efficiently while maintaining high perceptual quality for human vision remains a challenge. To solve this problem, we propose a new structure-aware generative point cloud compression framework for human vision. In the encoder, we focus on information to which human vision is more sensitive and obtain this type of information at different scales. This allows us to capture structurally important information at both global and local scales, which is more difficult to reconstruct. For the decoder, we introduce a progressive generative reconstruction approach that utilizes acquired information from the encoder to guide the generation of point cloud surfaces. Moreover, we propose a novel probability cloud-based discriminator. Instead of directly assessing the authenticity of the generated point clouds, our discriminator evaluates the probability distribution of the existence of points within the generated point cloud. This approach reduces the difficulty of discrimination while effectively improving the accuracy of the generator in generating probability distributions. Given accurate probabilities, we can obtain a high-accuracy point cloud by pruning the points with low probability. Through comprehensive experiments, we demonstrate the effectiveness and superiority of our proposed framework in terms of encoding efficiency, high perceptual quality, and generation quality.
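
The final pruning step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, the fixed threshold of 0.5, and the toy coordinates are hypothetical, and the abstract does not say how the cut-off is actually chosen.

```python
import numpy as np

def prune_by_probability(points, probs, threshold=0.5):
    """Keep only the candidate points whose predicted existence probability reaches the threshold."""
    points = np.asarray(points, dtype=float)  # shape (N, 3): candidate xyz coordinates
    probs = np.asarray(probs, dtype=float)    # shape (N,): predicted existence probabilities
    keep = probs >= threshold                 # boolean mask over the candidates
    return points[keep]

candidates = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
existence_probs = np.array([0.9, 0.2, 0.6, 0.4])
pruned = prune_by_probability(candidates, existence_probs)
```

Raising the threshold trades completeness for precision: fewer spurious points survive, but so do fewer true ones.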

Abstract:
Event cameras, with high temporal resolution and high dynamic range, have shown great potential under extreme scenarios such as high-speed movement and low illumination. However, previous event representation methods typically aggregate event data into a single dense tensor, often overlooking the dynamic changes of events within a given time unit. This limitation can introduce historical artifacts and semantic inconsistencies, ultimately degrading model performance. Inspired by human visual prior, we propose a motion and appearance decoupling (MAD) event representation to disentangle the mixed spatial-temporal event tensor into two independent branches. This bio-inspired design helps the network extract discriminative temporal (i.e., motion) and spatial (i.e., appearance) information, thus reducing the network’s learning burden toward complex high-level interpretation tasks. In our method, the event motion guided attention module (EMGA) is designed to achieve temporal and spatial feature interaction and fusion sequentially. Based on EMGA, three specially designed decoder heads are proposed for several representative event-based tasks (i.e., object detection, semantic segmentation, and human pose estimation). Experimental results demonstrate that our method achieves state-of-the-art performance on the above three tasks, which reveals that our method is an easy-to-implement replacement for current event-based methods. Our code is available at: https://github.com/ChenYichen9527/MAD-representation

Abstract:
This paper presents an empirical investigation into illuminant estimation using multi-spectral images. Our study emphasizes two key contributions: (1) the utilization of the estimated multi-spectral images and (2) the incorporation of a hierarchical structure. Firstly, exploiting multi-spectral images proves to have a positive influence on illuminant estimation, particularly in scenarios characterized by monochromatic images where conventional color constancy methods face challenges. Our experimental results vividly illustrate the effectiveness of leveraging spectral information in enhancing illuminant estimation. Secondly, the adoption of a hierarchical structure stems from the need for spatial invariance in the task of estimating a global illuminant. To further enhance the performance of the hierarchical structure, we employ a contrastive loss applied to different scaled outputs. This approach demonstrates remarkable effectiveness on our custom dataset, showcasing superior performance compared to the existing methods. In addition, we extend the evaluation to the widely recognized NUS-8 dataset, where the proposed method showcases a notable 26.7% relative improvement over the previous state-of-the-art methods.

Affiliations: College of Computer Science and Software Engineering, Hohai University, Nanjing, China; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; Department of Computer and Information Science, University of Macau, Macau, China; College of Information Science and Engineering, Hohai University, Nanjing, China; Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China

Abstract:
Convolutional Neural Networks (CNNs) have demonstrated strong feature extraction capabilities in Euclidean spaces, achieving remarkable success in hyperspectral image (HSI) classification tasks. Meanwhile, Graph Convolutional Networks (GCNs) effectively capture spatial-contextual characteristics by leveraging correlations in non-Euclidean spaces, uncovering hidden relationships to enhance the performance of HSI classification (HSIC). Methods combining GCNs with CNNs have achieved excellent results. However, existing GCN methods primarily rely on single-scale graph structures, limiting their ability to extract features across different spatial ranges. To address this issue, this paper proposes a multiscale segmentation-guided fusion network (MS2FN) for HSIC. This method constructs pixel-level graph structures based on multiscale segmentation data, enabling the GCN to extract features across various spatial ranges. Moreover, effectively utilizing features extracted from different spatial scales is crucial for improving classification performance. This paper adopts distinct processing strategies for different feature types to enhance feature representation. Comparative experiments demonstrate that the proposed method outperforms several state-of-the-art (SOTA) approaches in accuracy. The source code will be released at https://github.com/shengrunhua/MS2FN

Abstract:
Since degraded underwater images are not always accompanied by distortion-free counterparts in real-world situations, existing underwater image enhancement (UIE) methods are mostly learned on a paired set consisting of raw underwater images and their corresponding pseudo-reference labels. Although the existing UIE datasets manually select the best model-generated results as pseudo-references, such pseudo-reference labels do not always exhibit perfect visual quality. Therefore, it would be interesting to investigate whether it is possible to break through the performance bottleneck of UIE networks trained with imperfect pseudo-references. Motivated by these facts, this paper focuses on innovating more advanced loss functions rather than designing more complex network architectures. Specifically, a plug-and-play hybrid Performance SurPassing Loss (PSPL), consisting of a Quality Score Comparison Loss (QSCL) and a scene Depth-aware Unpaired Contrastive Loss (DUCL), is formulated to guide the training of the UIE network. Functionally, QSCL aims to guide the UIE network to generate enhanced results with better visual quality than pseudo-references by constructing image quality score comparison losses at both the image level and the region level. Nevertheless, using QSCL alone cannot guarantee the desired results for severely degraded distant regions. Therefore, we also design a tailored DUCL to handle this challenging issue from the scene depth perspective, i.e., DUCL encourages the distant regions of the enhanced results to be closer to the high-quality nearby regions (pull) and far away from the low-quality distant regions (push) of the pseudo-references. Extensive experimental results demonstrate the advantage of using PSPL over state-of-the-art methods even with an extremely simple and lightweight UIE network. The source code will be released at https://github.com/lewis081/PSPL

Abstract:
Multimodal emotion recognition is a task that integrates textual, visual, and audio data to holistically infer an individual’s emotional state. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between multiple modalities in common goal learning. Due to multimodal heterogeneity, common goal learning inadvertently introduces optimization biases and interaction noise. To address the above challenges, we propose a novel approach named Gradient and Structure Consistency (GSCon). Our strategy operates at both the overall and individual levels to achieve balanced optimization and effective interaction, respectively. At the overall level, to avoid the optimization suppression of one modality on others, we construct a balanced gradient direction that aligns each modality’s optimization direction, ensuring unbiased convergence. Simultaneously, at the individual level, to avoid the interaction noise caused by multimodal alignment, we align the spatial structure of samples in different modalities. The spatial structure of the samples will not differ due to modal heterogeneity, achieving effective inter-modal interaction. Extensive experiments on multimodal emotion recognition and multimodal intention understanding datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/ShiQingHongYa/GSCon

Abstract:
Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model’s flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.

Abstract:
As a prerequisite for many vision-oriented tasks, image deraining is an effective solution to alleviate performance degradation of these tasks on rainy days. In recent years, the introduction of deep learning has driven significant developments in deraining techniques. However, due to the inherent constraints of synthetic datasets and the insufficient robustness of network architecture designs, most existing methods struggle to fit varied rain patterns and to adapt to the transition from synthetic rainy images to real ones, ultimately resulting in unsatisfactory restoration outcomes. To address these issues, we propose a reduced biquaternion dual-branch deraining U-Network (RQ-D2UNet) for better deraining performance, which is the first attempt to apply the reduced biquaternion-valued neural network to the deraining task. The algebraic properties of the reduced biquaternion (RQ) can facilitate modeling the rainy artifacts more accurately while preserving the underlying spatial structure of the background image. The comprehensive design scheme of U-shaped architecture and dual-branch structure can extract multi-scale contextual information and fully explore the mixed correlation between rain and rain-free features. Moreover, we also extend the self-attention and convolutional attention mechanisms to the RQ domain, which allows the proposed model to balance both global dependency capture and local feature extraction. Extensive experimental results on various rainy datasets (i.e., rain streak/rain-haze/raindrop/real rain), downstream vision applications (i.e., object detection and segmentation), and similar image restoration tasks (i.e., image desnowing and low-light image enhancement) demonstrate the superiority and versatility of our proposed method.

Abstract:
Dynamic convolution demonstrates outstanding representation capabilities, which are crucial for natural image segmentation. However, it fails when applied to medical image segmentation (MIS) and infrared small target segmentation (IRSTS) due to limited data and limited fitting capacity. In this paper, we propose a new type of dynamic convolution called dynamic parameter convolution (DPConv) which shows superior fitting capacity, and it can efficiently leverage features from deep layers of encoder in reconstruction tasks to generate DPConv kernels that adapt to input variations. Moreover, we observe that DPConv, built upon deep features derived from reconstruction tasks, significantly enhances downstream segmentation performance. We refer to the segmentation network integrated with DPConv generated from reconstruction network as the siamese reconstruction-segmentation network (SRS). We conduct extensive experiments on seven datasets including five medical datasets and two infrared datasets, and the experimental results demonstrate that our method can show superior performance over several recently proposed methods. Furthermore, the zero-shot segmentation under unseen modality demonstrates the generalization of DPConv. The code is available at: https://github.com/fidshu/SRSNet

Abstract:
Conventional reconstruction-based video anomaly detection (VAD) methods implicitly model normality in latent spaces, which is limited by the generalization ability of latent features. Normalizing Flow (NF)-based methods have been introduced to address this issue, as they explicitly model the distribution of input data and achieve significant performance in VAD. However, existing NF-based methods are confined to Euclidean space, limiting their ability to model action hierarchies. While effective at capturing local joint dynamics and short-term temporal variations, they fail to encode kinematic dependencies and long-term pose evolution, ultimately struggling to discern ambiguous anomalies that deviate minimally from normal motion. In contrast, hyperbolic representation learning, with its ability to model hierarchical and complex relationships among actions, offers a promising solution to enhance the discriminative power between similar skeletal actions. Motivated by this, we propose a novel Dual-Space Normalizing Flow (DSNF) method. Specifically, we design a Dual-Space Parallel Graph Convolutional Network (DSPGCN) that synergistically integrates the strengths of both Euclidean and hyperbolic geometries to simultaneously capture local detail features of poses and intrinsic hierarchical relationships of actions. To enhance the model’s focus on discriminative features, we design an Adaptive Weighted Approximation Mass (AWAM) loss that dynamically adjusts weights to impose stronger constraints on regions with low discriminability in the dual space, encouraging the model to focus more on key discriminative features in hyperbolic space that reflect complex relationships between actions. Extensive experiments on public datasets demonstrate the effectiveness and robustness of our method in various VAD scenarios.

Abstract:
Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel max mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches. Code is available at https://github.com/RoryShao/CADTrans.git
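
To make the discrepancy measure named in the abstract above more concrete, the following is a minimal sketch of a plain (unconditional) multi-kernel maximum mean discrepancy between two feature sets. It is not the CADTrans implementation: the function names `rbf_mix` and `mk_mmd2`, the Gaussian kernels, and the bandwidth values are assumptions chosen for illustration. A conditional variant would, under the same assumptions, apply the estimate separately to samples that share a (pseudo-)label.

```python
import math

def rbf_mix(x, y, bandwidths):
    """Average of Gaussian (RBF) kernels with several bandwidths (assumed kernel family)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths) / len(bandwidths)

def mk_mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD between two lists of feature vectors."""
    def mean_kernel(a, b):
        # Mean kernel value over all pairs drawn from a and b.
        return sum(rbf_mix(p, q, bandwidths) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

easy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]   # hypothetical "easy" sample features
hard = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]   # hypothetical "hard" sample features
print(mk_mmd2(easy, easy), mk_mmd2(easy, hard))
```

The estimate is zero when both sets coincide and grows as the two feature distributions move apart, which is what makes it usable as an alignment objective.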

Affiliations: Institute of Optics and Electronics, the State Key Laboratory Cultivation Base of Atmospheric Optoelectronic Detection and Information Fusion, Jiangsu International Joint Laboratory on Meteorological Photonics and Optoelectronic Detection, and Jiangsu Engineering Research Center for Intelligent Optoelectronic Sensing Technology of Atmosphere, Nanjing University of Information Science and Technology, Nanjing, China; School of Electronic Information and Electrical Engineering, Anhui Jianzhu University, Hefei, China; Faculty of Computer Science, China University of Geosciences, Wuhan, China; Department of Technology of Computers and Communications, Escuela Politécnica, Hyperspectral Computing Laboratory, University of Extremadura, Cáceres, Spain

Abstract:
Hyperspectral image anomaly detection is challenging because anomalous targets are difficult to annotate. Autoencoder (AE)-based methods are widely used due to their excellent image reconstruction capability. However, traditional grid-based image representation methods struggle to capture long-range dependencies and model non-Euclidean structures. To address these issues, this paper proposes a self-supervised Masked Graph AutoEncoder (MGAE) for hyperspectral anomaly detection. MGAE utilizes a Graph Attention Network (GAT) autoencoder to reconstruct the background of hyperspectral images and identifies anomalies by comparing the reconstructed features with the original features. Specifically, we construct a topological graph structure of the hyperspectral image, which is then input into the GAT autoencoder for reconstruction, leveraging the multi-head attention mechanism to learn spatial and spectral features. To prevent the decoder from learning trivial solutions, we introduce a re-masking strategy that randomly masks both the input features and hidden representations during training, forcing the model to learn and reconstruct features under limited information, thereby improving detection performance. Additionally, the proposed loss function with graph Laplacian regularization (Twice Loss) minimizes variations in feature representations, leading to more consistent background reconstruction. Experimental results on several real-world hyperspectral datasets demonstrate that MGAE outperforms existing methods.
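
The abstract above does not give the exact form of the Twice Loss, so the following only sketches the generic graph Laplacian smoothness term that such a regularizer is typically built on. The function name `laplacian_smoothness` and the edge-list representation are assumptions made for this example. For an undirected weighted graph with Laplacian L = D - W and node features Z, the sum below equals trace(Z^T L Z).

```python
def laplacian_smoothness(features, edges):
    """Sum of w * ||z_i - z_j||^2 over undirected weighted edges (i, j, w).

    Small values mean that connected nodes have similar feature vectors.
    """
    total = 0.0
    for i, j, w in edges:
        total += w * sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return total

# Hypothetical 3-node graph: node features and weighted edges.
features = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
edges = [(0, 1, 1.0), (1, 2, 2.0)]
print(laplacian_smoothness(features, edges))
```

Adding a term of this kind to a reconstruction loss penalizes feature differences between connected nodes, which matches the stated goal of more consistent background reconstruction.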

Affiliations: School of Electronic and Information Engineering and the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; School of Electronic and Information Engineering, Beihang University, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Information and Communication Engineering and the Glasgow College, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
The emerging semantic compression has been receiving increasing research efforts most recently, capable of achieving high fidelity restoration during compression, even at extremely low bitrates. However, existing semantic compression methods typically combine standard pipelines with either pre-defined or high-dimensional semantics, thus suffering from deficiency in compression. To address this issue, we propose a novel hierarchical semantic compression (HSC) framework that purely operates within intrinsic semantic spaces from generative models, which is able to achieve efficient compression for consistent semantic restoration. More specifically, we first analyse the entropy models for the semantic compression, which motivates us to employ a hierarchical architecture based on a newly developed general inversion encoder. Then, we propose the feature compression network (FCN) and semantic compression network (SCN), such that the middle-level semantic feature and core semantics are hierarchically compressed to restore both accuracy and consistency of image semantics, via an entropy model progressively shared by channel-wise context. Experimental results demonstrate that the proposed HSC framework achieves the state-of-the-art performance on subjective quality and consistency for human vision, together with superior performances on machine vision tasks given compressed bitstreams. This essentially coincides with human visual system in understanding images, thus providing a new framework for future image/video compression paradigms. The source code and trained models are available at https://github.com/bblgbr/HSC-TIP2025

Abstract:
In recent years, there has been growing interest in exploring and applying the training dynamics (TD) of deep neural networks (DNNs). Current studies typically rely on quite limited TD quantities and apply their sequences to understand or aid training. This study investigates how to create more effective TD representations, and then apply them to improve the training process of real learning tasks. Specifically, first, an epoch-wise vector comprising 142-dimensional TD quantities, such as loss, is extracted for each sample. Second, a new learning strategy with both self-supervised and supervised learning is designed to learn the deep TD representation of each sample on 200 typical image classification tasks. Third, two novel methods for noisy label detection and imbalance learning, respectively, are presented based on deep TD representations. Our study reveals that neighborhoods and logits are the most important TD quantities, unlike the traditional research that focuses on loss and margin. Moreover, our method based on deep TD representations achieves better performance and demonstrates that high-level TD quantities can facilitate understanding model training, leading to improvements in practical learning tasks, such as noisy label detection and imbalance learning. All code is available at https://github.com/limengyang1992/TD_Exploring

Abstract:
360° cameras have gained popularity over the last few years. In this paper, we propose two fundamental techniques, Field-of-View IoU (FoV-IoU) and 360Augmentation, for object detection in 360° images. Although most object detection neural networks designed for perspective images are applicable to 360° images in equirectangular projection (ERP) format, their performance deteriorates owing to the distortion in ERP images. Our method can be readily integrated with existing perspective object detectors and significantly improves their performance. FoV-IoU computes the intersection-over-union of two Field-of-View bounding boxes in a spherical image and can be used for training, inference, and evaluation. 360Augmentation is a data augmentation technique specific to the 360° object detection task that randomly rotates a spherical image and solves the bias due to the sphere-to-plane projection. We conduct extensive experiments on the 360° indoor dataset with different types of perspective object detectors and show the consistent effectiveness of our method.
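
As a rough illustration of the rotation idea behind 360Augmentation, the sketch below covers only the simplest case: a rotation about the vertical axis (yaw). In ERP format, longitude is linear in the horizontal pixel coordinate, so a yaw rotation reduces to a circular shift of pixel columns. The function name `rotate_erp_yaw` and the nested-list image layout are assumptions for this example and are not taken from the paper; rotations that also involve pitch or roll require resampling on the sphere and are not shown.

```python
def rotate_erp_yaw(image, yaw_degrees):
    """Rotate an equirectangular image about the vertical axis.

    `image` is a list of rows, each row a list of pixels (hypothetical layout).
    Positive yaw shifts the content to the right, wrapping around the seam.
    """
    width = len(image[0])
    # Convert the angle into a whole number of columns, wrapped into [0, width).
    shift = round(yaw_degrees / 360.0 * width) % width
    if shift == 0:
        return [list(row) for row in image]
    return [row[-shift:] + row[:-shift] for row in image]

image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
print(rotate_erp_yaw(image, 90))
```

In a detection pipeline the box annotations would need the same longitude offset applied to their centers so that labels stay aligned with the rotated image.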

Affiliations: Institute for Interdisciplinary Studies, Guangdong Provincial Key Laboratory of Intellectual Property and Big Data, Guangdong Polytechnic Normal University, Guangzhou, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Allied Health Department, Osaka University, Osaka, Japan; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan; College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China

Abstract:
Referring expression comprehension (REC) is a crucial task in understanding how a given text description identifies a target object within an image. Existing two-stage REC methods have demonstrated strong performance due to their rational framework design. However, during the encoding of object candidates in an image, most two-stage methods rely exclusively on features extracted from pre-trained detectors, often neglecting the contextual relationships between an object and its neighboring elements. This limitation hinders the full capture of contextual and relational information, reducing the discriminative power of object representations and negatively impacting subsequent processing. In this paper, we propose two novel plug-and-adapt modules: expression-guided label representation module (ELR) and cross-modal calibrated semantic module (CCS), designed to enhance two-stage REC methods. Specifically, the ELR module connects the noun phrases of the expression to the categorical labels of object candidates in the image, ensuring effective alignment between them. Guided by these connections, a CCS module is introduced to represent each object candidate by integrating its features with those of neighboring candidates from multiple perspectives. This preserves the intrinsic information of each candidate while incorporating relational cues from other objects, enabling more precise embeddings and effective downstream processing in two-stage REC methods. Extensive experiments on six datasets demonstrate the importance of incorporating prior statistical knowledge, and detailed analysis shows that the proposed modules strengthen the alignment between image and text. As a result, our method achieves competitive performance and is compatible with most two-stage methods in the REC task. The code is available on GitHub: https://github.com/freedom6927/ELR_CCS.git.

Abstract:
The variational autoencoder-based method has been widely used for modeling massive datasets. However, for 3D images, simultaneously achieving disentangled representations, low-variance Evidence Lower Bounds (ELBO), and a lightweight model remains a challenging task. In this work, we propose a Langevin dynamics-based inference framework that integrates target data information for efficient likelihood inference and disentangles appearance and morphology features via multi-scale energy-level encoding that enables unsupervised disentanglement. We adopt a quasi-symplectic integrator to handle the Hessian-related computational bottleneck that often arises in Langevin-based flow inference. We demonstrate both theoretical and empirical effectiveness of our approach compared to other methods. Experiments on public benchmarks and clinical 3D imaging datasets show that our Langevin-VAE achieves high-quality generation and learns disentangled shape and appearance representations with a model size of only 1.7M parameters. The code will be available at: https://github.com/LaplaceCenter/LangevinVAE

Abstract:
Video service providers need their delivery systems to be able to adapt to network conditions, user preferences, display settings, and other factors. HTTP Adaptive Streaming (HAS) offers dynamic switching between different video representations to simultaneously enhance bandwidth consumption and users’ streaming experiences. Per-shot encoding, pioneered by Netflix, optimizes the encoding parameters on each scene or shot. The Dynamic Optimizer (DO) uses the Video Multi-Method Assessment Fusion (VMAF) perceptual video quality prediction engine to deliver high-quality videos at reduced bitrates. Here we develop a perceptually optimized method of constructing optimal per-shot bitrate and quality ladders, using an ensemble of low-level features and Visual Information Fidelity (VIF) features. During inference, our method predicts the bitrate or quality ladder of a source video without any compression or quality estimation. We compare the performance of our model against other content-adaptive bitrate ladder prediction methods, a fixed bitrate ladder, and reference bitrate ladders constructed via exhaustive encoding using Bjøntegaard-delta (BD) metrics. Our proposed method shows excellent gains in bitrate and quality against the fixed bitrate ladder and only small losses against the reference bitrate ladder, while providing significant computational advantages.

Abstract:
Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful information. To address this, we introduce a shared expert to learn and capture common knowledge, serving as an effective way to construct a stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research.

Abstract:
Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

Abstract:
Source-Free unsupervised Domain Adaptation (SFDA) aims to classify target samples by only accessing a pre-trained source model and unlabelled target samples. Since no source data is available, transferring the knowledge from the source domain to the target domain is challenging. Existing methods normally exploit the pair-wise relation among target samples and attempt to discover their correlations by clustering these samples based on semantic features. The drawbacks of these methods include: 1) the pair-wise relation is limited to exposing the underlying correlations between only two samples, hindering the exploration of the structural information embedded in the target domain; and 2) the clustering process only relies on the semantic feature, while overlooking the critical effect of domain shift, i.e., the distribution differences between the source and target domains. To address these issues, we propose a new SFDA method that exploits the high-order neighborhood relation and explicitly takes the domain shift effect into account. Specifically, we formulate the SFDA as a hypergraph learning problem and construct hyperedges to explore the deep structural and context information among multiple samples. Moreover, we integrate a self-loop strategy into the constructed hypergraph to elegantly introduce the domain uncertainty of each sample. By clustering these samples based on hyperedges, both the semantic feature and domain shift effects are considered. We then describe an adaptive relation-based objective to tune the model with soft attention levels for all samples. Extensive experiments are conducted on Office-31, Office-Home, VisDA, DomainNet-126 and PointDA-10 datasets. The results demonstrate the superiority of our method over state-of-the-art counterparts. Our code is available at https://github.com/OUC-POVA/HG-SFDA

Abstract:
Small vehicle (SV) detection is crucial for urban security and traffic management. However, detecting such targets from a single image presents significant challenges due to the difficulty in discerning their dynamic movements. In this paper, we propose a deep joint image-level and feature-level processing network, IFNet, designed for detecting changes in SV using bi-temporal hyperspectral images. At the image level, a new Gumbel Softmax trick (GS)-based band selection strategy is introduced to address the problem of inconsistent spectral resolutions of bi-temporal images. At the feature level, to tackle the challenge of capturing edge and shape details of SV, we propose a feature-based edge enhancement module that extracts the target edge using high-level difference features; the refined change map is then generated with the guidance of the edge map. Moreover, current deep learning-based hyperspectral change detection (HCD) methods are limited by HCD datasets. Therefore, we propose a benchmark dataset, the Hyperspectral Vehicle Change Detection (HVCD) dataset, which consists of 201 pairs of aerial hyperspectral images, each with a size of 256 × 256, and exhibits inconsistent spectral resolutions across the bi-temporal data. Extensive experiments conducted on the HVCD dataset demonstrate that our IFNet obtains state-of-the-art performance with an acceptable computational cost.
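
For readers unfamiliar with the Gumbel Softmax trick mentioned above, here is a minimal sketch of how a relaxed one-hot selection over spectral bands can be sampled from learnable scores. It illustrates the generic trick only, not IFNet's band selection module: the function name `gumbel_softmax`, the example logits, and the temperature value are assumptions for this example.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed one-hot sample (soft band-selection weights) from logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

band_scores = [2.0, 0.5, -1.0, 0.0]   # hypothetical per-band logits
weights = gumbel_softmax(band_scores, temperature=0.5, rng=random.Random(0))
print(weights)
```

Lower temperatures push the weights toward a hard one-hot choice, while the softmax form keeps the selection differentiable with respect to the logits when implemented in an autodiff framework.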

Abstract:
Consistent perturbation strategies have emerged as a dominant paradigm in semi-supervised medical image segmentation. Nevertheless, prevailing approaches inadequately address two critical challenges: 1) prediction errors induced by data uncertainty from distribution shifts, and 2) loss instability caused by model uncertainty in parameter generalization. To overcome these limitations, we propose an Uncertainty-Guided Adaptive Correction (UGAC) framework with three key innovations. First, we develop a dual-path uncertainty rectification mechanism that employs normalized entropy measures to detect error-prone regions in unlabeled predictions, followed by bilateral correction through confidence-weighted fusion. Second, we introduce adversarial consistency constraints that leverage labeled data to discriminate authentic segmentation patterns, effectively regularizing uncertainty propagation in unlabeled predictions through spectral normalization. Third, we architect a frequency-aware segmentation backbone through our novel Freqfusion module, which performs adaptive spectral decomposition during feature decoding to explicitly disentangle high-frequency (boundary-aware) and low-frequency (structural) components, thereby enhancing anatomical boundary sensitivity. Comprehensive evaluations on MM-WHS, BUSI, M&Ms and PROMISE12 datasets demonstrate UGAC’s superior performance. The proposed framework exhibits robust generalizability across CT, MRI, and ultrasound modalities, while achieving significantly lower computational complexity than baseline UNet implementations. The code will be available at https://github.com/SIGMACX/UGAC.
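
The normalized entropy measure mentioned above can be sketched as follows. This is an illustrative sketch rather than the UGAC code: the function names, the nested-list probability map, and the threshold of 0.5 are assumptions for this example. Dividing the Shannon entropy by log(C) for C classes (C of at least 2) scales the score to roughly the range 0 to 1, so one threshold can be used regardless of the number of classes.

```python
import math

def normalized_entropy(probs, eps=1e-12):
    """Shannon entropy of a class distribution divided by log(number of classes)."""
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return entropy / math.log(len(probs))

def uncertain_mask(prob_map, threshold=0.5):
    """Flag pixels whose normalized entropy exceeds `threshold` (hypothetical value)."""
    return [[normalized_entropy(p) > threshold for p in row] for row in prob_map]

# Hypothetical 1 x 2 prediction map with two classes per pixel.
prob_map = [[[0.5, 0.5], [0.99, 0.01]]]
print(uncertain_mask(prob_map))
```

Pixels flagged by such a mask are the natural candidates for correction, while confident pixels can be left unchanged.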

Abstract:
Non-line-of-sight (NLOS) imaging aims to reconstruct scenes hidden from direct view and has broad applications in robotic vision, rescue operations, autonomous driving, and remote sensing. However, most existing methods rely on densely sampled transients from large, continuous relay surfaces, which limits their practicality in real-world scenarios with aperture constraints. To address this limitation, we propose an unsupervised zero-shot framework tailored for confocal NLOS imaging with aperture-limited relay surfaces. Our method leverages latent diffusion models to recover fully-sampled transients from undersampled versions by enforcing measurement consistency during the sampling process. To further improve recovered transient quality, we introduce a progressive recovery strategy that incrementally recovers missing transient values, effectively mitigating the impact of severe aperture limitations. In addition, to suppress error propagation during recovery, we develop a backpropagation-based error correction reconstruction algorithm that refines intermediate recovered transients by enforcing sparsity regularization in the voxel domain, enabling high-fidelity final reconstructions. Extensive experiments on both simulated and real-world datasets validate the robustness and generalization capability of our method across diverse aperture-limited relay surfaces. Notably, our method follows a zero-shot paradigm, requiring only a single pretraining stage without paired data or pattern-specific retraining, which makes it a more practical and generalizable framework for NLOS imaging.

Abstract:
Point cloud video streaming is promising for immersive media applications, which urges the development of efficient compression methods. However, existing approaches either suffer from poor performance or lack effective coder control mechanisms, making them impractical for networked point cloud services, where bandwidth is often constrained and fluctuates over time. Therefore, this paper proposes a system-level solution – a layered point cloud compressor, called Yak, to address these issues. Yak offers comprehensive support for both intra and inter-frame coding of geometry and attribute components in point cloud sequences. It consists of three layers: the Base Layer uses the standard G-PCC to encode a thumbnail counterpart downscaled from the input point cloud; the Enhancement Layer devises the end-to-end variational autoencoder to compress the original input conditioned on the base layer reconstruction, and the Dynamic Layer generates feature-space predictions as the temporal prior for conditional inter-frame coding. In addition, Yak devises the Content Analysis module to dynamically determine the optimal encoding parameters of each frame, by which bit budget is intelligently allocated for geometry and attribute components to maximize the overall rate-distortion (R-D) performance. Such accurate rate control relies on the parametric rate/distortion models whose parameters are initialized through one-pass template matching and frame-wise delta updating constrained by R-D optimization. Following standard evaluation guidelines, Yak has notably outperformed traditional rules-based methods such as MPEG G-PCC and V-PCC, as well as other learning-based approaches, while offering flexible networked adaption and affordable complexity.

Abstract:
Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at https://github.com/weijinbao1998/DaPT

Abstract:
With the increasing demand for terrain visualization in many fields, such as augmented reality, virtual reality and geographic mapping, traditional terrain scene modeling methods encounter great challenges in processing efficiency, content realism and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design, to utilize local and global features. Furthermore, the Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) for the sake of incorporating textual semantics into image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN, to perform the task of image super-resolution. To fulfill the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (The data set, model and source code are available at https://github.com/INDTLab/TG-TSGNet). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency with proper efficiency. We believe that the promising performance should be due to the ability of the TG-TSGNet not only to capture both the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.

Abstract:
Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face the hallucination challenges, where generated texts are factually inconsistent with visual input. Existing mitigation methods focus on surface symptoms of hallucination and heavily rely on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors of MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during the auto-regressive generation. Finally, it contrasts the output distributions of two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performances across MLLM architectures and scales.

Abstract:
With their differential sensitivity and high temporal resolution, event cameras can record detailed motion cues, which complement frame-based cameras and enhance object tracking, especially in challenging dynamic scenes. However, how to better match heterogeneous event-image data and exploit the rich complementary cues they offer remains an open issue. In this paper, we align event-image modalities by proposing a motion adaptive event sampling method, and we revisit the cross-complementarities of event-image data to design a bidirectional-enhanced fusion framework. Specifically, this sampling strategy can adapt to different dynamic scenes and integrate aligned event-image pairs. Besides, we design an image-guided motion estimation unit for extracting explicit instance-level motions, aiming at refining the uncertain event cues to distinguish primary objects and background. Then, a semantic modulation module is devised to utilize the enhanced object motion to modify the image features. Coupled with these two modules, this framework learns both the high motion sensitivity of events and the full texture of images to achieve more accurate and robust tracking. The proposed method is easily embedded in existing tracking pipelines and trained end-to-end. We evaluate it on four large benchmarks, i.e., FE108, VisEvent, FE240hz and CoeSot. Extensive experiments demonstrate that our method achieves state-of-the-art performance, with large improvements attributable to our sampling strategy and fusion concept.

Abstract:
Recent advances in learning-based methods have markedly enhanced the capabilities of image compression. However, these methods struggle with high bit-depth volumetric medical images, facing issues such as degraded performance, increased memory demand, and reduced processing speed. To address these challenges, this paper presents the Bit-Division based Lossless Volumetric Image Compression (BD-LVIC) framework, which is tailored for high bit-depth medical volume compression. The BD-LVIC framework skillfully divides the high bit-depth volume into two lower bit-depth segments: the Most Significant Bit-Volume (MSBV) and the Least Significant Bit-Volume (LSBV). The MSBV concentrates on the most significant bits of the volumetric medical image, capturing vital structural details in a compact manner. This reduction in complexity greatly improves compression efficiency using traditional codecs. Conversely, the LSBV deals with the least significant bits, which encapsulate intricate texture details. To compress this detailed information effectively, we introduce an effective learning-based compression model equipped with a Transformer-Based Feature Alignment Module, which exploits both intra-slice and inter-slice redundancies to accurately align features. Subsequently, a Parallel Autoregressive Coding Module merges these features to precisely estimate the probability distribution of the least significant bit-planes. Our extensive testing demonstrates that the BD-LVIC framework not only sets new performance benchmarks across various datasets but also maintains a competitive coding speed, highlighting its significant potential and practical utility in the realm of volumetric medical image compression.
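
The bit-division step described above can be illustrated with a minimal sketch. It assumes a 16-bit unsigned volume split at 8 bits; the split point, the function names `split_bit_volume` and `merge_bit_volume`, and the random test volume are illustrative assumptions rather than details given in the abstract. The sketch only shows that the split is lossless and does not model the codecs applied to each part.

```python
import numpy as np

def split_bit_volume(volume, low_bits=8):
    """Split an unsigned-integer volume into MSB and LSB sub-volumes.

    `low_bits` (hypothetical parameter) is the number of bits kept in the LSBV.
    """
    msbv = volume >> low_bits                  # most significant bits
    lsbv = volume & ((1 << low_bits) - 1)      # least significant bits
    return msbv, lsbv

def merge_bit_volume(msbv, lsbv, low_bits=8):
    """Recombine the two sub-volumes into the original volume."""
    return (msbv << low_bits) | lsbv

# Synthetic 16-bit volume standing in for a medical scan (assumption).
rng = np.random.default_rng(0)
volume = rng.integers(0, 2**16, size=(4, 32, 32), dtype=np.uint16)

msbv, lsbv = split_bit_volume(volume, low_bits=8)
restored = merge_bit_volume(msbv, lsbv, low_bits=8)
```

Because both halves are kept exactly, `restored` matches `volume` bit for bit, which is the property a lossless pipeline built on this split would rely on.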

Abstract:
Accurately matching local features between a pair of images corresponding to the same 3D scene is a challenging computer vision task. Previous studies typically utilize attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images for visual and geometric information reasoning. However, in the context of local feature matching, a significant number of keypoints are non-repeatable due to factors like occlusion and failure of the detector, and thus irrelevant for message passing. The connectivity with non-repeatable keypoints not only introduces redundancy, resulting in limited efficiency (quadratic computational complexity w.r.t. the keypoint number), but also interferes with the representation aggregation process, leading to limited accuracy. Aiming at the best of both worlds on accuracy and efficiency, we propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide compact and meaningful message passing. More specifically, our Bilateral Context-Aware Sampling (BCAS) Module first dynamically samples two small sets of well-distributed keypoints with high matchability scores from the image pair. Then, our Matchable Keypoint-Assisted Context Aggregation (MKACA) Module regards sampled informative keypoints as message bottlenecks and thus constrains each keypoint only to retrieve favorable contextual information from intra- and inter-matchable keypoints, evading the interference of irrelevant and redundant connectivity with non-repeatable ones. Furthermore, considering the potential noise in initial keypoints and sampled matchable ones, the MKACA module adopts a matchability-guided attentional aggregation operation for purer data-dependent context propagation. By these means, MaKeGNN outperforms state-of-the-art methods on multiple highly challenging benchmarks, while significantly reducing computational and memory complexity compared to typical attentional GNNs.

Abstract:
Reducing the radiation dose in CT scanning is important for limiting harm to human health in clinical settings. A promising way is to replace normal-dose CT (NDCT) imaging with low-dose CT (LDCT) imaging, which uses lower tube voltage and tube current. This often brings severe noise to the LDCT images, which adversely affects diagnostic accuracy. Most existing LDCT image denoising networks are trained either with synthetic LDCT images or with real-world LDCT and NDCT image pairs that have large spatial misalignment. However, the synthetic noise is very different from the complex noise in real-world LDCT images, while the large spatial misalignment leads to inaccurate predictions of tissue structures in the denoised LDCT images. To make good use of real-world LDCT and NDCT image pairs for LDCT image denoising, in this paper, we introduce a new Patch Similarity Purification (PSP) strategy to construct a high-quality dataset for network training. Specifically, our PSP strategy first performs binarization for each pair of image patches cropped from the corresponding LDCT and NDCT image pairs. For each pair of binary masks, it then computes their similarity ratio by common mask calculation, and the patch pair is selected as a training sample if its mask similarity ratio is higher than a threshold. By using our PSP strategy, each training set of our Rabbit and Patient datasets contains hundreds of thousands of real-world LDCT and NDCT image patch pairs with negligible misalignment. Extensive experiments demonstrate the usefulness of our PSP strategy for purifying the training data and the effectiveness of training LDCT image denoising networks on our datasets. The code and dataset are provided at https://github.com/TuTusong/PSP.
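
The patch-selection rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the binarization here is a fixed intensity threshold, and the "similarity ratio" is taken to be the fraction of pixels on which the two binary masks agree. The abstract does not define either precisely, and the names `mask_similarity_ratio`, `select_patch_pairs`, `bin_threshold` and `sim_threshold` are hypothetical.

```python
import numpy as np

def mask_similarity_ratio(ldct_patch, ndct_patch, bin_threshold):
    """Binarize both patches and return the fraction of agreeing mask pixels."""
    ldct_mask = ldct_patch > bin_threshold
    ndct_mask = ndct_patch > bin_threshold
    return float(np.mean(ldct_mask == ndct_mask))

def select_patch_pairs(pairs, bin_threshold, sim_threshold):
    """Keep only pairs whose mask similarity ratio is higher than the threshold."""
    return [
        (ldct, ndct)
        for ldct, ndct in pairs
        if mask_similarity_ratio(ldct, ndct, bin_threshold) > sim_threshold
    ]

# Toy data (assumption): a bright square on a dark background.
ndct = np.zeros((16, 16))
ndct[4:12, 4:12] = 1.0
ldct_aligned = ndct.copy()                   # same structure, well aligned
ldct_shifted = np.roll(ndct, 6, axis=1)      # same structure, shifted 6 pixels

ratio_aligned = mask_similarity_ratio(ldct_aligned, ndct, bin_threshold=0.5)
ratio_shifted = mask_similarity_ratio(ldct_shifted, ndct, bin_threshold=0.5)

selected = select_patch_pairs(
    [(ldct_aligned, ndct), (ldct_shifted, ndct)],
    bin_threshold=0.5,
    sim_threshold=0.9,
)
```

In this toy case the aligned pair is kept and the shifted pair is discarded, which mirrors the intent of filtering out misaligned training samples.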

Abstract:
This paper introduces a Bayesian framework for image inversion by deriving a probabilistic counterpart to the regularization-by-denoising (RED) paradigm. It additionally implements a Monte Carlo algorithm specifically tailored for sampling from the resulting posterior distribution, based on an asymptotically exact data augmentation (AXDA). The proposed algorithm is an approximate instance of split Gibbs sampling (SGS) which embeds one Langevin Monte Carlo step. The proposed method is applied to common imaging tasks such as deblurring, inpainting and super-resolution, demonstrating its efficacy through extensive numerical experiments. These contributions advance Bayesian inference in imaging by leveraging data-driven regularization strategies within a probabilistic framework.

Abstract:
Self-supervised point cloud representation learning aims to acquire robust and general feature representations from unlabeled data. Recently, masked point modeling-based methods have shown significant performance improvements for point cloud understanding. Yet these methods rely on overlapping grouping strategies (the k-nearest neighbor algorithm), which cause early leakage of the structural information of masked groups, and they overlook the semantic modeling of object components, so that parts with the same semantics show obvious feature differences due to position differences. In this work, we rethink grouping strategies and pretext tasks that are more suitable for self-supervised point cloud representation learning and propose a novel hierarchical masked representation learning method, including an optimal transport-based hierarchical grouping strategy, a prototype-based part modeling module, and a hierarchical attention encoder. The proposed method enjoys several merits. First, the proposed grouping strategy partitions the point cloud into non-overlapping groups, eliminating the early leakage of structural information in the masked groups. Second, the proposed prototype-based part modeling module dynamically models different object components, ensuring feature consistency on parts with the same semantics. Extensive experiments on four downstream tasks demonstrate that our method surpasses state-of-the-art 3D representation learning methods. Furthermore, comprehensive ablation studies and visualizations demonstrate the effectiveness of the proposed modules.

Abstract:
Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to be over-smooth due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at https://github.com/zhaoyuzhi/QRNet.

Abstract:
This study addresses the challenge of controlling the global color aspect of images generated by a diffusion model without training or fine-tuning. We rewrite the guidance equations to ensure that the outputs are closer to a known color map, without compromising the quality of the generation. Our method results in new guidance equations. In the context of color guidance, we show that the scaling of the guidance should not decrease but rather increase throughout the diffusion process. In a second contribution, our guidance is applied in a compression framework, where we combine both semantic and general color information of the image to decode at very low cost. We show that our method is effective in improving the fidelity and realism of compressed images at extremely low bit rates ( 10^-2 bpp), performing better on these criteria when compared to other classical or more semantically oriented approaches. The implementation of our method is available on gitlab at https://gitlab.inria.fr/tbordin/color-guidance.

Abstract:
Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.

Abstract:
In end-to-end learned image compression, encoder and decoder are jointly trained to minimize an R + λD cost function, where λ controls the trade-off between rate of the quantized latent representation and image quality. Unfortunately, a distinct encoder-decoder pair with millions of parameters must be trained for each λ, hence the need to switch encoders and to store multiple encoders and decoders on the user device for every target rate. This paper proposes to exploit a differentiable quantizer designed around a parametric sum of hyperbolic tangents, called STanH, that relaxes the step-wise quantization function. STanH is implemented as a differentiable activation layer with learnable quantization parameters that can be plugged into a pre-trained fixed rate model and refined to achieve different target bitrates. Experimental results show that our method enables variable rate coding with comparable efficiency to the state-of-the-art, yet with significant savings in terms of ease of deployment, training time, and storage costs.
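
To make the idea of a "parametric sum of hyperbolic tangents" concrete, here is a minimal sketch of one possible form. It assumes the relaxation is a weighted sum of shifted tanh terms with a shared sharpness `beta`; the function name `stanh_sketch`, the fixed weights and biases, and the value of `beta` are illustrative assumptions, whereas in the paper the quantization parameters are learnable.

```python
import numpy as np

def stanh_sketch(x, weights, biases, beta):
    """Soft staircase: sum_i weights[i] * tanh(beta * (x - biases[i]))."""
    x = np.asarray(x, dtype=float)[..., None]      # add an axis for the tanh terms
    return np.sum(weights * np.tanh(beta * (x - biases)), axis=-1)

# Steps placed at half-integers so that the staircase levels are -2..2 (assumption).
biases = np.array([-1.5, -0.5, 0.5, 1.5])
weights = np.full(biases.shape, 0.5)

x = np.array([-1.2, 0.2, 0.7, 1.9])
soft = stanh_sketch(x, weights, biases, beta=2.0)    # smooth, differentiable
hard = stanh_sketch(x, weights, biases, beta=50.0)   # close to integer rounding
```

With a small `beta` the function is smooth and passes useful gradients. As `beta` grows it approaches the step-wise rounding function, which is the sense in which such a sum relaxes hard quantization.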

Abstract:
The encoder-decoder architecture is a prevailing paradigm for semantic segmentation. It has been discovered that aggregation of multi-stage encoder features plays a significant role in capturing discriminative pixel representation. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and treat it as a Query Update (Q-UP) task. Pixel-wise affinity scores are calculated between the high-resolution query map and low-resolution feature map to dynamically broadcast low-resolution pixel features to match a higher resolution. Unlike prior works (e.g. bilinear interpolation) that only exploit sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field via a data-dependent manner. To alleviate intra-category feature variance, we substitute source pixel features for feature reconstruction with their corresponding category prototype that is assessed by averaging all pixel features belonging to that category. Besides, a memory module is proposed to explore the capacity of category prototypes at the dataset level. We refer to the method as Category Prototype Transformer (CPT). We conduct extensive experiments on popular benchmarks. Integrating CPT into a feature pyramid structure exhibits superior performance for semantic segmentation even with low-resolution feature maps, e.g. 1/32 of the input size, significantly reducing computational complexity. Specifically, the proposed method obtains a compelling 55.5% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.
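
The affinity-based upsampling idea can be sketched roughly as an attention-style lookup. The sketch assumes a scaled dot-product affinity followed by a softmax over all low-resolution pixels; the abstract does not specify the affinity function, so this choice, the function names, and the random inputs are assumptions rather than the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def query_update_upsample(query_hr, feat_lr):
    """Rebuild high-resolution features from low-resolution ones.

    query_hr: (N_hr, C) flattened high-resolution query map.
    feat_lr:  (N_lr, C) flattened low-resolution feature map.
    """
    channels = query_hr.shape[-1]
    affinity = query_hr @ feat_lr.T / np.sqrt(channels)   # (N_hr, N_lr) scores
    weights = softmax(affinity, axis=-1)                   # global receptive field
    return weights @ feat_lr, weights

rng = np.random.default_rng(0)
query_hr = rng.normal(size=(16 * 16, 8))   # 16x16 query map, 8 channels
feat_lr = rng.normal(size=(4 * 4, 8))      # 4x4 feature map, 8 channels
upsampled, weights = query_update_upsample(query_hr, feat_lr)
```

Each high-resolution pixel receives a convex combination of all low-resolution features, in contrast to bilinear interpolation, which only mixes a fixed local neighborhood. Replacing the rows of `feat_lr` with per-category mean features would give a prototype-based variant in the spirit of the abstract.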

Abstract:
Semi-supervised learning based on consistency learning offers significant promise for enhancing medical image segmentation. Current approaches use copy-paste as an effective data perturbation technique to facilitate weak-to-strong consistency learning. However, these techniques often lead to a decrease in the accuracy of synthetic labels corresponding to the synthetic data and introduce excessive perturbations to the distribution of the training data. Such over-perturbation causes the data distribution to stray from its true distribution, thereby impairing the model’s generalization capabilities as it learns the decision boundaries. We propose a weak-to-strong consistency learning framework that integrally addresses these issues with two primary designs: 1) it emphasizes the use of highly reliable data to enhance the quality of labels in synthetic datasets through cross-copy-pasting between labeled and unlabeled datasets; 2) it employs uncertainty estimation and foreground region constraints to meticulously filter the regions for copy-pasting, thus the copy-paste technique implemented introduces a beneficial perturbation to the training data distribution. Our framework expands the copy-paste method by addressing its inherent limitations, and amplifying the potential of data perturbations for consistency learning. We extensively validated our model using six publicly available medical image segmentation datasets across different diagnostic tasks, including the segmentation of cardiac structures, prostate structures, brain structures, skin lesions, and gastrointestinal polyps. The results demonstrate that our method significantly outperforms state-of-the-art models. For instance, on the PROMISE12 dataset for the prostate structure segmentation task, using only 10% labeled data, our method achieves a 15.31% higher Dice score compared to the baseline models. Our experimental code will be made publicly available at https://github.com/slhuang24/RCP4CL.

Abstract:
The main challenge of multimodal change detection (MCD) is that multimodal bitemporal images (MBIs) cannot be compared directly to identify changes. To overcome this problem, this paper proposes a novel commonality feature representation learning (CFRL) method and constructs a CFRL-based unsupervised MCD framework. The CFRL is composed of a Siamese-based encoder and two decoders. First, the Siamese-based encoder maps the original MBIs into the same feature space to extract the representative features of each modality. Then, the two decoders are used to reconstruct the original MBIs by regressing themselves, respectively. Meanwhile, we swap the decoders to reconstruct the pseudo-MBIs to conduct modality alignment. Subsequently, all reconstructed images are input to the Siamese-based encoder again to map them into the same feature space, by which representative features are obtained. On this basis, latent commonality features between MBIs can be extracted by minimizing the distance between these representative features. These latent commonality features are comparable and can be used to identify changes. Notably, the proposed CFRL can be performed simultaneously in two modalities corresponding to MBIs. Therefore, two change magnitude images (CMIs) can be generated simultaneously by measuring the difference between the commonality features of MBIs. Finally, a simple threshold algorithm or a clustering algorithm can be employed to divide CMIs into binary change maps. Extensive experiments on six publicly available MCD datasets show that the proposed CFRL-based framework can achieve superior performance compared with other state-of-the-art approaches.

Affiliations: School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; College of Artificial Intelligence, Hebei University of Technology, Tianjin, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Data Science and Artificial Intelligence, Dongbei University of Finance and Economics, Dalian, China; School of Computer Science and the Center for Optical Imagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China

Abstract:
Label propagation (LP) is a popular semi-supervised learning technique that propagates labels from a training dataset to a test one using a similarity graph, assuming that nearby samples should have similar labels. However, the recent cross-domain problem assumes that training (source domain) and test data sets (target domain) follow different distributions, which may unexpectedly degrade the performance of LP due to small similarity weights connecting the two domains. To address this problem, we propose optimal graph learning-based label propagation (OGL2P), which optimizes one cross-domain graph and two intra-domain graphs to connect the two domains and preserve domain-specific structures, respectively. During label propagation, the cross-domain graph draws two labels close if they are nearby in feature space and from different domains, while the intra-domain graph pulls two labels close if they are nearby in feature space and from the same domain. This makes label propagation more insensitive to cross-domain problems. During graph embedding, we optimize the three graphs using features and labels in the embedded subspace to extract locally discriminative and domain-invariant features and make the graph construction process robust to noise in the original feature space. Notably, as a more relaxed constraint, locally discriminative and domain-invariant can somewhat alleviate the contradiction between discriminability and domain-invariance. Finally, we conduct extensive experiments on five cross-domain image classification datasets to verify that OGL2P outperforms some state-of-the-art cross-domain approaches.

Abstract:
The goal of pedestrian trajectory retrieval is to infer the multi-camera path of a targeted pedestrian using images or videos from a camera network, which is crucial for passenger flow analytics and individual pedestrian retrieval. Conventional approaches hinge on spatiotemporal modeling, necessitating the gathering of positional information for each camera and trajectory data between every camera pair for the training phase. To mitigate these stringent requirements, our proposed methodology employs solely temporal information for modeling. Specifically, we introduce an Implicit Trajectory Encoding scheme, dubbed Temporal Rotary Position Embedding (T-RoPE), which integrates the temporal aspects of within-camera tracklets directly into their visual representations, thereby shaping a novel feature space. Our analysis reveals that, within this refined feature space, the challenge of inter-camera trajectory extraction can be effectively addressed by delineating a linear trajectory manifold. The visual characteristics gleaned from each candidate trajectory are utilized to compare and rank against the query feature, culminating in the ultimate trajectory retrieval outcome. To validate our method, we collected a new pedestrian trajectory dataset from a multi-storey shopping mall, namely the Mall Trajectory Dataset. Extensive experimentation across diverse datasets has demonstrated the versatility of our T-RoPE module as a plug-and-play enhancement to various network architectures, significantly enhancing the precision of pedestrian trajectory retrieval tasks. The dataset and code are released at https://github.com/zhangxin1995/MTD.

Abstract:
Semantic segmentation methods enhance robust and reliable understanding under adverse illumination conditions by integrating complementary information from visible and thermal infrared (RGB-T) images. Existing methods primarily focus on designing various feature fusion modules between different modalities, overlooking that feature learning is the critical aspect of scene understanding. In this paper, we propose a novel module-free Multiplex Interactive Learning Network (MiLNet) for RGB-T semantic segmentation, which adeptly integrates multi-model, multi-modal, and multi-level feature learning, fully exploiting the potential of multiplex feature interaction. Specifically, robust knowledge is transferred from the vision foundation model to our task-specific model to enhance its segmentation performance. In the task-specific model, an asymmetric simulated learning strategy is introduced to facilitate mutual learning of geometric and semantic information between high- and low-level features across modalities. Additionally, an inverse hierarchical fusion strategy based on feature learning pairs is adopted and further refined using multilabel and multiscale supervision. Experimental results on the MFNet and PST900 datasets demonstrate that MiLNet outperforms state-of-the-art methods in terms of mIoU. As a limitation, the model’s performance under few-sample conditions could be improved further. The code and results of our method are available at https://github.com/Jinfu-pku/MiLNet.

Abstract:
How to aggregate spatial-temporal information plays an essential role in video super-resolution (VSR) tasks. Despite the remarkable success, existing methods adopt static convolution to encode spatial-temporal information, which lacks flexibility in aggregating information in large-scale remote sensing scenes, as they often contain heterogeneous features (e.g., diverse textures). In this paper, we propose a spatial feature diversity enhancement module (SDE) and channel diversity enhancement module (CDE), which explore the diverse representation of different local patterns while aggregating the global response with compactly channel-wise embedding representation. Specifically, SDE introduces multiple learnable filters to extract representative spatial variants and encodes them to generate a dynamic kernel for enriched spatial representation. To explore the diversity in the channel dimension, CDE exploits the discrete cosine transform to transform the feature into the frequency domain. This enriches the channel representation while mitigating massive frequency loss caused by pooling operation. Based on SDE and CDE, we further devise a multi-axis feature diversity enhancement (MADE) module to harmonize the spatial, channel, and pixel-wise features for diverse feature fusion. These elaborate strategies form a novel network for satellite VSR, termed MADNet, which achieves favorable performance against state-of-the-art method BasicVSR++ in terms of average PSNR by 0.14 dB on various video satellites, including JiLin-1, Carbonite-2, SkySat-1, and UrtheCast. Code will be available at https://github.com/XY-boy/MADNet

Abstract:
Unsupervised efficient domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, while maintaining low storage cost and high retrieval efficiency. However, existing methods typically fail to address potential noise in the target domain, and directly align high-level features across domains, thus resulting in suboptimal retrieval performance. To address these challenges, we propose a novel Cross-Domain Diffusion with Progressive Alignment method (COUPLE). This approach revisits unsupervised efficient domain adaptive retrieval from a graph diffusion perspective, simulating cross-domain adaptation dynamics to achieve a stable target domain adaptation process. First, we construct a cross-domain relationship graph and leverage noise-robust graph flow diffusion to simulate the transfer dynamics from the source domain to the target domain, identifying lower noise clusters. We then leverage the graph diffusion results for discriminative hash code learning, effectively learning from the target domain while reducing the negative impact of noise. Furthermore, we employ a hierarchical Mixup operation for progressive domain alignment, which is performed along the cross-domain random walk paths. Utilizing target domain discriminative hash learning and progressive domain alignment, COUPLE enables effective domain adaptive hash learning. Extensive experiments demonstrate COUPLE’s effectiveness on competitive benchmarks.

Abstract:
Auroral image classification has long been a focus of research in auroral physics. However, current methods for automatic auroral classification typically assume that only one type of aurora is present in an auroral image. This oversight neglects the complex transition states and coexistence of multiple types during the auroral evolution process, thus limiting the exploration of the intricate semantics of auroral images. To fully exploit the physical information embedded in auroral images, this paper proposes a multi-label auroral classification method, termed MLAC, which integrates convolutional neural network (CNN) and Transformer architectures. Firstly, we introduce a multi-scale feature fusion framework that enables the model to capture both fine-grained features and high-level information in auroral images, resulting in a more comprehensive representation of auroral features. Secondly, we propose a lightweight multi-head self-attention mechanism that captures long-range dependencies between pixels during the multiscale feature fusion process, which is crucial for distinguishing subtle differences between auroral types. Furthermore, we design a residual focused multilayer perceptron module that integrates large kernel depth-wise convolution with an improved multilayer perceptron. This integration enhances the model’s ability to represent complex spatial structure, thus improving local feature extraction and global contextual understanding. The proposed method achieves a mean average precision (mAP) of 88.20% on the auroral observation data collected at the Yellow River Station from 2003 to 2008. This performance significantly surpasses that of the most advanced multi-label classification models while maintaining competitive computational efficiency. Moreover, our method also outperforms the state-of-the-art multi-label methods in both computational efficiency and classification accuracy on two publicly available multi-label image datasets: WIDER-Attribute and VOC2007. These results demonstrate that our method skillfully leverages the robust feature extraction capability of CNNs for local features and the superior global information processing capability of Transformer.

Abstract:
Prompt learning has recently been introduced into the adaptation of pre-trained vision-language models (VLMs) by tuning a set of trainable tokens to replace hand-crafted text templates. Despite the encouraging results achieved, existing methods largely rely on extra annotated data for training. In this paper, we investigate a more realistic scenario, where only the unlabeled test data is available. Existing test-time prompt learning methods often learn a separate prompt for each test sample. However, relying solely on a single sample heavily limits the performance of the learned prompts, as it neglects the task-level knowledge that can be gained from multiple samples. To that end, we propose a novel test-time prompt learning method for VLMs, called Task-to-Instance PromPt LEarning (TIPPLE), which adopts a two-stage training strategy to leverage both task- and instance-level knowledge. Specifically, we reformulate the effective online pseudo-labeling paradigm along with two tailored components, an auxiliary text classification task and a diversity regularization term, to serve the task-oriented prompt learning. After that, the learned task-level prompt is further combined with a tunable residual for each test sample to integrate instance-level knowledge. We demonstrate the superior performance of TIPPLE on 15 downstream datasets, e.g., an average improvement of 1.87% over the state-of-the-art method using the ViT-B/16 visual backbone. Our code is open-sourced at https://github.com/zhiheLu/TIPPLE.

Abstract:
Cross-domain few-shot learning aims to achieve swift generalization between a source domain and a target domain using a limited number of images. Current research predominantly relies on generalized feature embeddings, employing metric classifiers in Euclidean space for classification. However, due to existing disparities among different data domains, attaining generalized features in the embedding becomes challenging. Additionally, the rise in data domains leads to high-dimensional Euclidean spaces. To address the above problems, we introduce a cross-domain few-shot learning method named Hyperbolic Insights with Knowledge Distillation (HIKD). By integrating knowledge distillation, it enhances the model’s generalization performance, thereby significantly improving task performance. Hyperbolic space, in comparison to Euclidean space, offers a larger capacity and supports the learning of hierarchical structures among images, which can aid generalized learning across different data domains. We therefore map the Euclidean space features to the hyperbolic space via hyperbolic embedding and utilize a hyperbolic fitting distillation method in the meta-training phase to obtain a multi-domain unified generalization representation. In the meta-testing phase, accounting for biases between the source and target domains, we present a hyperbolic adaptive module to adjust embedded features and eliminate the inter-domain gap. Experiments on the Meta-Dataset demonstrate that HIKD outperforms state-of-the-art methods with an average accuracy of 80.6%.
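
The mapping from Euclidean features to hyperbolic space can be illustrated with the exponential map at the origin of the Poincaré ball. This is a sketch under assumptions: the abstract does not say which hyperbolic model or curvature HIKD uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `poincare_dist` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    norm = np.maximum(norm, 1e-12)  # avoid division by zero for v = 0
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y):
    """Geodesic distance on the unit Poincare ball (c = 1)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x ** 2, axis=-1)) * (1.0 - np.sum(y ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Hypothetical Euclidean embeddings with norms 0.5 and 5.0.
euclidean = np.array([[0.3, -0.4], [3.0, 4.0]])
hyperbolic = expmap0(euclidean)

# With c = 1, the hyperbolic distance from the origin is twice the Euclidean norm.
dist_from_origin = poincare_dist(np.zeros(2), hyperbolic[0])
```

Every mapped point lies strictly inside the unit ball, and distances grow rapidly near the boundary, which is the property that makes hyperbolic space suited to hierarchical structure.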

Abstract:
Video object detection is a challenging task in computer vision since it needs to handle the object appearance degradation problem that seldom occurs in the image domain. Off-the-shelf video object detection methods typically aggregate multi-frame features at one stroke to alleviate appearance degradation. However, these existing methods do not take supervision knowledge into consideration and thus still suffer from insufficient feature aggregation, resulting in false detections. In this paper, we take a different perspective on feature aggregation and propose a dynamic graph contrastive network (DGC-Net) for video object detection, with three improvements over existing methods. First, we design a frame-level graph contrastive module to aggregate frame features, enabling our DGC-Net to fully exploit discriminative contextual feature representations to facilitate video object detection. Second, we develop a proposal-level graph contrastive module to aggregate proposal features, enabling our DGC-Net to learn sufficiently discriminative semantic feature representations. Third, we present a graph transformer to dynamically adjust the graph structure by pruning the useless nodes and edges, which improves both accuracy and efficiency because it eliminates the geometric-semantic ambiguity and reduces the graph scale. Furthermore, building on the framework of DGC-Net, we develop DGC-Net Lite to perform real-time video object detection with a much faster inference speed. Extensive experiments conducted on the ImageNet VID dataset demonstrate that our DGC-Net outperforms current state-of-the-art methods. Notably, our DGC-Net obtains 86.3%/87.3% mAP when using ResNet-101/ResNeXt-101.

Abstract:
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs, which adaptively decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they have not effectively addressed the problem of prediction and reference quality adaptation, which limits the effective utilization of temporal prediction and the reduction of reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, low-quality predictions are suppressed and high-quality predictions are enhanced, so the codec can adaptively decide which spatial or channel locations of the predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With these filters, our codec can adapt to different reference qualities, making it easier to achieve the target reconstruction quality and reduce the reconstruction error propagation. Experimental results verify that our proposed modules can effectively help our codec achieve a higher compression performance.

Abstract:
Synthesizing multi-view images that are geometrically consistent with a given single-view image has been one of the hot issues in AIGC in recent years. Existing methods have achieved impressive performance on objects with symmetry or rigidity, but they are ill-suited to the human hand, because an image-captured human hand has more diverse poses and less attractive textures. In this paper, we propose NP-Hand, a framework that elegantly combines the diffusion model and the generative adversarial network: the multi-step diffusion is trained to synthesize low-resolution novel perspectives, while the single-step generator is exploited to further enhance synthesis quality. To maintain the consistency between inputs and synthesis, we introduce normal maps into NP-Hand to guide the whole synthesizing process. Comprehensive evaluations demonstrate that the proposed framework is superior to existing state-of-the-art models and more suitable for synthesizing hand images with faithful structures and realistic appearance details. The code will be released on our website.

Abstract:
Snapshot compressive imaging (SCI) compresses a 3D hyperspectral image (HSI) into a 2D measurement, significantly improving imaging efficiency while preserving the spatial and spectral information inherent in HSI. However, reconstructing high-quality HSIs from compressed measurements remains a core challenge due to the complexity of the inverse problem. Transformer-based methods have recently shown promising performance in HSI reconstruction. Nonetheless, effectively capturing local information, long-range dependencies, and multi-scale features within a reasonable computational cost remains a significant challenge. In this paper, we propose a dual-stage multiscale Transformer (DSMT) tailored for HSI reconstruction, which adopts a coarse-to-fine framework to enhance reconstruction accuracy and network generalization. Specifically, we design a novel U-Net architecture with a dual-branch encoder, where two separate branches process distinct features and are fused to achieve more refined reconstruction results. Full-scale skip connections are introduced to strengthen feature fusion across different stages. To further improve performance, we develop a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA). By utilizing two differently sized windows, DWM-MSA captures long-range dependencies and local information at multiple scales, significantly boosting reconstruction quality. Additionally, we introduce a hybrid positional embedding method, conditional/relative positional embedding (CRPE), which dynamically models both spatial and spectral dependencies, effectively enhancing the Transformer’s capacity for HSI reconstruction. Extensive quantitative and qualitative experiments on both the simulated and the real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DSMT. Code of this project is at https://github.com/chenx2000/DSMT.
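
The compression of a 3D HSI into a 2D measurement can be sketched with a toy forward model. This assumes a CASSI-style system (coded aperture followed by a per-band dispersion shift), which the abstract does not specify; the function name `sci_forward`, the binary mask, and the `step` parameter are illustrative assumptions.

```python
import numpy as np

def sci_forward(cube, mask, step=1):
    """Compress an (H, W, B) cube into an (H, W + step*(B-1)) 2D measurement."""
    h, w, bands = cube.shape
    measurement = np.zeros((h, w + step * (bands - 1)))
    for k in range(bands):
        # Modulate band k with the coded aperture, shift it by k*step, and accumulate.
        measurement[:, k * step : k * step + w] += mask * cube[:, :, k]
    return measurement

rng = np.random.default_rng(0)
cube = rng.random((4, 5, 3))                       # toy hyperspectral cube
mask = (rng.random((4, 5)) > 0.5).astype(float)    # assumed binary coded aperture
y = sci_forward(cube, mask, step=2)                # single 2D snapshot
```

Reconstruction is the inverse of this step: recovering the full cube from the snapshot `y` and the known mask, which is under-determined because many bands are summed into each measured pixel.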

Abstract:
Recent advances in text-to-image models have opened new frontiers in human-centric generation. However, these models cannot be directly employed to generate images with consistent newly coined identities. In this work, we propose CharacterFactory, a framework that allows sampling new characters with consistent identities in the latent space of GANs for diffusion models. More specifically, we consider the word embeddings of celeb names as ground truths for the identity-consistent generation task and train a GAN model to learn the mapping from a latent space to the celeb embedding space. In addition, we design a context-consistent loss to ensure that the generated identity embeddings can produce identity-consistent images in various contexts. Remarkably, the whole model only takes 10 minutes for training, and can sample infinite characters end-to-end during inference. Extensive experiments demonstrate excellent performance of the proposed CharacterFactory on character creation in terms of identity consistency and editability. Furthermore, the generated characters can be seamlessly combined with the off-the-shelf image/video/3D diffusion models. We believe that the proposed CharacterFactory is an important step for identity-consistent character generation. Code and Gradio demo are available at: https://qinghew.github.io/CharacterFactory/

Abstract:
Raw low-light image enhancement (LLIE) has achieved much better performance than sRGB-domain enhancement methods due to the merits of raw data. However, the ambiguity between the noisy-to-clean and raw-to-sRGB mappings may mislead single-stage enhancement networks. Two-stage networks avoid this ambiguity by handling the two mappings step by step or by decoupling them, but usually have high computational complexity. To solve this problem, we propose a single-stage network empowered by Feature Domain Adaptation (FDA) to decouple the denoising and color mapping tasks in raw LLIE. The denoising encoder is supervised by the clean raw image, and then the denoised features are adapted for the color mapping task by an FDA module. We propose a Lineformer to serve as the FDA, which can effectively explore the global and local correlations with fewer line buffers (friendly to the line-based imaging process). During inference, the raw supervision branch is removed. In this way, our network combines the advantage of a two-stage enhancement process with the efficiency of single-stage inference. Experiments on four benchmark datasets demonstrate that our method achieves state-of-the-art performance at a lower computing cost (60% of the FLOPs of the two-stage method DNF). Our code will be released after the acceptance of this work.

Abstract:
Single domain generalization (SDG) aims to transfer models trained on a single source domain to multiple unseen target domains while remaining robust to unknown domain shifts. The main challenge lies in learning domain-invariant features to mitigate the impact of domain shift. To address this challenge, we reconsider SDG from a causal perspective to capture the domain-invariant features accurately. Specifically, we present a Progressive Invariant Causal Feature Learning (PICF) method that leverages front-door adjustment to gradually obtain the invariant causal features for SDG. First, we introduce a foreground feature filter, which removes object-irrelevant confounders in a cyclical manner to extract the object-related causal features. Subsequently, to further enhance the causal feature invariance, we propose to train with augmented causal features by combining them with randomly sampled styles from the object-irrelevant feature distribution boundary. As a result, our model bridges the gap between one seen domain and multiple unseen ones by capturing the invariant causal features, which largely enhances the model’s generalization ability in SDG. In experiments, our method can be plugged into multiple state-of-the-art methods, and the significant performance improvements on multiple datasets demonstrate the superiority of our method. In particular, on the PACS dataset, our method achieves an accuracy improvement of 4.7%.

Abstract:
Removing image speckle effectively remains a significant challenge in image processing. In contrast to additive noise, speckle noise is a multiplicative noise whose intensity is proportional to the signal. The noise distribution therefore depends strongly on the signal intensity throughout the image, which makes it difficult to remove. In this study, we present a novel approach to speckle noise removal using dynamical threshold-based fractional anisotropic diffusion (DTFAD). The method simultaneously considers both gradient and gray-scale information in the image. In addition, the fractional derivative is integrated with anisotropic diffusion in the DTFAD model, which enhances the denoising effect while preserving the fundamental features and edges of the image. The design of a dynamic threshold function in the diffusion coefficient enables the diffusion pattern and intensity to change adaptively according to image information, thus effectively removing speckle noise. We establish the well-posedness of the DTFAD model and implement it using an explicit finite difference scheme. Extensive experiments demonstrate that the DTFAD model outperforms traditional anisotropic diffusion techniques and achieves a superior balance between denoising performance and texture preservation. This evidence demonstrates that the DTFAD model has the potential to be applied in practical engineering.
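
The signal dependence of multiplicative noise can be checked numerically. The sketch below assumes a common speckle model, unit-mean gamma noise with `looks` as its shape parameter; the abstract does not state which noise distribution is used, and the function name `add_speckle` is illustrative.

```python
import numpy as np

def add_speckle(image, looks, rng):
    """Multiply the image by unit-mean gamma noise (assumed speckle model)."""
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=image.shape)
    return image * noise

rng = np.random.default_rng(0)
dark = add_speckle(np.full(20000, 50.0), looks=4, rng=rng)      # low-intensity region
bright = add_speckle(np.full(20000, 200.0), looks=4, rng=rng)   # high-intensity region

# The noise standard deviation scales with the signal level (roughly 4x here),
# whereas additive noise would give the same spread in both regions.
ratio = bright.std() / dark.std()
```

Because the spread of the noise grows with intensity, a single fixed threshold on gradients or gray levels treats dark and bright regions inconsistently, which is one motivation for an adaptive threshold.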

Abstract:
Point cloud compression is critical for the success of immersive multimedia applications. For attribute compression in geometric point cloud compression (G-PCC), Region Adaptive Hierarchical Transform (RAHT) is the preferred coding method. This paper presents several advances to predictive coding with RAHT: 1) Sample Domain Prediction: Prediction in RAHT is done in the transform domain. This introduces undesirable distortion to the prediction signal because of fixed-point computations and leads to increased decoding complexity. We address this by applying prediction in the sample domain. The method opens the door to skipping the transform stage altogether when all residues are quantized to zero, leading to a significantly lighter decoder. 2) Reference Node Resampling: The inter-prediction signal derived in RAHT could have a different occupancy and weight distribution compared to the current block, causing a mismatch. To address this, we resample the reference node and align the occupancy and weight distribution. 3) Temporal Filtering: During inter-prediction, the reference node is simply copied as the prediction signal. This assumes a correlation coefficient of unity, which rarely holds. We introduce a temporal filtering mechanism conditioned on the sub-band that emulates low-pass filtering and achieves improved prediction. 4) Inter-Eligibility: During AC inter-prediction, both encoder and decoder have access to the DC of the current and the reference nodes. We use this information to derive an inter-eligibility criterion. Experimental results show considerable gains and reduced complexity that demonstrate the utility of the proposed methods. All the presented methods have been adopted into the second version of G-PCC.
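
As background for the transform that these tools modify, the elementary RAHT step merges two sibling nodes into one DC and one AC coefficient using their occupancy weights. The sketch below is a floating-point illustration of that well-known weighted butterfly, not the fixed-point G-PCC implementation and not the proposed sample-domain prediction; the function names are assumptions.

```python
import numpy as np

def raht_butterfly(a1, a2, w1, w2):
    """Weighted orthonormal transform of two sibling attributes into (DC, AC)."""
    s1, s2 = np.sqrt(w1), np.sqrt(w2)
    norm = np.sqrt(w1 + w2)
    dc = (s1 * a1 + s2 * a2) / norm
    ac = (-s2 * a1 + s1 * a2) / norm
    return dc, ac

def raht_inverse(dc, ac, w1, w2):
    """Invert the butterfly using the transpose of the orthonormal matrix."""
    s1, s2 = np.sqrt(w1), np.sqrt(w2)
    norm = np.sqrt(w1 + w2)
    a1 = (s1 * dc - s2 * ac) / norm
    a2 = (s2 * dc + s1 * ac) / norm
    return a1, a2

# Two leaf nodes (weight 1 each) with identical attributes: all energy goes to DC.
dc, ac = raht_butterfly(100.0, 100.0, w1=1, w2=1)

# Round trip with unequal weights and arbitrary inputs.
dc2, ac2 = raht_butterfly(7.0, -3.0, w1=3, w2=1)
rec = raht_inverse(dc2, ac2, w1=3, w2=1)
```

The merged node carries weight `w1 + w2` and its DC coefficient is passed to the next level, so the DC of both the current and the reference node is already available when the AC coefficients are coded.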

Abstract:
We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the best anchor of the latest geometry-based point cloud compression (G-PCC) standard that was proposed by the Moving Picture Experts Group (MPEG), the proposed method can achieve an average Bjøntegaard delta bitrate of -24.58% for the Y component (resp. -21.23% for YUV components) on the MPEG Category Solid dataset and -22.48% for the Y component (resp. -17.19% for YUV components) on the MPEG Category Dense dataset. This is the first instance that a learning-based attribute codec outperforms the G-PCC standard on these datasets by following the common test conditions specified by MPEG. 
Our source code will be made publicly available at https://github.com/sduxlmao/SPAC

Abstract:
Research on cross-scene classification of hyperspectral images (HSIs) based on domain generalization (DG) has recently received growing attention. The majority of existing methods achieve cross-scene classification of HSIs via data manipulation that generates more feature-rich samples. However, these methods do not sufficiently mine the complex features of HSIs, which limits the effectiveness of the newly generated HSI samples. Therefore, in this paper, we propose a novel single-source frequency transform (SFT), which realizes domain generalization by transforming the frequency features of samples and mainly consists of a frequency transform (FT) and balanced attentional consistency (BAC). Firstly, FT is designed to learn dynamic attention maps in the frequency space of samples that filter frequency components, improving the diversity of features in new samples. Moreover, BAC is designed based on the class activation map to improve the reliability of newly generated samples. Comprehensive experiments on three public HSI datasets demonstrate that the proposed method outperforms the state-of-the-art methods, with accuracy up to 5.14% higher than that of the second-best method.

Abstract:
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code is publicly available at https://github.com/HUANGLIZI/STPNet

Abstract:
Decoding seen visual content from non-invasive brain recordings is of significant scientific and practical value. Efforts have been made to recover the seen images from brain signals. However, most existing approaches cannot faithfully reflect the visual content due to insufficient image quality or semantic mismatches. Compared with reconstructing pixel-level visual images, speaking is a more efficient and effective way to explain visual information. Here we introduce a non-invasive neural decoder, termed MindGPT, which interprets perceived visual stimuli into natural language from functional Magnetic Resonance Imaging (fMRI) signals in an end-to-end manner. Specifically, our model builds upon a visually guided neural encoder with a cross-attention mechanism. Through the collaborative use of data augmentation techniques, this architecture permits us to guide latent neural representations towards a desired language semantic direction in a self-supervised fashion. In doing so, we found that the neural representations of MindGPT are explainable and can be used to evaluate the contributions of visual properties to language semantics. Our experiments show that the generated word sequences truthfully represent the visual information (with essential details) conveyed in the seen stimuli. The results also suggest that, with respect to language decoding tasks, the higher visual cortex (HVC) is more semantically informative than the lower visual cortex (LVC), and that using only the HVC can recover most of the semantic information. The source code for the MindGPT model is publicly available at https://github.com/JxuanC/MindGPT.

Abstract:
Single image super-resolution based on implicit image functions, which learns a universal model for arbitrary upsampling scales, has recently become a hot topic. By contrast, color-guided depth map super-resolution based on implicit function learning is less explored. The related research faces three questions. First, is it also necessary and applicable to fuse the depth feature and the color feature in the encoder with continuous upsampling scales? Second, is the scale information in the encoder as important as that in the decoder? Third, how can the affinity of location distance and content similarity across domains be modeled efficiently and effectively in the decoder? This paper proposes a transformer-based network to answer the above questions, which includes a depth super-resolution branch and a guidance extraction branch. Specifically, in the encoder, an implicit cross transformer is designed to fuse the guidance from the color feature with continuous coordinate mapping. In addition, unrelated guidance is filtered out by correlation evaluation in the high-dimensional feature space. Whereas existing methods introduce the scale only in the decoder, this paper additionally embeds the scale into the position encoding and the feed-forward network in the encoder to learn a scale-aware feature representation. In the decoder, the high-resolution depth feature is reconstructed by using the internal prior and the external guidance. The internal prior is implemented by implicit self-attention in the depth super-resolution branch, and the external guidance is exploited via implicit cross-attention between both branches. Finally, the decoded features complement each other to generate the high-resolution depth map. Extensive experiments on synthetic and real datasets for in-distribution and out-of-distribution upsampling scales validate the improved performance. The code and the models are publicly available at https://github.com/NaNRan13/GIDF

Abstract:
High-resolution natural image matting plays an important role in image editing, film-making and remote sensing due to its ability to accurately extract the foreground from a natural background. However, due to the complexity brought about by the growth in resolution, existing image matting methods cannot obtain high-quality alpha mattes on high-resolution images in reasonable time. To overcome this challenge, we introduce a high-resolution image matting framework based on alpha matte refinement from low resolution to high resolution (HRIMF-AMR). The proposed framework transforms the complex high-resolution image matting problem into a low-resolution image matting problem and a high-resolution alpha matte refinement problem. While the first problem is solved by adopting an existing image matting method, the latter is addressed by applying the Detail Difference Feature Extractor (DDFE) designed as a part of our work. The DDFE extracts detail difference features from high-resolution images by measuring the image feature difference between high-resolution images and low-resolution images. The low-resolution alpha matte is refined according to the extracted detail difference features, yielding the high-resolution alpha matte. In addition, the Matte Detail Resolution Difference (MDRD) loss is introduced to train the DDFE, which imposes an additional constraint on the extraction of detail difference features with mattes. Experimental results show that integrating HRIMF-AMR significantly enhances the performance of existing matting methods on high-resolution images of Transparent-460 and Alphamatting. Project page: https://github.com/yexianmin/HRAMR-Matting

Abstract:
Digital humans have witnessed extensive applications in various domains, necessitating related quality assessment studies. However, there is a lack of comprehensive digital human quality assessment (DHQA) databases. To address this gap, we propose SJTU-H3D, a subjective quality assessment database specifically designed for full-body digital humans. It comprises 40 high-quality reference digital humans and 1,120 labeled distorted counterparts generated with seven types of distortions. The SJTU-H3D database can serve as a benchmark for DHQA research, allowing evaluation and refinement of processing algorithms. Further, we propose a zero-shot DHQA approach that focuses on no-reference (NR) scenarios to ensure generalization capabilities while mitigating database bias. Our method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporate the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information. Additionally, we utilize dihedral angles as geometry descriptors to extract mesh features. By aggregating these measures, we introduce the Digital Human Quality Index (DHQI), which demonstrates significant improvements in zero-shot performance. The DHQI can also serve as a robust baseline for DHQA tasks, facilitating advancements in the field. The database and the code are available at https://github.com/zzc-1998/SJTU-H3D
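
The dihedral angle used as a geometry descriptor can be computed per mesh edge from the two triangles that share it. The sketch below is a minimal illustration, assuming the interior-angle convention (a flat surface gives π) and a hypothetical function name `dihedral_angle`; the abstract does not specify how DHQI computes or aggregates these angles.

```python
import numpy as np

def dihedral_angle(v0, v1, a, b):
    """Interior angle (radians) between triangles (v0, v1, a) and (v1, v0, b) along edge v0-v1."""
    edge = v1 - v0
    n1 = np.cross(edge, a - v0)    # normal of the first face
    n2 = np.cross(b - v0, edge)    # normal of the second face, consistent winding
    cos_normals = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    # The dihedral angle is pi minus the angle between the two face normals.
    return np.pi - np.arccos(np.clip(cos_normals, -1.0, 1.0))

v0 = np.array([0.0, 0.0, 0.0])
v1 = np.array([1.0, 0.0, 0.0])
a = np.array([0.0, 1.0, 0.0])

flat = dihedral_angle(v0, v1, a, np.array([0.0, -1.0, 0.0]))    # coplanar faces
folded = dihedral_angle(v0, v1, a, np.array([0.0, 0.0, 1.0]))   # right-angle fold
```

This convention does not distinguish convex from concave folds. Geometric distortions that roughen or oversmooth a mesh change the distribution of these per-edge angles, which is why they are a plausible descriptor of mesh quality.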

Abstract:
Hyperspectral video (HSV) provides rich spectral-spatial-temporal information, enabling the capture of complex object dynamics beyond the limitations of conventional single- and multi-modal tracking. However, current HSV tracking methods face challenges such as data scarcity, band gaps, spectral fragmentation, temporal underutilization, and high computational load, which constrain performance. In this article, we present SpectralTrack, a novel HSV tracking framework with spectral-spatial fusion and memory enhancement. SpectralTrack incorporates an explicit visual prompting module to mitigate band gaps and spectral fragmentation. We further introduce an extraction-matching-interaction module, which leverages a template-bridging search adapter and a multi-layer perceptron adapter within a multi-modal Transformer architecture for efficient cross-modal feature extraction, matching, and interaction. Additionally, a memory perception module enhances state reasoning by injecting temporal prompts to refine spectral and spatial cues. SpectralTrack follows parameter-efficient fine-tuning and feature-level fusion to alleviate data scarcity and reduce computational overhead. We instantiate two variants, SpectralTrack and SpectralTrack+, and evaluate them on nine HSV tracking datasets, demonstrating superior effectiveness over a wide range of existing trackers. Implementations and results will be available at https://github.com/YZCU/SpectralTrack

Affiliations: National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China; Lee Kong Chian School of Medicine, Nanyang Technological University, Nanyang Ave, Singapore; Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, Zhejiang, China; Institute for Infocomm Research (I²R), Agency for Science, Technology and Research (A*STAR), Nanyang Ave, Singapore

Abstract:
Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods on three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant features for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The code will be released at https://github.com/lyp-deeplearning/MIFNet

Abstract:
Graph-based multi-view clustering has attracted remarkable attention due to its impressive performance. However, the typical framework consisting of graph learning and indicator generation may fail to align learned graphs with the underlying data structure due to the unidirectional pipeline from refined graphs to indicator generation. Another common problem is the inadequate prior information in graph learning methods. This paper proposes a Bidirectional Probabilistic Multi-graph Learning and Decomposition (BPMLD) method by establishing an explicit bidirectional pipeline between graph learning and indicator generation for multi-view clustering. Specifically, we design a confidence term based on clustering probability indicators and fuse it with graph learning to form clustering confidence driven graph learning. Meanwhile, graph tensor learning is introduced to recover the high-order correlations among the refined graphs. We further propose a multi-graph probability decomposition module to adaptively produce cluster indicators with probability representation from the refined graphs. The seamless integration between graph learning and indicator generation enables them to interact directly and enhance each other. To solve the proposed model, we design an effective optimization algorithm. Extensive experiments demonstrate the effectiveness of our method compared to state-of-the-art methods. The code is available at: https://github.com/W-Xinxin/BPMLD.

Abstract:
The gaze estimation task aims to predict a 3D gaze direction or a 2D gaze point given a face or eye image. To improve the generalization of gaze estimation models to unseen new users, existing methods either disentangle the personalized information of all subjects from their gaze features, or integrate unrefined personalized information into blended embeddings. These methodologies are not rigorous, and their performance remains unsatisfactory. In this paper, we put forward a comprehensive perspective named ‘Disengage AND Integrate’ to deal with personalized information: for specified users, irrelevant personalized information should be discarded while relevant personalized information should be considered. Accordingly, we propose a novel Personalized Causal Network (PCNet) for generalizable gaze estimation. PCNet adopts a two-branch framework consisting of a subject-deconfounded appearance sub-network (SdeANet) and a prototypical personalization sub-network (ProPNet). SdeANet aims to explore causalities among facial images, gazes, and personalized information, and to extract a subject-invariant appearance-aware feature of each image by means of causal intervention. ProPNet aims to characterize customized personalization-aware features of arbitrary users with the help of a prototype-based subject identification task. Furthermore, the whole PCNet is optimized in a hybrid episodic training paradigm, which further improves its adaptability to new users. Experiments on three challenging datasets over within-domain and cross-domain gaze estimation tasks demonstrate the effectiveness of our method.

Abstract:
Stereo Image Super-Resolution (SSR) holds great promise in improving the quality of stereo images by exploiting the complementary information between left and right views. Most SSR methods primarily focus on the inter-view correspondences in low-resolution (LR) space. The potential of using a high-quality SR image of one view as a reference to benefit the SR of the other view is often overlooked, even though images with abundant textures contribute to accurate correspondences. Therefore, we propose Reference-based Iterative Interaction (RIISSR), which utilizes reference-based iterative pixel-wise and patch-wise matching, dubbed $P^2$-Matching, to establish cross-view and cross-resolution correspondences for SSR. Specifically, we first design the information perception block (IPB), cascaded in parallel, to extract hierarchical contextualized features for different views. Pixel-wise matching is embedded between two parallel IPBs to exploit cross-view interaction in LR space. Iterative patch-wise matching is then executed by utilizing the SR stereo pair as another mutual reference, capitalizing on the cross-scale patch recurrence property to learn high-resolution (HR) correspondences that improve SSR performance. Moreover, we introduce the supervised side-out modulator (SSOM) to re-weight local intra-view features and produce intermediate SR images, which seamlessly bridges the two matching mechanisms. Experimental results demonstrate the superiority of RIISSR over existing state-of-the-art methods.

Abstract:
High-speed cameras are crucial for capturing fast events beyond human perception, although challenges in terms of storage, bandwidth, and cost hinder their widespread use. As an alternative, snapshot compressive video can overcome these challenges by exploiting the principles of compressed sensing to capture compressive projections of dynamic scenes into a single image, which is then used to recover the underlying video by solving an ill-posed inverse problem. However, scalability in terms of spatial and temporal resolution is limited for both acquisition and reconstruction. In this work, we leverage time-division multiplexing to design a versatile scalable coded aperture approach that enables previously unseen spatio-temporal scalability for snapshot compressive video, offering on-the-fly, high compression ratios with minimal computational burden and low memory requirements. The proposed sampling scheme is universal and compatible with any compressive temporal imaging sampling matrix and reconstruction algorithm designed for low spatio-temporal resolutions. Simulations validated with a series of experimental results confirm that we can compress up to 512 frames of 2K × 2K resolution into a single snapshot, equivalent to a compression ratio of 0.2%, delivering an overall reconstruction quality exceeding 30 dB in PSNR for conventional reconstruction algorithms, and often surpassing 36 dB when utilizing the latest state-of-the-art deep learning reconstruction algorithms. The results presented in this paper can be reproduced in the following GitHub repository: https://github.com/FOGuzman/All-scalable-CACTI
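
The compression figure quoted above follows from simple arithmetic: one stored snapshot stands in for 512 frames, so the ratio is 1/512, or about 0.2%. The sketch below illustrates this together with the generic snapshot compressive imaging forward model (a sum of mask-modulated frames). It is a toy illustration only: the function names, the random binary masks, and the small sizes are assumptions, and the paper's time-division multiplexed coded aperture design is not reproduced here.

```python
import random


def snapshot_measurement(frames, masks):
    """Collapse T frames into one 2D snapshot: sum over time of mask * frame."""
    height, width = len(frames[0]), len(frames[0][0])
    snapshot = [[0.0] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for r in range(height):
            for c in range(width):
                snapshot[r][c] += mask[r][c] * frame[r][c]
    return snapshot


def compression_ratio_percent(num_frames):
    """Stored measurements per recovered sample, expressed as a percentage."""
    return 100.0 / num_frames


rng = random.Random(0)
T, H, W = 8, 4, 4  # toy sizes (assumption), not 512 frames at 2K x 2K
frames = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
# Hypothetical random binary masks; any temporal sampling matrix could be used.
masks = [[[rng.randint(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(T)]

snapshot = snapshot_measurement(frames, masks)
print(round(compression_ratio_percent(512), 3))  # 0.195, i.e. about 0.2%
```

Recovering the frames from `snapshot` is the ill-posed inverse problem mentioned in the abstract and is left to a reconstruction algorithm.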

Abstract:
In the pursuit of further coding gains beyond the versatile video coding (VVC) standard, the enhanced compression model (ECM) has been initiated by the Joint Video Exploration Team (JVET) with the aim of developing next-generation video coding techniques. In ECM, novel coding tools are studied to improve the coding efficiency for both camera-captured content and screen content. Intra block copy (IBC) has been included as a fundamental coding tool in both VVC and ECM, yielding significant improvement in compression efficiency for screen content. This paper presents a method of reconstruction reordered IBC (RR-IBC) to further improve the compression efficiency for screen content by taking advantage of the symmetry property inherent in screen content sequences. The reconstruction block is flipped horizontally or vertically to restore the characteristics of samples in the original block. A flip-aware adjustment is performed to regulate block vector candidates of the RR-IBC block according to the type of symmetry. Similarly, the reference template of the template-based reordering method for the RR-IBC block is adjusted to accommodate the geometry property. A motion constraint is applied to restrict the block vector of an RR-IBC coded block to a single-direction displacement perpendicular to the flip axis. An RR-IBC flip mode index is signalled to specify how to flip the reconstruction block. Experimental results show that the proposed RR-IBC provides average Bjøntegaard delta rate (BD-rate) savings of 1.61%/1.79%/1.76% and 3.90%/3.63%/3.63% on the Y/Cb/Cr components for class F and class TGM sequences, respectively, with a negligible change in runtime, compared with ECM-5.0 under the all-intra configuration. RR-IBC has been adopted into ECM.
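
The flip idea can be illustrated on a toy mirrored picture. In the sketch below, the current block is the mirror image of an already reconstructed block to its left. A plain block copy leaves a large residual, whereas flipping the current block before matching leaves none, and flipping the reconstruction back restores the original samples. All names and the sample values are assumptions for illustration; this is not the ECM implementation, and the block vector is simply chosen by hand with a zero vertical component, in line with the motion constraint described above.

```python
def flip_block(block, mode):
    """mode: 'horizontal' mirrors left-right, 'vertical' mirrors top-bottom."""
    if mode == "horizontal":
        return [row[::-1] for row in block]
    if mode == "vertical":
        return [row[:] for row in block[::-1]]
    return [row[:] for row in block]


def crop(picture, x, y, w, h):
    """Extract a w-by-h block whose top-left corner is at (x, y)."""
    return [row[x:x + w] for row in picture[y:y + h]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Toy picture that is mirror-symmetric about its vertical centre line.
picture = [
    [1, 2, 3, 4, 4, 3, 2, 1],
    [5, 6, 7, 8, 8, 7, 6, 5],
]

current = crop(picture, 4, 0, 4, 2)              # block being coded (right half)
bv = (-4, 0)                                     # horizontal-only displacement
reference = crop(picture, 4 + bv[0], 0 + bv[1], 4, 2)

plain_cost = sad(current, reference)                               # no flip
flipped_cost = sad(flip_block(current, "horizontal"), reference)   # with flip

# With a zero residual, the decoder copies the reference and flips it back.
reconstructed = flip_block(reference, "horizontal")
print(plain_cost, flipped_cost, reconstructed == current)
```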

Abstract:
Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To rigorously examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization to encourage the generative attacker to produce highly transferable adversarial examples by learning comprehensively simulated transfer-based cross-model&dataset&test black-box meta attack tasks. Specifically, cross-model&dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for the meta-train and meta-test attack processes. As different models may focus on different feature regions, the Perturbation Random Erasing module is further devised to prevent the attacker from learning to only corrupt model-specific features. To give the attacker cross-test transferability, the Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing multi-domain statistics of target models. Extensive experiments show the superiority of MTGA. In particular, in cross-model&dataset and cross-model&dataset&test attacks, MTGA outperforms the SOTA methods by 20.0% and 11.3% on mean mAP drop rate, respectively. The source code is available at https://github.com/yuanbianGit/MTGA

Abstract:
Deep learning approaches have demonstrated high effectiveness in 3D object detection tasks. However, they often suffer from a notable drop in performance on the previously trained classes when learning new classes incrementally without revisiting the old data. This is the “catastrophic forgetting” phenomenon, which impedes 3D object detection in real-world scenarios, where intelligent machines must continuously learn to detect previously unseen categories. Furthermore, frequent co-occurrences of old and new classes in scenes exacerbate catastrophic forgetting and cause model confusion. To address these challenges, we propose a novel static-dynamic co-teaching approach. Our framework involves a student model and two teacher models: a static teacher with fixed weights, which imparts preserved old knowledge to the student, and a dynamic teacher with continuously updated weights, which transfers underlying knowledge from new data to the student. To mitigate the issue of co-occurrence, we generate pseudo labels for base (i.e., old) classes from both static and dynamic sources during incremental learning. Additionally, to mitigate the negative impact of varying occurrence frequencies of classes on fixed thresholding during the selection of pseudo labels, we calibrate the probabilities of base classes to attain more balanced class probabilities. Moreover, our static-dynamic co-teaching framework is backbone-agnostic, making it compatible with different detection architectures. We demonstrate its backbone-agnostic nature by adapting three representative 3D object detectors: VoteNet, 3DETR and CAGroup3D. Extensive experiments on indoor and outdoor benchmark datasets showcase the superior performance of our proposed method compared to baseline approaches and its applicability to different backbone models.

Abstract:
In hand pose estimation, challenges such as occlusion often result in partial observation of a human hand, making it difficult to uniquely determine the hand pose and thus leading to ambiguity in certain hand regions. Heatmap-based methods may struggle with locating ambiguous joints and end up violating physiological constraints in their predictions. Parametric-model-based single-solution methods often fail to adequately address this ambiguity issue due to the inherent one-to-many mappings between input and output, resulting in unstable regression. While some existing multi-hypothesis methods have improved diversity by directly modeling the distribution of ambiguous hypotheses, their localization accuracy still falls short compared to recent single-solution methods. To achieve quality results in both diversity and accuracy, we propose a novel multi-hypothesis approach for hand pose estimation that progressively integrates heatmap information into the distribution of ambiguous poses using a RANSAC-like strategy. It starts with a conditional-flow model to provide an initial estimate of a coarse distribution over ambiguous joint poses. This is followed by randomly sampling multiple hypotheses, projecting each of them onto the 2D heatmap plane, and employing consensus checks to identify unambiguous joints that adhere to skeletal constraints. Joint features are then resampled, with mismatches due to incorrect estimations being eliminated. Finally, we refine the distribution of ambiguous poses using graph neural networks and attention mechanisms. Extensive empirical experiments are carried out, in which our approach is carefully examined both qualitatively and quantitatively. It is shown not only to produce more diverse and feasible pose hypotheses than existing multi-hypothesis methods, but also to achieve accurate localization results comparable to state-of-the-art single-solution methods.

Abstract:
Existing multi-view classification and clustering methods typically improve task accuracy by leveraging and fusing information from different views. However, ensuring the reliability of multi-view integration and final decisions is crucial, particularly when dealing with noisy or corrupted data. Current methods often rely on Kullback-Leibler (KL) divergence to estimate the uncertainty of network predictions, ignoring domain gaps between different modalities. To address this issue, KPHD-Net, based on Hölder divergence, is proposed for multi-view classification and clustering tasks. Generally, our KPHD-Net employs a variational Dirichlet distribution to represent class probability distributions, models evidence from different views, and then integrates it with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation. Our theoretical analysis demonstrates that Proper Hölder divergence offers a more effective measure of distribution discrepancies, ensuring enhanced performance in multi-view learning. Moreover, Dempster-Shafer evidence theory, recognized for its superior performance in multi-view fusion tasks, is introduced and combined with the Kalman filter to provide future state estimations. This integration further enhances the reliability of the final fusion results. Extensive experiments show that the proposed KPHD-Net outperforms the current state-of-the-art methods in both classification and clustering tasks regarding accuracy, robustness, and reliability, with theoretical guarantees.
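
To make the Dirichlet-plus-DST step more concrete, the sketch below uses the standard subjective-logic formulation often paired with evidential deep learning: per-class evidence defines Dirichlet parameters, belief masses, and an uncertainty mass, and two views are fused with a reduced Dempster's rule. This is an assumption about the general technique, not a reproduction of KPHD-Net; the Hölder divergence loss and the Kalman filter component are omitted, and the evidence values are made up.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Dirichlet parameters are alpha_k = e_k + 1 with strength S = sum(alpha),
    so b_k = e_k / S and u = K / S, which together sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def combine_opinions(op1, op2):
    """Fuse two opinions with the reduced Dempster's combination rule."""
    (b1, u1), (b2, u2) = op1, op2
    agreement = sum(p * q for p, q in zip(b1, b2))
    conflict = sum(b1) * sum(b2) - agreement  # mass on disagreeing class pairs
    norm = 1.0 - conflict
    belief = [(p * q + p * u2 + q * u1) / norm for p, q in zip(b1, b2)]
    return belief, (u1 * u2) / norm


# Hypothetical evidence from two views over three classes.
view_a = opinion_from_evidence([4.0, 1.0, 0.0])
view_b = opinion_from_evidence([6.0, 0.0, 0.0])
fused = combine_opinions(view_a, view_b)
print(fused)
```

When both views support the same class, the fused uncertainty mass is no larger than either input uncertainty, which is the behaviour that makes this rule attractive for trusted fusion.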

Abstract:
The transferability of adversarial examples is vital for black-box attacks, as it enables the adversary to deceive the target model without knowing its internals. Despite numerous methods focusing on transferability, they still struggle with transferring across models with distinct architectural components (e.g., CNNs and ViTs). In this work, we argue that limited adversarial perturbation diversity leads to overfitting to the surrogate model, which is a key factor in reducing transferability. To this end, we propose a Masked Adversarial Perturbation (MAP) method to boost adversarial transferability across various architectures from the novel perspective of diversifying perturbations. Specifically, MAP randomly masks perturbation patches during iterations and compels the remaining ones to retain the attack effect, which diversifies perturbations to mitigate their overfitting to the surrogate model. Naturally, MAP spreads the perturbation over local patches to alleviate their co-adaptation and prevent perturbations from overly relying on specific patterns. Consequently, it can deceive both convolution operations and self-attention mechanisms by attacking their basic input units, i.e., single patches, showing superior transferability over previous methods. Extensive experiments illustrate that MAP consistently and significantly boosts diverse black-box attacks to achieve state-of-the-art performance.
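
The masking step described above can be sketched in a few lines. The function below zeroes out randomly chosen square patches of a perturbation so that the surviving patches must carry the attack on their own. It shows only the masking operation: the patch size, the drop probability, and the function name are assumptions, and the surrounding iterative attack (gradient computation on a surrogate model, step size, norm projection) is not shown.

```python
import random


def mask_patches(perturbation, patch_size, drop_prob, rng):
    """Return a copy of a 2D perturbation with random patches set to zero."""
    height, width = len(perturbation), len(perturbation[0])
    masked = [row[:] for row in perturbation]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < drop_prob:  # drop this whole patch
                for r in range(top, min(top + patch_size, height)):
                    for c in range(left, min(left + patch_size, width)):
                        masked[r][c] = 0.0
    return masked


rng = random.Random(0)
delta = [[1.0] * 4 for _ in range(4)]          # toy 4x4 perturbation
masked_delta = mask_patches(delta, patch_size=2, drop_prob=0.5, rng=rng)
for row in masked_delta:
    print(row)
```

In an iterative attack, a fresh mask would typically be drawn at every step so that no single patch pattern can dominate the final perturbation.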

Abstract:
The absorption and scattering of light in different turbid media cause images to suffer from poor visibility and contrast, which severely affects the performance of many computer vision tasks. To address this issue, we propose a fast scene recovery method based on the ambient light similarity prior (ALSP). In this method, the ambient light similarity metric is designed from both magnitude and orientation and embedded into the optical imaging model, and the estimate of scene transmission is derived by simplification and approximation. The estimation of the transmission map is very simple, with time complexity $O(N)$, where $N$ is the size of the input image. Moreover, we propose a progressive scheme to determine the ambient light for the near and far regions separately, which can effectively improve the brightness and color saturation of the restored image. Experiments performed in different scenes demonstrate that our method outperforms several state-of-the-art competitors in terms of efficiency and scene recovery performance.

Abstract:
In recent years, the Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks, and the robustness of the model has received increasing attention. However, existing large models tend to prioritize performance during training, potentially neglecting robustness, which may lead to serious security concerns. In this paper, we establish a new challenge: exploring how to use a small number of additional parameters for adversarial finetuning to quickly and effectively enhance the adversarial robustness of a standardly trained model. To address this challenge, we develop a novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module, which helps mitigate magnitude differences in parameters between the adversarial and standard training paradigms. Furthermore, we propose the FullLoRA framework by integrating the learnable LNLoRA modules into all key components of ViT-based models while keeping the pretrained model frozen, which can significantly improve model robustness via adversarial finetuning in a parameter-efficient manner. Extensive experiments on several datasets demonstrate the superiority of our proposed FullLoRA framework. It achieves robustness comparable to full finetuning while requiring only about 5% of the learnable parameters. This also effectively addresses concerns regarding the extra model storage space and enormous training time caused by adversarial finetuning.
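
One plausible reading of "a learnable layer normalization before the conventional LoRA module" is sketched below for a single linear layer: the frozen weight acts on the raw input, while the low-rank branch sees a layer-normalized copy. The exact placement, the scaling factor, and all names are assumptions made for illustration; the paper's implementation may differ.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lnlora_forward(x, w_frozen, lora_a, lora_b, gamma, beta, scale=1.0):
    """Frozen linear layer plus a low-rank update applied to LayerNorm(x)."""
    base = matvec(w_frozen, x)                      # pretrained path, not trained
    low_rank = matvec(lora_a, layer_norm(x, gamma, beta))
    update = matvec(lora_b, low_rank)               # trainable path
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (assumption): 4 inputs, 3 outputs, rank 1.
x = [1.0, 2.0, 3.0, 4.0]
w_frozen = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.2, 0.0, 0.0], [0.0, 0.0, 0.3, 0.0]]
lora_a = [[0.1, 0.2, 0.3, 0.4]]
gamma, beta = [1.0] * 4, [0.0] * 4
zero_b = [[0.0], [0.0], [0.0]]          # common LoRA init: B starts at zero

out_init = lnlora_forward(x, w_frozen, lora_a, zero_b, gamma, beta)
out_tuned = lnlora_forward(x, w_frozen, lora_a, [[1.0], [0.0], [-1.0]], gamma, beta)
print(out_init, out_tuned)
```

With `lora_b` initialized to zero, the layer reproduces the pretrained output exactly, so finetuning starts from the standardly trained model and only the small `lora_a`, `lora_b`, `gamma`, and `beta` tensors need to be learned and stored.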

Abstract:
Alpha trees and the derived $\alpha$-$\omega$-hierarchies are powerful tools for hierarchical image representation in computer vision. However, the quality of $\alpha$-$\omega$-hierarchies has not been fully evaluated, limiting their further development and application. In our study, an algorithm for evaluating the quality of $\alpha$-$\omega$-hierarchies based on horizontal cut filters is proposed. With the aim of automatically selecting optimal parameters and dissimilarity measures for $\alpha$-$\omega$-hierarchy construction, key factors including maximum accuracy, construction complexity, and efficiency of $\alpha$-$\omega$-hierarchies are systematically considered. Notably, experiments on remote sensing images were conducted to demonstrate the usefulness of this algorithm. In addition, our algorithm can potentially be extended to evaluate the quality of other types of hierarchical trees, making it useful for the automatic selection of optimal hierarchical segmentation methods.

Affiliations: Institute of Optics and Electronics, State Key Laboratory Cultivation Base of Atmospheric Optoelectronic Detection and Information Fusion, Jiangsu International Joint Laboratory on Meteorological Photonics and Optoelectronic Detection, and Jiangsu Engineering Research Center for Intelligent Optoelectronic Sensing Technology of Atmosphere, Nanjing University of Information Science and Technology, Nanjing, China; School of Electronic Information and Electrical Engineering, Anhui Jianzhu University, Hefei, China; Faculty of Computer Science, China University of Geosciences, Wuhan, China; Department of Technology of Computers and Communications, Escuela Politecnica, Hyperspectral Computing Laboratory, University of Extremadura, Cáceres, Spain

Abstract:
Autoencoders (AEs) have received extensive attention in hyperspectral anomaly detection (HAD) due to their capability to separate the background from the anomaly based on the reconstruction error. However, the existing AE methods routinely fail to adequately exploit spatial information and may precisely reconstruct anomalies, thereby affecting the detection accuracy. To address these issues, this study proposes a novel Multi-scale Autoencoder Suppression Strategy (MASS). The underlying principle of MASS is to prioritize the reconstruction of background information over anomalies. In the encoding stage, the Local Feature Extractor, which integrates Convolution and Omni-Dimensional Dynamic Convolution (ODConv), is combined with the Global Feature Extractor based on Transformer to effectively extract multi-scale features. Furthermore, a Self-Attention Suppression module (SAS) is devised to diminish the influence of anomalous pixels, enabling the network to focus more intently on the precise reconstruction of the background. During the process of network learning, a mask derived from the test outcomes of each iteration is integrated into the loss function computation, encompassing only the positions with low anomaly scores from the preceding detection round. Experiments on eight datasets demonstrate that the proposed method is significantly superior to several traditional methods and deep learning methods in terms of performance.

Abstract:
Due to its low storage requirement and high retrieval efficiency, hashing-based retrieval has shown great potential and has been widely applied in information retrieval. However, retrieval tasks in real-world applications usually have to handle data from various domains, leading to unsatisfactory performance of existing hashing-based methods, as most of them assume that the retrieval pool and the query set are similar. Moreover, most existing works overlook the self-representation in cross-modal data, which contains modality-specific semantic information. To cope with these challenges, this paper proposes asymmetric and discrete self-representation enhancement hashing (ADSEH) for cross-domain retrieval. Specifically, ADSEH aligns the distributions of cross-domain data through domain adaptation, minimizing the distribution mismatch to reduce the heterogeneous semantic gaps. Then, ADSEH learns the self-representation, which is embedded into the generated hash codes to enhance semantic relevance, improve the quality of the hash codes, and boost the generalization ability of ADSEH. Finally, the heterogeneous semantic gaps are further reduced by log-likelihood similarity preservation for the cross-domain data. Experimental results demonstrate that ADSEH can outperform some SOTA baseline methods on four widely used datasets.

Abstract:
Capturing the human body and clothing from videos has made significant progress in recent years, but several challenges remain to be addressed. Previous methods reconstruct 3D bodies and garments from videos with self-rotating human motions or capture the body and clothing separately based on neural implicit fields. However, the reconstruction methods for self-rotating motions may cause unstable tracking on dynamic videos with arbitrary human motions, while implicit-field-based methods are limited by inefficient rendering and low-quality synthesis. To solve these problems, we propose a new method, called CloCap-GS, for clothed human performance capture with 3D Gaussian Splatting. Specifically, we align 3D Gaussians with the deforming geometries of the body and clothing, and leverage photometric constraints formed by matching Gaussian renderings with input video frames to recover temporal deformations of the dense template geometry. The geometry deformations and Gaussian properties of both the body and clothing are optimized jointly, achieving both dense geometry tracking and novel-view synthesis. In addition, we introduce a physics-aware material-varying cloth model to preserve physically plausible cloth dynamics and body-clothing interactions, which is pre-trained in a self-supervised manner without preparing training data. Compared with existing methods, our method improves the accuracy of dense geometry tracking and the quality of novel-view synthesis for a variety of daily garment types (e.g., loose clothes). Extensive experiments in both quantitative and qualitative evaluations demonstrate the effectiveness of CloCap-GS on real sparse-view or monocular videos.

Abstract:
Because optical spectrometers capture abundant molecular, biological, and physical information beyond images, ongoing efforts focus on both algorithmic and hardware approaches to obtain detailed spectral information. Spectral reconstruction from red-green-blue (RGB) values acquired by conventional trichromatic cameras has been an active area of study. However, the resultant spectral profile is often affected not only by the unknown spectral properties of the sample itself, but also by light conditions, device characteristics, and image file formats. Existing machine learning models for spectral reconstruction are further limited in generalizability due to their reliance on task-specific training data or fixed models. Advanced spectrometer hardware employing sophisticated nanofabricated components also constrains scalability and affordability. Here we introduce a general computational framework, co-designed with spectrally incoherent color reference charts, to recover the spectral information of an arbitrary sample from a single-shot photo in the visible range. The mutual optimization of reference color selection and the computational algorithm eliminates the need for training data or pretrained models. In transmission mode, altered RGB values of reference colors are used to recover the spectral intensity of the sample, achieving spectral resolution comparable to that of scientific spectrometers. In reflection mode, a spectral hypercube of the sample can be constructed from a single-shot photo, analogous to hyperspectral imaging. The reported computational photography spectrometry has the potential to make optical spectroscopy and hyperspectral imaging accessible using off-the-shelf smartphones.

Abstract:
Low-Light Enhancement (LLE) is aimed at improving the quality of photos/videos captured under low-light conditions. It is worth noting that most existing LLE methods do not take advantage of geometric modeling. We believe that incorporating geometric information can enhance LLE performance, as it provides insights into the physical structure of the scene that influences illumination conditions. To address this, we propose a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) designed to assist low-light enhancement models in learning improved features by integrating geometric priors into the feature representation space. In this paper, we employ depth priors as the geometric representation. Our approach focuses on the integration of depth priors into various LLE frameworks using a unified methodology. This methodology comprises two key novel modules. First, a depth-aware feature extraction module is designed to inject depth priors into the image representation. Then, the Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) is formulated with a cross-domain attention mechanism, which combines depth-aware features with the original image features within LLE models. We conducted extensive experiments on public low-light image and video enhancement benchmarks. The results illustrate that our framework significantly enhances existing LLE methods. The source code and pre-trained models are available at https://github.com/Estheryingqi/GG-LLERF

Affiliations: Department of Computer Science and Software Engineering, The University of Western Australia, Perth, Crawley, Australia; College of Computer Science, The University of Oulu, Oulu, Finland; College of Electronic Science, Aviation University of Air Force, Changchun, China; Information Technology Discipline, Murdoch University, Murdoch, Australia; School of Electronics and Communication Engineering, Sun Yat-sen University (SYSU), Guangzhou, China; Department of Electrical, Electronic and Computer Engineering, The University of Western Australia (UWA), Perth, Crawley, Australia

Abstract:
Point cloud completion aims to reconstruct complete 3D shapes from partial scans. The long-range dependencies between points and shape perception are crucial for this task. While Transformers are effective due to their global processing ability, the quadratic complexity of their attention mechanism makes them unsuitable for long sequences when computational resources are constrained. As an alternative, State Space Models (SSMs) provide a memory-efficient solution for handling long-range dependencies, yet applying them directly to unordered point clouds presents challenges because of their intrinsic causality requirements. Existing methods attempt to address this by sorting points along a single axis. This, however, often overlooks complex causal relationships in 3D space, since adjacency relationships based on Euclidean distance between points in 3D space may not be preserved by this linear arrangement. To overcome this issue, we introduce CompletionMamba, a novel SSM-based network designed to harness SSMs for capturing both global and local dependencies within a point cloud. Initially, the input point cloud is causally structured by rearranging its coordinates. Then, a local SSM framework is proposed that defines neighborhood spaces around each point based on Euclidean distance, enhancing the causal structure. Although the local SSM enhances relationships in short- and long-distance sequences, it still lacks full shape modeling of the point cloud. To address this, we propose a novel shape-aware Mamba by integrating the shape code of each 3D shape into the model, enabling shape information to propagate to all points. Our experiments show that CompletionMamba achieves state-of-the-art performance on both the MVP and PCN datasets.

Abstract:
Infrared video small object detection is pivotal in numerous security and surveillance applications. However, existing deep learning-based methods, which typically rely on a two-step paradigm of frame-by-frame detection followed by temporal refinement, struggle to effectively utilize temporal information. This is particularly challenging when detecting small objects against complex backgrounds. To address these issues, we introduce the One-Step Transformer (OSFormer), a novel method that pioneeringly integrates a small-object-friendly transformer with a one-step detection paradigm. Unlike traditional methods, OSFormer processes the video sequence only through a single inference, encoding the sequence into cube format data and tracking object motion trajectories. Additionally, we propose the Varied-Size Patch Attention (VPA) module, which generates patches of varying sizes to capture adaptive attention features, bridging the gap between transformer architectures and small object detection. To further enhance detection accuracy, OSFormer incorporates a Doppler Adaptive Filter, which integrates traditional filtering techniques into an end-to-end neural network to suppress background noise and accentuate small objects. OSFormer outperforms YOLOv8-s on both the AntiUAV dataset (+3.1% mAP$_{50}$, -35.1% Params) and the InfraredUAV dataset (+4.0% mAP$_{50-95}$, -51.0% FLOPs), demonstrating superior efficiency and effectiveness in small object detection. The code is available on https://github.com/q2479036243/OSFormer.

Abstract:
Deep Image Prior (DIP) has shown that networks with stochastic initialization and custom architectures can effectively address inverse imaging challenges. Despite its potential, DIP requires significant computational resources, whereas the lighter Implicit Neural Positional Image Prior (PIP) often yields overly smooth solutions due to exacerbated spectral bias. Research on lightweight, high-performance solutions for inverse imaging remains limited. This paper proposes a novel framework, Enhanced Positional Image Priors through High-Order Implicit Representations (HOPE), incorporating high-order interactions between layers within a conventional cascade structure. This approach reduces the spectral bias commonly seen in PIP, enhancing the model’s ability to capture both low- and high-frequency components for optimal inverse problem performance. We theoretically demonstrate that HOPE’s expanded representational space, narrower convergence range, and improved Neural Tangent Kernel (NTK) diagonal properties enable more precise frequency representations than PIP. Comprehensive experiments across tasks such as signal representation (audio, image, volume) and inverse image processing (denoising, super-resolution, CT reconstruction, inpainting) confirm that HOPE establishes new benchmarks for recovery quality and training efficiency.

Abstract:
Accurate 3D medical image segmentation is crucial for diagnosis and treatment. Diffusion models demonstrate promising performance in medical image segmentation tasks due to the progressive nature of the generation process and the explicit modeling of data distributions. However, the weak guidance of conditional information and insufficient feature extraction in diffusion models lead to the loss of fine-grained features and structural consistency in the segmentation results, thereby affecting the accuracy of medical image segmentation. To address this challenge, we propose a Mamba-Enhanced Diffusion Model for 3D Medical Image Segmentation. We extract multilevel semantic features from the original images using an encoder and tightly integrate them with the denoising process of the diffusion model through a Semantic Hierarchical Embedding (SHE) mechanism, to capture the intricate relationship between the noisy label and image data. Meanwhile, we design a Global-Slice Perception Mamba (GSPM) layer, which integrates multi-dimensional perception mechanisms to endow the model with comprehensive spatial reasoning and feature extraction capabilities. Experimental results show that our proposed MambaDiff achieves more competitive performance compared to prior arts with substantially fewer parameters on four public medical image segmentation datasets including BraTS 2021, BraTS 2024, LiTS and MSD Hippocampus. The source code of our method is available at https://github.com/yuliu316316/MambaDiff

Abstract:
Generalized zero-shot learning (GZSL) shows great potential for improving generalization to unseen classes in real-world scenarios. However, most GZSL methods depend on benchmark datasets with per-class attribute annotations, which creates a large semantic gap and worsens the domain shift problem in the visual-semantic space. To address these challenges, instance-level attributes offer an intuitive solution, but they require expensive manual annotation. In this paper, we propose a simple yet effective approach called per-instance attribute synthesis (PIAS) to generate diverse semantic representations for each instance. Our method first uses the Vision Transformer (ViT) model to extract visual features and then generates per-instance attributes. The patch splitting, positional embedding, and multi-head self-attention mechanisms in ViT improve the discriminability of both visual and semantic representations. Next, we define the generated attributes of class-average images as class anchor points. These anchor points are calibrated in the semantic space by minimizing the cosine similarity between the anchor points and per-class attribute annotations. Finally, we improve the diversity of generated per-instance attributes by aligning the topological structure between per-class attribute annotations and synthesized per-instance attributes with that between class-average visual features and per-instance visual features. We conduct comprehensive experiments on three challenging ZSL datasets: AWA2, CUB, and SUN. The results show that PIAS significantly outperforms state-of-the-art methods under both ZSL and GZSL settings. We further demonstrate the generalization ability of PIAS by applying it to attribute-based zero-shot image retrieval tasks.

Abstract:
This paper addresses the two-view geometric model fitting problem on multi-structural data with severe outliers, with the goal of providing reliable and consistent fitting results. The key idea is to adopt spatial clustering to guide the deterministic sampling of minimal subsets. Specifically, we first improve the effectiveness of spatial clustering with good neighbors that preserve the consensus of neighborhood elements and the neighborhood topology, which enhances the quality of the sampled minimal subsets. We then design a multi-scale fusion strategy, which not only yields more high-quality minimal subsets but also enables our method to cover all model instances in the data. Moreover, we propose a simple and effective model selection algorithm to estimate the parameters of the model instances in the data. The resulting method provides fast, accurate, and stable model fitting results for multi-structural data. In addition, we construct two large labeled datasets, for homography and fundamental matrix estimation, respectively. Experimental results on real images from six datasets show the significant superiority of the proposed method in both accuracy and speed over several state-of-the-art alternatives. In particular, on the MS-COCO-F and YFCC100M-F datasets, the proposed method yields an improvement of over three times in segmentation error, parameter error, and CPU time.

Abstract:
The performance of deep learning models for medical image segmentation is often limited in scenarios where training data or annotations are scarce. Self-Supervised Learning (SSL) is an appealing solution to this dilemma because it can learn features from a large number of unannotated images. Existing SSL methods have focused on pretraining either an encoder for global feature representation or an encoder-decoder structure for image restoration, where the gap between pretext and downstream tasks limits the usefulness of pretrained decoders in downstream segmentation. In this work, we propose a novel SSL strategy named Volume Fusion (VolF) for pretraining 3D segmentation models. It minimizes the gap between pretext and downstream tasks by introducing a pseudo-segmentation pretext task, where two sub-volumes are fused by a discretized block-wise fusion coefficient map. The model takes the fused result as input and predicts the category of the fusion coefficient for each voxel, so it can be trained with standard supervised segmentation loss functions without manual annotations. Experiments with an abdominal CT dataset for pretraining and both in-domain and out-of-domain downstream datasets showed that VolF led to a large performance gain over training from scratch, with faster convergence, and outperformed several state-of-the-art SSL methods. In addition, it generalizes across different network structures, and the learned features transfer well to different body parts and modalities.
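The pseudo-segmentation pretext task can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the fusion is a voxel-wise linear blend, that each cubic block draws one coefficient from the discrete set {0, 1/K, ..., 1}, and that the training label is the index k of that coefficient. The function name `volume_fusion` and its parameters are hypothetical.

```python
import random


def volume_fusion(vol_a, vol_b, block, num_levels, rng):
    """Blend two equally sized 3D volumes block by block.

    Each cubic block draws a class k in {0, ..., num_levels}; its fusion
    coefficient is k / num_levels. Returns the fused volume and the
    per-voxel class map that serves as the pseudo-segmentation label.
    """
    d, h, w = len(vol_a), len(vol_a[0]), len(vol_a[0][0])
    block_class = {}  # block index -> class k, drawn on first visit
    fused = [[[0.0] * w for _ in range(h)] for _ in range(d)]
    labels = [[[0] * w for _ in range(h)] for _ in range(d)]
    for z in range(d):
        for y in range(h):
            for x in range(w):
                key = (z // block, y // block, x // block)
                if key not in block_class:
                    block_class[key] = rng.randint(0, num_levels)
                k = block_class[key]
                alpha = k / num_levels
                fused[z][y][x] = (
                    alpha * vol_a[z][y][x] + (1 - alpha) * vol_b[z][y][x]
                )
                labels[z][y][x] = k
    return fused, labels


# Toy usage: blending an all-ones volume with an all-zeros volume makes
# the fused intensity equal to the fusion coefficient itself.
rng = random.Random(0)
vol_a = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
vol_b = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
fused, labels = volume_fusion(vol_a, vol_b, block=2, num_levels=4, rng=rng)
```

Because `labels` is an integer class map with the same shape as the input, a segmentation network could in principle be trained on `(fused, labels)` pairs with an ordinary segmentation loss, which is the property the abstract highlights.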

Abstract:
Multi-view Subspace Clustering (MVSC) effectively aggregates multiple data sources to improve clustering performance. Recently, various anchor-based variants have been introduced to reduce the computational complexity of MVSC. Although satisfactory progress has been made, existing methods either independently learn anchor matrices and their anchor representations or learn a consensus anchor matrix and a unified anchor representation, failing to capture both consistency and complementary information simultaneously. In addition, the time complexity of obtaining clustering results by applying Singular Value Decomposition (SVD) to the anchor representation matrix remains high. To tackle these problems, we propose an Adaptive Anchor-guided Representation Learning for Efficient Multi-view Subspace Clustering (A2RL-EMVSC) framework, which integrates consensus anchor learning, anchor-guided representation learning, and matrix factorization to enhance clustering performance and scalability. Technically, the proposed method learns view-specific anchor representation matrices under the guidance of consensus anchors, which simultaneously exploits consistency and complementary information. Moreover, by applying matrix decomposition to the view-specific anchor representation matrices, clustering results can be obtained with linear time complexity. Extensive experiments on ten challenging multi-view datasets show that the proposed method improves clustering effectiveness compared with state-of-the-art methods.

Abstract:
Camouflaged object detection (COD) is challenging for both human and computer vision, as targets often blend into the background by sharing similar color, texture, or shape. While many feature enhancement techniques exist, single-view methods tend to overemphasize certain Recognizing that camouflaged objects exhibit different concealment strategies under varying observational perspectives, we propose HUNTNet, a network that establishes a dynamic detection mechanism to decouple target features from RGB images and perform topological decamouflage across multiple homomorphic feature spaces through a unified feature focusing architecture. We adopt PVTv2 as the backbone to extract multi-perspective spatial features. Detail representation is enhanced via a feature module that integrates Dual-Channel Recursive (DCR), Wavelet-Gabor Transform (WGT), and Anisotropic Gradient Responding (AGR), which together improve boundary discrimination and edge contour detection. To further boost performance, the Simplicial Feature Integration (SFI) module recursively fuses multi-layer features, enabling high-resolution focus on target regions. Experiments show that HUNTNet surpasses state-of-the-art methods in both accuracy and generalization, offering a robust solution for COD and improving segmentation in complex scenes. Our code is available at https://github.com/HaolinJi817/HUNTNet

Abstract:
Complex imaging environments and conditions in real-world scenes pose significant challenges for stereo matching tasks. Models tend to underperform on non-Lambertian surfaces, weakly textured regions, and occluded regions, due to the difficulty of establishing accurate matching relationships between pixels. To alleviate these problems, we propose a multi-scale geometrically enhanced stereo matching model that exploits the geometric structural relationships of the objects in the scene. First, a geometric structure perception module is designed to extract geometric information from the reference view. Second, a geometric structure-adaptive embedding module is proposed to integrate geometric information with matching similarity information. This module integrates multi-source features dynamically to predict disparity residuals in different regions. Third, a geometry-based normalized disparity correction module is proposed to improve matching robustness for pathological regions in realistic complex scenes. Extensive evaluations on popular benchmarks demonstrate that our method achieves competitive performance against leading approaches. Notably, our model provides robust and accurate predictions in challenging regions containing edges, occlusions, and reflective or non-Lambertian surfaces. Our source code will be publicly available.

Abstract:
Hyperspectral image change detection (HSI-CD) benefits from HSIs with continuous spectral bands, which uniquely enable the analysis of more subtle changes. Existing methods have achieved desirable performance relying on multi-temporal homogeneous HSIs over the same region, which are generally difficult to obtain in real scenes. HSI-RGB multimodal CD overcomes the constraint of limited HSI availability by incorporating RGB data from another time point, and combining the advantages of the different modalities enhances the robustness of detection results. Nevertheless, due to the different imaging mechanisms of the two modalities, existing HSI-CD methods cannot be directly applied. In this paper, we propose a cycle translation-based collaborative training (co-training) method for HSI-RGB multimodal CD, which achieves cross-modal mutual guidance to collaboratively learn complementary difference information from diverse modalities for identifying changes. Specifically, a cross-modal guided CycleGAN-based image translation module is designed to implement bi-directional image translation, which mitigates the modal difference and enables the extraction of information related to land cover changes. Then, a spatial-spectral interactive co-training CD module is proposed to achieve iterative interaction between cross-modal information, which jointly extracts the multimodal difference features to generate the final results. The proposed method outperforms several leading CD methods in extensive experiments carried out on both real and synthetic datasets. In addition, a new public HSI-RGB multimodal dataset and our code are available at https://github.com/Jiahuiqu/CT2Net

Abstract:
Action Quality Assessment (AQA), which aims at the automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding variations in visual appearance are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each of which has a different duration. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions, resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. First, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Compared to mask or auxiliary visual features, skeletal features provide a more accurate representation during athletic movements. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly annotated FineDiving-Pose Dataset to improve on the current low-quality human pose labels. In experiments, the results on the FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe

Abstract:
Virtual Reality (VR) has attracted widespread attention in recent years due to its capability to create immersive experiences by presenting multi-modal information to users. Omnidirectional videos (ODVs), as a prominent component of VR content, are essential across diverse applications. This requires service providers to monitor and optimize the quality of ODVs throughout the filming, encoding, decoding, and transmission stages to ensure a high-quality viewing experience. However, most existing Quality of Experience (QoE) studies for ODVs focus only on visual quality, overlooking the impact of the audio modality on perceptual quality. This paper presents a comprehensive study of omnidirectional audio-visual quality assessment (OD-AVQA) from both subjective and objective perspectives. Specifically, we first establish a large-scale audio-visual quality assessment database for ODVs named OAVQAD+, which includes 625 distorted omnidirectional audio-visual sequences derived from 25 pristine ODVs, and the corresponding collected mean opinion scores (MOSs) for the QoE of these ODVs. This constitutes the largest database for assessing the audio-visual quality of ODVs. To advance the field of objective OD-AVQA, we construct a benchmark that includes three types of benchmark models. Type I and Type II models integrate well-known video quality assessment (VQA) and audio quality assessment (AQA) methods using support vector regression (SVR) and multi-layer perceptron (MLP), respectively, while Type III consists of AVQA models specifically designed for traditional 2D audio-visual sequences. We also propose a novel Omnidirectional Audio-Visual quality assessment Network (OmniAVNet) that integrates quality-aware audio, visual, and motion features to predict overall audio-visual quality for ODVs effectively, and that supports both full-reference (FR) and no-reference (NR) assessment. Extensive experimental results demonstrate that OmniAVNet outperforms the aforementioned benchmark OD-AVQA models on two OD-AVQA databases, and shows strong performance on one omnidirectional VQA database. The database and code are available at https://github.com/IntMeGroup/OmniAVNet.

Abstract:
The α-tree is an effective hierarchical image representation used for connected filtering or segmentation in remote sensing and other image applications. The α-tree constructs a tree based on the dissimilarities of the pixels in an image. Compared to other hierarchical image representations such as the component tree, the α-tree provides a better representation of the granularity of images and is easier to apply to multichannel images. The major drawback of the α-tree is its processing speed, due to the large amount of data to be processed and the lack of studies on efficient algorithms, especially for multichannel and high dynamic range images. In this study, we introduce a novel adaptation of the hybrid component tree algorithm to the α-tree for fast parallel α-tree construction at any dynamic range of pixel dissimilarity. We tested the hybrid α-tree algorithm on Sentinel-2 remote sensing images from the European Space Agency (ESA) as well as randomly generated images, on the Hábrók high performance computing cluster. Experimental results show that the hybrid α-tree algorithm achieves a processing speed of 10–30 Mpix/s and a speedup of 10–30 on a 128-core computer, demonstrating the efficiency of what is, to the best of our knowledge, the first parallel α-tree algorithm for high dynamic range images.
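The idea of building a hierarchy from pixel dissimilarities can be sketched with a sequential union-find pass. The code below is not the hybrid parallel algorithm described above; it is a minimal illustration that assumes a single-channel image, 4-connectivity, and absolute intensity difference as the dissimilarity. For one threshold α it labels the connected components in which neighboring pixels differ by at most α; raising α merges these components, which is the nesting an α-tree records. The names `alpha_components` and `count_components` are hypothetical.

```python
def alpha_components(image, alpha):
    """Label the alpha-connected components of a 2D grayscale image.

    Two 4-adjacent pixels are linked when their absolute intensity
    difference is <= alpha; a component is the transitive closure
    of these links.
    """
    h, w = len(image), len(image[0])
    parent = list(range(h * w))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w and abs(image[y][x] - image[y][x + 1]) <= alpha:
                union(i, i + 1)
            if y + 1 < h and abs(image[y][x] - image[y + 1][x]) <= alpha:
                union(i, i + w)
    return [[find(y * w + x) for x in range(w)] for y in range(h)]


def count_components(labels):
    """Number of distinct component labels in a label image."""
    return len({label for row in labels for label in row})


# Hypothetical toy image with three intensity regions.
image = [
    [10, 11, 50],
    [10, 12, 52],
    [90, 91, 53],
]
counts = [count_components(alpha_components(image, a)) for a in (0, 2, 40)]
```

On this toy image the component count shrinks from 8 to 3 to 1 as α grows from 0 to 2 to 40, showing how components at a small α are contained in those at a larger α.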

Abstract:
Cross-domain few-shot medical image segmentation (CDFSMIS) presents the fundamental challenge of segmenting novel anatomical or tissue structures on unfamiliar medical imaging domains with limited annotated data. In this paper, we conduct an in-depth investigation of CDFSMIS and identify two critical observations: 1) the conventional matching mechanisms from existing few-shot models are particularly vulnerable to discrepancies in local characteristics between different domains and 2) the semantic representations learned from source domains often lack robustness when generalizing to unfamiliar target domains. Motivated by these insights, we propose a novel Dynamic Semantic Matching (DSM) framework that addresses these challenges through a three-component approach. First, we design a support-query feature re-weighting (SFR) mechanism that leverages multilevel hidden features to suppress domain-specific contents. Second, we introduce a dynamic semantic information selection (DSIS) strategy that adaptively identifies and combines domain-robust channels to construct generalizable representations. Third, we develop a dual-perspective semantic center calculation method to address the inherent texture imbalance in medical images. Extensive experiments on four unfamiliar target domains (MS-CMR, PI-PMR, Chest-X-Ray and ISIC2018) demonstrate that our approach significantly outperforms state-of-the-art few-shot segmentation and cross-domain few-shot segmentation models, validating the effectiveness of DSM in simultaneously addressing domain generalization and semantic matching challenges in medical image segmentation. The source code is available at https://github.com/YazhouZhu19/DSM

Abstract:
Low-light video enhancement is a critical task in computer vision with a wide range of applications. However, there is a lack of high-quality benchmark datasets in this field. To address this issue, we collect a high-quality low-light video dataset using a well-designed camera system. The videos in our dataset feature apparent camera motion and strict spatial alignment. To achieve general low-light video enhancement, we propose a Retinex-based method called Light Adjustable Network (LAN). LAN iteratively adjusts the brightness and adapts to different lighting conditions in various real-world scenarios, producing visually appealing results. We further develop a new dataset capture method and a new low-light video enhancement method to address the limitations of our previous dataset in capturing dynamic scenes and of our previous method. The new camera setup and capture method enable the recording of real continuous videos, from which the new dataset is generated. Our new low-light video enhancement method, LAN++, leverages a new inter-frame relationship: difference images. It utilizes the texture information contained in the difference images of dynamic scenes to supplement the high-frequency details of the original features, which produces sharper and more realistic output images. Extensive experiments demonstrate the superiority of our low-light video dataset and enhancement method. Our dataset can be downloaded at https://pan.baidu.com/s/1d3EljvVduVM0wUOvzjWaqA?pwd=p45g.

Abstract:
A model trained in a source domain often experiences a decline in effectiveness when deployed in a different target domain, primarily due to the discrepancies between the source and target domain characteristics. Test time adaptation (TTA) provides a practical solution for addressing the domain gap by adapting the models during the test phase. Existing TTA approaches mainly focus on aligning image features into a unified feature space. However, they generally only manage to achieve broad, coarse-grained alignment across domains while overlooking the more detailed, fine-grained feature clusters within each category. Furthermore, these methods are susceptible to settling at local optima because significant details can be lost when image features are abstracted into distribution parameters. To overcome these challenges, we introduce a novel approach that ensures hierarchical cross-domain alignment at three distinct levels: category-level, subcategory-level, and sample-level. Simple category-level alignment is inadequate due to the presence of various subcategories within each category, which possess distinct semantic properties identified through unsupervised clustering in our approach. Advancing further, we enhance our method by creating synthesized features from the initially extracted category-specific features, aiming for precise sample-level alignment. During our optimization process, we redefine TTA as essentially a feature matching problem, concentrating on the calculation of feature matching probabilities. Through hierarchical distribution alignment across these levels, our method maintains the semantic consistency of cross-domain image features from a broad to a detailed scale. Unlike prior test-time adaptation methods such as Tent, our method leverages source data only once after pre-training to fit feature distributions. During the testing phase, source data is completely discarded, and the model relies solely on test sample features. This design ensures privacy preservation and makes the method well-suited for privacy-sensitive applications. Our experimental evaluations on recognized datasets demonstrate that our approach significantly surpasses other established TTA methods in performance. Our code is accessible at https://github.com/yaboliudotug/HDA-TTA

Abstract:
Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at https://github.com/chengxianjing/PVNet

Abstract:
Existing underwater image restoration (UIR) methods generally only handle color distortion or jointly address color and haze issues, but they often overlook the more complex degradations that can occur in underwater scenes. To address this limitation, we propose a Universal Underwater Image Restoration method, termed UniUIR, which handles the complex scenario of real-world underwater mixed distortions in an all-in-one manner. To disentangle degradation-specific effects and capture their inter-correlations, we propose the Mamba Mixture-of-Experts module (MMoEM). Each expert specializes in distinct aspects of degradation, while a gating mechanism dynamically routes features to appropriate experts. This design enables collaborative prior extraction and preserves global context, all within linear computational complexity. Building upon this foundation, to enhance degradation representation and address the task conflicts that arise when handling multiple types of degradation, we introduce the spatial-frequency prior generator. This module extracts degradation prior information in both spatial and frequency domains, and adaptively selects the most appropriate task-specific prompts based on image content, thereby improving the accuracy of image restoration. Finally, to more effectively address complex, region-dependent distortions in the UIR task, we incorporate depth information derived from a large-scale pre-trained depth prediction model, thereby enabling the network to perceive and leverage depth variations across different image regions to handle localized degradation. Extensive experiments demonstrate that UniUIR produces better results in both qualitative and quantitative comparisons and shows stronger generalization than state-of-the-art methods. Project page at https://house-yuyu.github.io/UniUIR.
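The gating idea behind a mixture-of-experts module can be sketched generically. The code below is not the MMoEM: the experts here are plain Python functions rather than Mamba blocks, and dense softmax gating over a linear score is an assumption, since the abstract does not state the routing rule. The names `softmax`, `mixture_of_experts`, and `gate_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(feature, experts, gate_weights):
    """Combine expert outputs using gate weights computed from the feature.

    gate_weights holds one row per expert; the gate score of an expert is
    the dot product of its row with the feature vector.
    """
    scores = [sum(w * f for w, f in zip(row, feature)) for row in gate_weights]
    gates = softmax(scores)
    outputs = [expert(feature) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(feature))
    ]
    return mixed, gates


# Toy usage with two stand-in experts.
feature = [1.0, 2.0]
experts = [
    lambda f: [2.0 * v for v in f],  # stand-in expert 0: scale
    lambda f: [v + 1.0 for v in f],  # stand-in expert 1: shift
]
gate_weights = [[1.0, 0.0], [0.0, 0.0]]  # gate scores favor expert 0 here
mixed, gates = mixture_of_experts(feature, experts, gate_weights)
```

The gates depend on the input feature, so different inputs weight the experts differently; a sparse variant would keep only the largest gates before mixing.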

Abstract:
Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, Transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.

Abstract:
Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by validation on an object detection task.

Abstract:
Deep unfolding has emerged as a powerful solution for Multi-modal Image Super-Resolution (MISR) through strategic integration of cross-modal priors in network architecture. However, current deep unfolding approaches rely on first-order optimization, which exhibits limitations in learning efficiency and reconstruction accuracy. In this paper, to overcome these limitations, we propose a novel Semi-smooth Newton driven Unfolding network for MISR, namely SNUM-Net. Specifically, we first develop a Semi-smooth Newton-driven MISR (SNM) algorithm that establishes a theoretical foundation for our approach. Then, we unfold the iterative solution of SNM into a novel network. To the best of our knowledge, SNUM-Net is the first successful attempt to design a deep unfolding MISR network based on a second-order optimization algorithm. Compared to existing methods, SNUM-Net demonstrates three main advantages: 1) Universal paradigm: SNUM-Net provides a unified paradigm for diverse MISR tasks without requiring scenario-specific constraints; 2) Explainable framework: the network preserves a mathematical correspondence with the SNM algorithm, ensuring that the topological relationships between modules are readily explainable; 3) Superior performance: comprehensive evaluations across 10 datasets spanning 3 MISR tasks demonstrate the network’s exceptional reconstruction accuracy and generalization capability. The code is available at https://github.com/pandazcx/SNUM-Net

Abstract:
Since cross-modal hashing requires minimal storage and computation, it is becoming increasingly popular with the exponential growth of multimedia content on the internet. However, the lack of accurate supervisory data has curtailed the effectiveness of unsupervised hashing techniques. Conversely, supervised hashing strategies necessitate considerable human and financial resources for data annotation. To address this limitation, we propose a novel semi-supervised cross-modal hashing method called Enhanced Cross-Modal Hashing via Hybrid Distillation and Structural Refinement (HDSR). Specifically, we first learn the features of inter-modal and inter-instance similarity relationships through pointwise semantic alignment and listwise similarity partial order learning, respectively, to extract refined structural representations from partially labeled data. Secondly, by fusing inter-modal similarity to construct higher-order affinity matrices, we precisely delineate the semantic correlation information across cross-modal data, facilitating stable self-supervised training of unlabeled data through the application of momentum fusion strategies. Finally, the refined structural representation of labeled data is transferred into unlabeled branches through hybrid distillation, enhancing the performance of cross-modal hash learning by generating compact and accurate hash codes. The proposed HDSR is compared with several state-of-the-art deep cross-modal hashing methods on three widely used benchmark databases, and the experimental results verify its efficiency and superiority.

Abstract:
An image processing pipeline typically involves key operations like compression, denoising, and resizing, along with enhancements such as sharpening, histogram equalization, and low-light compensation. Within this pipeline, image artifacts are often introduced, which could severely degrade perceptual quality and mislead downstream vision tasks. Yet, current image quality assessment (IQA) models fail to distinguish between harmful artifacts and beneficial enhancements, as they generally apply a rigid fidelity criterion that penalizes all deviations from the reference image. We address this gap with the artifact peak signal-to-noise ratio (APSNR), a new IQA metric that adopts a selective fidelity criterion—allowing legitimate enhancements while penalizing only spurious artifacts. Specifically, APSNR detects artifacts by identifying pixels that violate an “artifact-free” intensity mapping between the processed and reference images, and then computes PSNR exclusively within the artifact-corrupted regions. Extensive experiments demonstrate that our APSNR consistently correlates with human perception of artifacts while remaining robust to enhancements. This enables a more nuanced evaluation of image processing algorithms and provides a principled tool for benchmarking artifact suppression.
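
The two steps described above (flag pixels that violate an intensity mapping, then compute PSNR only over the flagged pixels) can be sketched as follows. This is one possible reading of the abstract, not the published metric: the per-level median mapping, the tolerance `tol`, and the choice to measure the error against the mapped reference value are all assumptions, and the function names are hypothetical.

```python
import math
from statistics import median

def fit_intensity_mapping(reference, processed):
    """Assumed mapping: each reference level maps to the median processed level."""
    buckets = {}
    for r, p in zip(reference, processed):
        buckets.setdefault(r, []).append(p)
    return {r: median(ps) for r, ps in buckets.items()}

def artifact_psnr(reference, processed, tol=10, peak=255):
    """PSNR over pixels that deviate from the fitted mapping by more than tol."""
    mapping = fit_intensity_mapping(reference, processed)
    errors = [
        (p - mapping[r]) ** 2
        for r, p in zip(reference, processed)
        if abs(p - mapping[r]) > tol
    ]
    if not errors:
        return math.inf  # no pixel was flagged as an artifact
    mse = sum(errors) / len(errors)
    return 10 * math.log10(peak ** 2 / mse)

# Toy 1-D "image": a uniform +20 brightening (treated as an enhancement)
# plus one corrupted pixel in the last position.
reference = [50, 50, 50, 50, 100, 100, 100, 100]
processed = [70, 70, 70, 70, 120, 120, 120, 200]
score = artifact_psnr(reference, processed)
clean_score = artifact_psnr(reference, [r + 20 for r in reference])
```

In this toy case the global brightening alone is not penalized (`clean_score` is infinite), while the single corrupted pixel drives `score` down.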

Abstract:
3D medical images are volumetric data that provide spatial continuity and multi-dimensional information. These features provide rich anatomical context. However, their anisotropy may result in reduced image detail along certain directions. This can cause blurring or distortion between slices. In addition, global or local intensity inhomogeneities are often observed. This may be due to limitations of the imaging equipment, inappropriate scanning parameters, or variations in the patient’s anatomy. This inhomogeneity may blur lesion boundaries and may also mask true features, causing the model to focus on irrelevant regions. Therefore, a probability map-guided network for 3D volumetric medical image segmentation (3D-PMGNet) is proposed. The probability maps generated from the intermediate features are used as supervisory signals to guide the segmentation process. A new probability map reconstruction method is designed, combining dynamic thresholding with local adaptive smoothing. This enhances the reliability of high-response regions while suppressing low-response noise. A learnable channel-wise temperature coefficient is introduced to adjust the probability distribution to make it closer to the true distribution; in addition, a feature fusion method based on dynamic prompt encoding is developed. The response strength of the main feature maps is dynamically adjusted, and this adjustment is achieved through the spatial position encoding derived from the probability maps. The proposed method has been evaluated on four datasets. Experimental results show that the proposed method outperforms state-of-the-art 3D medical image segmentation methods. The source codes have been publicly released at https://github.com/ZHANGZIMENG01/3D-PMGNet
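
As a small illustration of the channel-wise temperature idea mentioned above, the sketch below applies a separate temperature to each channel's logit before a softmax. It is an assumption about the general form only: in the described network the temperatures would be learned, whereas here they are fixed numbers, and the function name is hypothetical.

```python
import math

def channelwise_temperature_softmax(logits, temperatures):
    """Softmax over channels where channel c is scaled by 1 / temperatures[c]."""
    if len(logits) != len(temperatures):
        raise ValueError("one temperature per channel is required")
    scaled = [z / t for z, t in zip(logits, temperatures)]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
plain = channelwise_temperature_softmax(logits, [1.0, 1.0, 1.0])
# Raising the temperature of channel 0 lowers its share of the probability mass.
softened = channelwise_temperature_softmax(logits, [2.0, 1.0, 1.0])
```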

Abstract:
Meta-metric learning has demonstrated strong performance in coarse-grained few-shot situations. However, despite their simplicity and availability, these metametrics are limited in effectively handling fine-grained few-shot scenarios. Fine-Grained Few-Shot Classification (FGFSC) presents significant challenges to the network’s ability to extract subtle features. Equipped with the symmetrical binocular perception system and complex neural networks in the brain, humans inherently possess exceptional and resilient meta-learning abilities, facilitating superior management of fine-grained few-shot scenarios. In this paper, inspired by the human binocular visual system, we pioneer the first human-like meta-metric paradigm: Binocular Singular Hellinger Metametric (BinoHeM). Functionally, BinoHeM incorporates advanced symmetric binocular feature encoding and recognition mechanisms. Structurally, it integrates two binocular sensing feature encoders, a singular Hellinger metametric, and two collaborative identification mechanisms. Building on this foundation, we introduce two innovative metametric variants: BinoHeM-KDL and BinoHeM-MTL. These are grounded in two advanced training mechanisms: knowledge distillation learning (KDL) and meta-transfer learning (MTL), respectively. Furthermore, we showcase the high accuracy and robust generalization capabilities of our approaches on four representative FGFSC benchmarks. Extensive comparative and ablation experiments have validated the efficiency and superiority of our paradigm over other state-of-the-art algorithms. Our code is publicly available at: https://github.com/ChaofeiQI/BinoHeM
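
For readers unfamiliar with the underlying distance, the sketch below implements the standard Hellinger distance between discrete distributions and uses it for nearest-prototype classification. It is not the singular Hellinger metametric itself, whose definition the abstract does not give; the helper names and the toy feature vectors are hypothetical.

```python
import math

def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2)

def normalize(features):
    """Turn non-negative features into a probability vector."""
    total = sum(features)
    return [f / total for f in features]

def classify(query, prototypes):
    """Return the label whose prototype is closest to the query."""
    q = normalize(query)
    return min(prototypes, key=lambda label: hellinger(q, normalize(prototypes[label])))

# Toy few-shot setting: one prototype feature vector per class.
prototypes = {"sparrow": [4.0, 1.0, 1.0], "finch": [1.0, 4.0, 1.0]}
predicted = classify([3.0, 1.0, 1.0], prototypes)
```

The distance is bounded between 0 (identical distributions) and 1 (disjoint supports), which makes it convenient as a similarity score between normalized feature vectors.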

Abstract:
Realistic 3D food generation plays a critical role in applications such as nutritional assessment, advertising, and virtual content creation. The existing text-to-3D models typically begin by initializing a 3D representation, which is subsequently refined using supervision from a text-to-image model to obtain the final 3D output. In this work, we present Food3D, a novel framework for 3D food generation designed to address two main limitations of current models. First, the limitation of initialization in 3D generation: poor initialization can result in the generated 3D food lacking crucial details and realism, thereby reducing its quality. To address this issue, we propose a generalized method named Food3D-G, which uses Mamba-based initialization to improve the starting point of the initialization process, thereby enhancing the visual fidelity and quality of the generated 3D food. Second, the limitation of text-to-image models: current text-to-3D models often rely on text-to-image models for supervision. However, a considerable gap persists between the generated images and real-world visuals, particularly when modeling complex food structures. These models fail to accurately capture the fine details and textures, which negatively impacts the quality and realism of the generated 3D food models. To address this limitation, we propose a customizable method for personalized 3D food generation, termed Food3D-C. This method employs a dual-branch diffusion model that effectively captures intricate details, particularly in complex food structures. Within the Food3D framework, both proposed methods incorporate 3D Gaussian splatting (3D GS) and a schedulable interval score matching (S-ISM) algorithm to enhance shape and texture generation. Extensive experiments demonstrate that Food3D achieves state-of-the-art performance, with substantial improvements in detail, shape accuracy, and overall visual realism.
Project page and source codes: https://yudongjian.github.io/Food3D/

Abstract:
Detecting camouflaged objects is challenging due to their high visual similarity to surrounding environments in texture, color, and shape. Traditional Camouflaged Object Detection (COD) methods heavily rely on pixel-level annotations, which are costly and time-consuming. Scribble-Supervised COD (SSCOD) has emerged as a more efficient alternative by using sparse scribble annotations. However, it faces two critical challenges: sparse annotations, compounded by the extreme similarity between foreground and background, cause entangled feature representations and inaccurate predictions in unlabeled regions, and existing SSCOD methods lack robustness to scale variations, resulting in inconsistent predictions across scales. To alleviate these challenges, we propose the Mutual Iterative Refinement Network (MIR-Net), which introduces a cross-branch mutual refinement mechanism to disentangle and enhance foreground and background features. MIR-Net incorporates two novel modules: Background-driven Foreground Feature Enhancement (BFFE) and Foreground-driven Background Feature Enhancement (FBFE), which dynamically suppress irrelevant cues and amplify relevant features. Additionally, we introduce a Scale-Invariant Consistency (SIC) loss that enforces stable and accurate predictions across scales, improving the model’s robustness to scale variations. Comprehensive experiments on CAMO, COD10K, and NC4K datasets demonstrate that MIR-Net achieves state-of-the-art performance among SSCOD methods, surpassing all fully supervised CNN-based models and demonstrating competitive performance with fully supervised Transformer-based approaches. These results highlight MIR-Net’s potential to advance COD under weak supervision.

Abstract:
It has been proven that introducing multiple guidance sources boosts image inpainting performance. However, existing methods primarily focus on local relationships and neglect the holistic interplay between guidance and texture information. Moreover, they lack an effective feedback mechanism to adaptively update the guidance process as corrupted texture information is progressively restored, potentially resulting in inconsistent inpainting. To tackle this issue, we propose a novel scheme aligned with pre-perception and cross-perception collaborative processes in human drawing. To mimic the pre-perception process, we introduce a pre-perceptual transformer block that captures long-range contextual dependencies and activates meaningful information to individually optimize image structures, semantic layouts, and textures, thereby effectively controlling their respective generation. To mimic the cross-perception collaborative process, we propose a cyclic cross-perceptual interaction to maintain consistency across the entire image regarding structure, layout, and texture while progressively refining their details. This interaction accounts for the global attention relationship between texture and other guidance sources (including image structure and semantic layout) to enhance image texture, alongside integrating a dedicated feedback mechanism to update guidance information. The proposed components are alternately deployed in three-branch decoders of the new scheme from rough to fine-grained levels to achieve these two iterative processes of human drawing. Experimental results prove the superiority of the proposed scheme over state-of-the-art methods across three datasets.

Abstract:
In this work, we aim to detect the changes caused by object variations in a scene represented by the neural radiance fields (NeRFs). Given an arbitrary view and two sets of scene images captured at different timestamps, we can predict scene changes in that view, which has significant potential applications in scene monitoring and measuring. We conducted preliminary studies and found that such an exciting task cannot be easily achieved by combining existing NeRFs and 2D change detection (CD) methods, which produces many false or missing detections. The main reason is that the 2D CD is based on the pixel appearance difference between spatially aligned image pairs and neglects the stereo information in the NeRF. To address the limitations, we propose the C-NeRF to represent scene changes as directional consistency difference-based NeRF, which mainly contains three modules. We first build two aligned NeRFs from pre-change and post-change scenes. Then, we identify the change points based on the direction-consistent constraint; that is, real change points have similar change representations across view directions, but fake change points do not. Finally, we design the change map rendering process based on the built NeRFs and can generate the change map of an arbitrarily specified view direction. To validate the effectiveness, we build a new dataset containing ten scenes covering diverse scenarios with different changing objects. Our approach surpasses state-of-the-art CD methods and NeRF-based methods by a significant margin.

Abstract:
Correspondence pruning aims to identify inliers from correspondences severely disturbed by outliers. Although Transformers and graph neural networks have shown impressive results in this field, they are either limited by a narrow receptive field or encounter quadratic computational complexity. To tackle this challenge, this work pioneers the integration of state space models into the correspondence pruning task, proposing a Mamba-based framework named MambaMatch. Specifically, to address the limitations of the Mamba architecture in local consensus modeling, we propose a multi-scale scanning strategy. It first employs an adaptive clustering algorithm to map the original correspondences into spatially coherent feature clusters, constructing a dual-representation space encompassing both full-scale and clustered-scale features. Bidirectional scan operations are then performed at both scales: 1) full-scale scan preserves global structural context, and 2) clustered-scale scan enhances local consistency. Subsequently, a Multi-Scale Interaction layer is designed to dynamically fuse dual-scale features via a cross-attention mechanism, further integrated with a Gated Feed-Forward Network to significantly improve the network’s feature discrimination capability. Extensive experiments validate that MambaMatch surpasses state-of-the-art approaches across multiple benchmarks for two-view geometry estimation. Furthermore, MambaMatch exhibits robust generalization across diverse scenarios, tasks, and feature extractors. The source code is available at: https://github.com/mxyttkx/MambaMatch

Abstract:
Sparsely single-point human parsing aims at segmenting the human body into fine-grained categories via weak labels (e.g., point-level, scribble-level, or image-level). The point-level label, especially single-point supervision, preserves spatial positions while requiring little annotation time, which is particularly advantageous in alleviating the human labeling burden. However, obtaining satisfactory parsing performance under limited sparse point annotations is challenging and requires further investigation. In this paper, we propose a novel end-to-end Point Evolution Hierarchy human parsing Network (PEHNet) for the fine-grained human parsing task that leverages only single-point supervision. Motivated by the concept of a divide-and-conquer strategy, we partition all pixels into three distinct groups, i.e., single-point labels, pseudo-region labels, and unlabeled pixels, then optimize each group with suitable mechanisms. To expand the coverage of single-point labels, we introduce a point dissemination module that generates high-quality pseudo-region labels. Furthermore, the point-level spatial position information inherently preserves the structural characteristics of the human body. Inspired by this hierarchical property, we devise a point-level human hierarchy-wise constraint that guides the prediction probabilities to align with the inherent hierarchy of the human body. Experimental results demonstrate that the proposed PEHNet outperforms state-of-the-art parsing methods on two popular human parsing benchmark datasets (LIP and ATR) and one semantic segmentation dataset (Pascal VOC 2012).

Abstract:
Prompt learning has made significant progress in vision-language models (VLMs), enabling pre-trained models like CLIP to perform cross-domain tasks with few-shot or even zero-shot learning. However, existing methods tend to overfit the training data after fine-tuning on the target domain, leading to a decline in generalization ability and limiting their performance on unseen categories. To address these challenges, we propose a multi-regularization guided knowledge distillation towards generalizable prompt learning. This approach enhances the model’s adaptability and generalization through different stages of regularization while mitigating performance degradation caused by target domain training. Specifically, within the image encoder of CLIP, we introduce Residual Regularization, which binds additional residual connections to certain transformer blocks. This design provides greater flexibility, allowing the model to adjust to new data distributions when adapting to the target domain. Furthermore, during training, we impose Self-distillation Regularization to ensure that while adapting to the target domain, the model preserves its prior generalization knowledge. Specifically, we regularize the intermediate layer outputs of transformer blocks to prevent the model from excessively favoring target domain data. Additionally, we employ an unsupervised knowledge distillation strategy to enforce multi-level alignment between the teacher and student models by Direction Distillation Regularization. This ensures that both models maintain consistent visual feature orientations under the same textual features, thereby enhancing overall model stability and cross-domain adaptability. Experimental results demonstrate that our method achieves more stable classification performance in both cross-domain few-shot classification and domain adaptation settings.

Affiliations: School of Mathematical Sciences, Beihang University, Beijing, China; School of Biomedical Engineering, Shenzhen University, Shenzhen, China; CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China; School of Information Science and Engineering, Ningbo University, Ningbo, China; School of Artificial Intelligence, Dalian University of Technology, Dalian, China; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.

Abstract:
Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.

Abstract:
Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models, and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
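
For reference, the inclusion-exclusion principle invoked above has the following standard form for finite sets. Reading $A_b$ as the set of spatial interactions found in band $b$, with $B$ bands in total, is an illustrative interpretation of the abstract and not necessarily the exact formulation used in the paper.

```latex
\left| \bigcup_{b=1}^{B} A_b \right|
  = \sum_{\emptyset \neq S \subseteq \{1,\dots,B\}}
    (-1)^{|S|+1} \left| \bigcap_{b \in S} A_b \right|,
\qquad\text{e.g.}\qquad
|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2| .
```

The single-band terms correspond to band-specific cues, while the intersection terms account for cues shared by several bands so that they are not counted more than once.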

Abstract:
Existing hyperspectral fusion computational imaging methods primarily rely on using high-resolution multispectral images (HRMSI) to provide spatial details for low-resolution hyperspectral images (LRHSI), thereby enabling the reconstruction of hyperspectral images. However, these methods are often limited by the low spectral resolution of the HRMSI, making the sampled tensors unable to provide effective information for the LRHSI in a finer spectral range. To achieve more accurate computational imaging results, we propose a Heterospectral Structure Compensation Sampling (HSC-sampling) mechanism. Unlike traditional spatial sampling methods, which directly calculate the interpolation between adjacent pixels, this mechanism analyzes the structural complementarity among different bands in LRHSI. It utilizes the information from other bands to compensate for the missing details in the current band. Additionally, a novel Multi-phase Mixed Modeling (M2M) approach is designed, expanding the model’s analytical capabilities into multiple phases to accommodate the high-dimensional nature of HSI data. Specifically, it extracts fusion features from three phases and organizes the generated features along with the input features into a multi-variate mixed cube based on phase relationships, thereby capturing feature correlations across different phases. Based on the HSC-sampling mechanism and the M2M approach, we construct a Merging Residual Concatenation (MRC) hyperspectral fusion computational imaging network. Compared to other state-of-the-art methods, this network achieves significant improvements in fusion performance across multiple datasets. Moreover, the effectiveness of the HSC-sampling mechanism has been demonstrated in various hyperspectral imaging tasks. Code is available at: https://github.com/1318133/HSC-Sampling

Abstract:
Natural images are often degraded by complex, composite degradations such as rain, snow, and haze, which adversely impact downstream vision applications. While existing image restoration efforts have achieved notable success, they are still hindered by two critical challenges: limited generalization across dynamically varying degradation scenarios and a suboptimal balance between preserving local details and modeling global dependencies. To overcome these challenges, we propose M2Restore, a novel Mixture-of-Experts (MoE)-based Mamba-CNN fusion framework for efficient and robust all-in-one image restoration. M2Restore introduces three key contributions: First, to boost the model’s generalization across diverse degradation conditions, we exploit a CLIP-guided MoE gating mechanism that fuses task-conditioned prompts with CLIP-derived semantic priors. This mechanism is further refined via cross-modal feature calibration, which enables precise expert selection for various degradation types. Second, to jointly capture global contextual dependencies and fine-grained local details, we design a dual-stream architecture that integrates the localized representational strength of CNNs with the long-range modeling efficiency of Mamba. This integration enables collaborative optimization of global semantic relationships and local structural fidelity, preserving global coherence while enhancing detail restoration. Third, we introduce an edge-aware dynamic gating mechanism that adaptively balances global modeling and local enhancement by reallocating computational attention to degradation-sensitive regions. This targeted focus leads to more efficient and precise restoration. Extensive experiments across multiple image restoration benchmarks validate the superiority of M2Restore in both visual quality and quantitative performance. Code is available at https://github.com/yz-wang/M2Restore.

Abstract:
Existing methods based on graph convolutional networks often struggle with large-scale graphs due to their high computational cost and inefficiency. Although strategies such as edge sparsification and node sampling can indeed decrease the complexity, they frequently result in information loss and local information bias. Furthermore, in multi-view scenarios, traditional multi-view fusion methods are unable to simultaneously account for both inter-view consistency and intra-view diversity, thus constraining model performance. In this paper, we propose a multi-view block-wise graph convolutional network that effectively addresses the challenges posed by large-scale graphs while exploiting the complementary nature of multi-view information. Specifically, we implement a node segmentation module to partition nodes into view-specific subsets, thereby diminishing computational complexity while preserving local structural information. To enhance feature extraction, plentiful subgraph representations are captured within blocks by alternating graph convolution with graph structure learning under a shared-weight strategy. Finally, the global fusion module introduces a cross-view inter-block loss that progressively aligns block representations across views, alleviates over-smoothing, and yields a consistent and comprehensive common representation. Extensive experiments on diverse large-scale graph datasets demonstrate that the proposed method not only outperforms state-of-the-art approaches in multi-view semi-supervised classification but also exhibits superior scalability and memory efficiency.

Abstract:
In this paper, we propose UniqueSplat, a view-conditioned feed-forward 3D Gaussian Splatting model to reconstruct customized 3D radiance fields for each view query. Existing feed-forward methods such as pixelSplat and MVSplat aim to generate fixed Gaussians across all views of each scene by minimizing the error between rendered views and ground-truth images. However, such fixed Gaussians generally render images from all views and lack the ability to adapt to specific viewpoints, as they do not incorporate target view information when predicting Gaussians. To address this, our UniqueSplat learns the view-conditioned information as a prior and incorporates this knowledge into network parameters, so that Gaussians are dynamically adjusted in accordance with different views. Specifically, we propose a two-branch view-conditioned hyperNetwork to simultaneously learn view-agnostic embeddings and view-specific knowledge, which not only explores the shareable knowledge from various views, but also adapts the model to specific views at test time. Extensive experiments on widely-used datasets including RealEstate10K, ACID and DTU demonstrate the superiority of UniqueSplat over the state-of-the-art methods. Moreover, UniqueSplat encouragingly outperforms existing methods in cross-dataset evaluation, showing its notable generalization ability.

Abstract:
Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA
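
The exponential moving average (EMA) update mentioned above can be sketched as follows. The EMA formula itself is standard; the rule that adapts the momentum to a confidence score is purely hypothetical, since the abstract does not specify how the self-adaptive strategy is defined, and the bounds `m_min` and `m_max` are illustrative values.

```python
def ema_update(teacher, student, momentum):
    """Standard EMA: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def adaptive_momentum(confidence, m_min=0.9, m_max=0.999):
    """Hypothetical rule: low-confidence batches move the teacher less."""
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return m_max - (m_max - m_min) * confidence

teacher = [1.0, 1.0]
student = [0.0, 2.0]
# A confident batch uses the smallest momentum, so the teacher moves the most.
updated_confident = ema_update(teacher, student, adaptive_momentum(1.0))
# An unreliable batch uses the largest momentum, so the teacher barely moves.
updated_noisy = ema_update(teacher, student, adaptive_momentum(0.0))
```

Tying the momentum to reliability is one simple way to limit how much noisy pseudo-labels can shift the averaged model.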

Abstract:
The vast accessibility of Synthetic Aperture Radar (SAR) images through online portals has propelled research across various fields. This widespread use and easy availability have unfortunately made SAR data susceptible to malicious alterations, such as local editing applied to the images to insert sensitive targets or conceal their presence. To counter malicious manipulations, in recent years the forensic community has begun to dig into the SAR manipulation issue, proposing detectors that effectively localize the tampering traces in amplitude images. Nonetheless, in this paper we demonstrate that an expert practitioner can exploit the complex nature of SAR data to obscure any signs of manipulation within a locally altered amplitude image. We refer to this approach as a counter-forensic attack. To achieve the concealment of manipulation traces, the attacker can simulate a re-acquisition of the manipulated scene by the SAR system that initially generated the pristine image. In doing so, the attacker can obscure any evidence of manipulation, making it appear as if the image was legitimately produced by the system. This attack has unique features that make it both highly generalizable and relatively easy to apply. First, it is a black-box attack, meaning it is not designed to deceive a specific forensic detector. Furthermore, it does not require a training phase and is not based on adversarial operations. We assess the effectiveness of the proposed counter-forensic approach across diverse scenarios, examining various manipulation operations. The obtained results indicate that our devised attack successfully eliminates traces of manipulation, deceiving even the most advanced forensic detectors.

Abstract:
The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.

Abstract:
Online image super-resolution (SR) services have been widely used in applications such as Remini and DeepAI. However, the exposure of plaintext images raises serious privacy concerns. While secure CNN inference techniques are employed to protect images in image classification, they are not applicable to the unique challenges posed by image SR: the output resolution is significantly higher than that of the input image. In this paper, we present a secure CNN inference scheme for image SR by employing a multiple ciphertext encapsulation method. We begin by designing fundamental homomorphic operations, including addition, multiplication, and rotation across ciphertexts. Recognizing that image SR typically involves an upsampling layer—unlike image classification—we propose a fast algorithm for secure upsampling. This technique leverages pre-weight block masking and cross-ciphertext rotation, resulting in a significant speedup compared to direct homomorphic upsampling. We then present an efficient batched homomorphic two-dimensional convolution method across ciphertexts, incorporating kernel rearrangement and merging strategies. We also design a polynomial activation function specifically optimized for image SR, further enhancing performance. Extensive experiments demonstrate that our HE-friendly SR network outperforms existing secure solutions, while the proposed multiple ciphertext encapsulation technique achieves at least a 2x improvement in both computational efficiency and memory usage.

Abstract:
In this paper, we propose an efficient blind image deblurring algorithm for uniform blur based on the second-order cross-partial derivative (CPD). The proposed method consists of two stages. We first apply a novel blur kernel estimation method to quickly estimate the blur kernel. Then, we use the estimated kernel to perform non-blind deconvolution to restore the image. A key discovery of the proposed kernel estimation method is that the blur kernel information is usually embedded in the CPD image of the blurred image. By exploiting this property, we propose a pipeline to extract a set of kernel candidates directly from the CPD image and then select the most suitable kernel as the estimated blur kernel. Since our kernel estimation method can obtain a fairly accurate blur kernel, we can achieve effective image restoration using a relatively simple Tikhonov regularization in the subsequent non-blind deconvolution process. To improve the quality of the restored image, we further adopt an efficient filtering technique to suppress periodic artifacts that may appear in the restored images. Experimental results demonstrate that our algorithm can efficiently restore high-quality sharp images on standard CPUs without relying on GPU acceleration or parallel computation. For blurred images of approximately 800 × 800 resolution, the proposed method can complete image deblurring within 1 to 5 seconds, which is significantly faster than most state-of-the-art methods. Our MATLAB code is available at https://github.com/e11tkcee06-a11y/CPD-Deblur.git.
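The non-blind stage described above relies on Tikhonov regularization. As a minimal sketch (not the authors' MATLAB implementation), the Python code below shows the textbook frequency-domain closed form for zero-order Tikhonov deconvolution, X = conj(K) · Y / (|K|² + λ). It assumes circular boundary conditions and an already known kernel; the function names and the default `lam` are choices made for this example.

```python
import numpy as np


def psf_to_otf(kernel, shape):
    """Zero-pad the kernel to `shape` and circularly shift its center to (0, 0)."""
    padded = np.zeros(shape, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)


def tikhonov_deconv(blurred, kernel, lam=1e-3):
    """Closed-form minimizer of ||k * x - y||^2 + lam * ||x||^2 (circular convolution)."""
    K = psf_to_otf(kernel, blurred.shape)
    Y = np.fft.fft2(blurred)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))
```

The whole restoration costs two forward FFTs and one inverse FFT, which is consistent with the abstract's point that an accurate kernel estimate lets a simple regularizer do the rest on a CPU. A larger `lam` suppresses noise amplification at frequencies where the kernel response is small, at the cost of a smoother result.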

Abstract:
Image compression distortion can degrade the performance of machine analysis tasks, and recent years have therefore witnessed fast progress in developing deep image compression methods optimized for machine perception. However, saliency segmentation remains under-investigated in this context. First, in this paper we propose a deep compression network that increases the local signal fidelity of important image pixels for saliency segmentation, in contrast to existing methods that use the analysis network loss for backward propagation. In this way, the two types of networks are decoupled, which improves the compatibility of the proposed compression method with diverse saliency segmentation networks. Second, pixel-level bit weights are modeled with a probability distribution in the proposed bit allocation method. The ascending cosine roll-down (ACRD) function allocates bits to the important pixels, which fits the view of saliency segmentation as a pixel-level binary classification task. Third, the compression network is trained without the help of saliency segmentation, with latent representations decomposed into base and enhancement channels. Base channels are retained over the whole image, while enhancement channels are used only for important pixels, so that more bits can benefit saliency segmentation via the enhancement channels. Extensive experimental results demonstrate that the proposed method saves an average of 10.34% bitrate compared with the state-of-the-art deep image compression method, where the rate-accuracy (R-A) performance is evaluated on sixteen downstream saliency segmentation networks with five conventional SOD datasets. The code will be available at: https://openi.pcl.ac.cn/OpenAICoding/SaliencyIC and https://github.com/AkeLiLi/SaliencyIC.

Abstract:
Pyramid Temporal Hierarchy Network (PTH-Net) is a new paradigm for dynamic facial expression recognition, applied directly to raw videos, without face detection and alignment. Unlike the traditional paradigm, which focuses only on facial areas and often overlooks valuable information like body movements, PTH-Net preserves more critical information. It does this by distinguishing between backgrounds and human bodies at the feature level, offering greater flexibility as an end-to-end network. Specifically, PTH-Net utilizes a pre-trained backbone to extract multiple general features of video understanding at various temporal frequencies, forming a temporal feature pyramid. It then further expands this temporal hierarchy through differentiated parameter sharing and downsampling, ultimately refining emotional information under the supervision of expression temporal-frequency invariance. Additionally, PTH-Net features an efficient Scalable Semantic Distinction layer that enhances feature discrimination, helping to better identify target expressions versus non-target ones in the video. Finally, extensive experiments demonstrate that PTH-Net performs excellently on eight challenging benchmarks, with lower computational costs compared to previous methods. The source code is available at https://github.com/lm495455/PTH-Net.

Abstract:
We conducted a large-scale subjective study of the perceptual quality of User-Generated Mobile Video Content on a set of mobile-originated videos obtained from ShareChat, a social media platform widely used across India. The content viewed by volunteer human subjects under controlled laboratory conditions has the benefit of culturally diversifying the existing corpus of User-Generated Content (UGC) video quality datasets. There is a great need for large and diverse UGC-VQA datasets, given the explosive global growth of the visual internet and social media platforms. This is particularly true in regard to videos obtained by smartphones, especially in rapidly emerging economies like India. ShareChat provides a safe and cultural community oriented space for users to generate and share content in their preferred Indian languages and dialects. Our subjective quality study, which is based on this data, supplies much needed cultural, visual, and language diversification to the overall shareable corpus of video quality data. We expect that this new data resource will also allow for the development of systems that can predict the perceived visual quality of Indian social media videos, and in this context, control scaling and compression protocols for streaming, provide better user recommendations, and guide content analysis and processing. We demonstrate the value of the new data resource by conducting a study of leading No-Reference Video Quality Assessment (NR-VQA) models on it, including a simple new model, called MoEVA, which deploys a mixture of experts to predict video quality. Both the new LIVE-ShareChat Database and sample source code for MoEVA are being made freely available to the research community at https://github.com/sandeep-sm/LIVE-SC.

Abstract:
Regression-based 3D human pose and shape estimation often falls into one of two different paradigms. Parametric approaches, which regress the parameters of a human body model, tend to produce physically plausible results that are misaligned with the image. In contrast, non-parametric approaches directly regress human mesh vertices, resulting in pixel-aligned but implausible predictions. In this paper, we consider these two paradigms together for a better overall estimation. To this end, we propose a novel HYbrid REgressor (HYRE) that greatly benefits from the joint learning of both paradigms. The core of our HYRE is a hybrid intermediary across paradigms that provides complementary clues to each paradigm at the shared feature level and fuses their results at the part-based decision level, thereby bridging the gap between the two. We demonstrate the effectiveness of the proposed method through both quantitative and qualitative experimental analyses, resulting in improvements for each approach and ultimately leading to better hybrid results. Our experiments show that HYRE outperforms previous methods on challenging 3D human pose and shape benchmarks.

Abstract:
Image fusion facilitates the integration of information from various source images of the same scene into a composite image, thereby benefiting perception, analysis, and understanding. Recently, diffusion models have demonstrated impressive generative capabilities in the field of computer vision, suggesting significant potential for application in image fusion. The forward process in the diffusion models requires the gradual addition of noise to the original data. However, typical unsupervised image fusion tasks (e.g., infrared-visible, medical, and multi-exposure image fusion) lack ground truth images (corresponding to the original data in diffusion models), thereby preventing the direct application of the diffusion models. To address this problem, we propose a versatile diffusion model-based unsupervised framework for image fusion, termed as VDMUFusion. In the proposed method, we integrate the fusion problem into the diffusion sampling process by formulating image fusion as a weighted average process and establishing appropriate assumptions about the noise in the diffusion model. To simplify the training process, we propose a multi-task learning framework that replaces the original noise prediction network, allowing for simultaneous prediction of noise and fusion weights. Meanwhile, our method employs joint training across various fusion tasks, which significantly improves noise prediction accuracy and yields higher quality fused images compared to training on a single task. Extensive experimental results demonstrate that the proposed method delivers very competitive performance across various image fusion tasks. The code is available at https://github.com/yuliu316316/VDMUFusion.

Abstract:
High Dynamic Range (HDR) images present unique challenges for Learned Image Compression (LIC) due to their complex domain distribution compared to Low Dynamic Range (LDR) images. In coding practice, HDR-oriented LIC typically adopts preprocessing steps (e.g., perceptual quantization and tone mapping operation) to align the distributions between LDR and HDR images, which inevitably comes at the expense of perceptual quality. To address this challenge, we rethink the HDR imaging process which involves fusing multiple exposure LDR images to create an HDR image and propose a novel HDR image compression paradigm, Unifying Imaging and Compression (HDR-UIC). The key innovation lies in establishing a seamless pipeline from image capture to delivery and enabling end-to-end training and optimization. Specifically, a Mixture-ATtention (MAT)-based compression backbone merges LDR features while simultaneously generating a compact representation. Meanwhile, the Reference-guided Misalignment-aware feature Enhancement (RME) module mitigates ghosting artifacts caused by misalignment in the LDR branches, maintaining fidelity without introducing additional information. Furthermore, we introduce an Appearance Redundancy Removal (ARR) module to optimize coding resource allocation among LDR features, thereby enhancing the final HDR compression performance. Extensive experimental results demonstrate the efficacy of our approach, showing significant improvements over existing state-of-the-art HDR compression schemes. Our code is available at: https://github.com/plf1999/HDR-UIC.

Affiliations: School of Information Science and Engineering, Shandong Normal University, Jinan, China; College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan, China; School of Information Science and Engineering, Shandong University, Qingdao, China; Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia; Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA

Abstract:
Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity and distortion in text appearance and localization. Attention-based methods have become the mainstream due to their superior vocabulary learning and observation ability. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most works focus on correcting attention drift in decoding but completely ignore the error accumulated during the encoding process. In this paper, we propose a novel scheme, called the Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which can mitigate the attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM) that uses the core areas of characters to recursively guide attention to learn in the encoding process. With precise attention information sourced from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) to guarantee decoding performance and improve decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably augments the recognition performance of the model. Experiments conducted on public benchmarks show state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.

Abstract:
Despite the remarkable progress in synthetic aperture radar automatic target recognition (SAR ATR), recent efforts have concentrated on detecting and classifying a specific category, e.g., vehicles, ships, airplanes, or buildings. One of the fundamental limitations of the top-performing SAR ATR methods is that the learning paradigm is supervised, task-specific, limited-category, closed-world learning, which depends on massive amounts of accurately annotated samples that are expensively labeled by expert SAR analysts and have limited generalization capability and scalability. In this work, we make the first attempt towards building a foundation model for SAR ATR, termed SARATR-X. SARATR-X learns generalizable representations via self-supervised learning (SSL) and provides a cornerstone for label-efficient model adaptation to generic SAR target detection and classification tasks. Specifically, SARATR-X is trained on 0.18 M unlabelled SAR target samples, which are curated by combining contemporary benchmarks and constitute the largest publicly available dataset till now. Considering the characteristics of SAR images, a backbone tailored for SAR ATR is carefully designed, and a two-step SSL method endowed with multi-scale gradient features was applied to ensure the feature diversity and model scalability of SARATR-X. The capabilities of SARATR-X are evaluated on classification under few-shot and robustness settings and detection across various categories and scenes, and impressive performance is achieved, often competitive with or even superior to prior fully supervised, semi-supervised, or self-supervised algorithms. Our SARATR-X and the curated dataset are released at https://github.com/waterdisappear/SARATR-X to foster research into foundation models for SAR image interpretation.

Abstract:
Local Binary Pattern (LBP) and its variants have had considerable success in a wide range of computer vision and pattern recognition applications, especially in tasks related to texture classification. However, the LBP method is sensitive to noise and scale variations and is unable to capture macro-structure information. We propose a novel texture classification descriptor called Scale Adaptive Robust LBP (SARLBP) that enhances macro-level descriptive information by incorporating significantly larger scales, and a novel encoding scheme, which is designed to overcome the limitations of traditional LBP schemes. The SARLBP method dynamically determines a single optimal scale for each radial direction from multiple scales based on the local area’s characteristics. Subsequently, this descriptor extracts four distinct patterns derived from regional image medians of center pixel, radially-optimized neighbor pixels, optimized fixed scale-based pixels, and radial-difference-based pixels. This method adeptly captures texture information at both micro and macro scales by employing scale adaptation based on the distinctive attributes of the local region. As a result, it provides a comprehensive and robust representation of the texture images. Extensive experimentation was conducted on four publicly available texture databases (ALOT, CUReT, UMD, and Kylberg), considering both the presence and absence of two distinct types of interference (Gaussian noise and Salt-and-Pepper noise). The results reveal that our SARLBP method achieves significantly better performance than other state-of-the-art LBP variants with a fixed smaller feature dimension.
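For context on the baseline that SARLBP extends, the following Python sketch computes the classic 8-neighbor LBP code for a single 3×3 patch. It illustrates only the traditional operator, not the SARLBP descriptor. The neighbor ordering and the `>=` thresholding convention are common choices assumed here; other orderings appear in the literature.

```python
def lbp_code(patch):
    """Classic 8-neighbor LBP code of a 3x3 patch (a list of 3 rows of 3 numbers)."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # Each neighbor contributes one bit: 1 if it is at least as bright as the center.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code
```

Because every bit depends on a single pixel comparison at a fixed radius of one pixel, a small intensity perturbation in one neighbor can change the code, and structure larger than the 3×3 window is invisible to it. These are the noise and macro-structure limitations the abstract refers to.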

Abstract:
The latest versatile video coding (VVC) standard proposed by the Joint Video Exploration Team (JVET) has significantly improved coding efficiency compared to that of its predecessor, while increasing computational complexity by 6 to 26 times. The quad-tree plus multi-type tree (QTMT)-based coding unit (CU) partition accounts for most of the encoding time in VVC encoding. This paper proposes a data-driven fast CU partition approach based on an efficient Transformer model to accelerate VVC inter-coding. First, we establish a large-scale database for inter-mode VVC, comprising diverse CU partition patterns from more than 800 raw video sequences across various resolutions and contents. Next, we propose a deep neural network model with a Transformer-based temporal topology for predicting the CU partition, named TCP-Net, which is adaptive to the group of pictures (GOP) hierarchy in VVC. Then, we design a two-stage structured output for TCP-Net, reflecting both the locations of CU edges and the split modes of all possible CUs. Accordingly, we develop a dual-supervised optimization mechanism to train the TCP-Net model with improved accuracy. The experimental results verify that our approach can reduce the encoding time by 46.89% to 55.91% with negligible rate-distortion (RD) degradation, outperforming other state-of-the-art approaches.

Abstract:
The reconstruction of limited data computed tomography (CT) aims to obtain high-quality images from a reduced set of projection views acquired from sparse views or limited angles. This approach is utilized to reduce radiation exposure or expedite the scanning process. Deep Learning (DL) techniques have been incorporated into limited data CT reconstruction tasks and achieve remarkable performance. However, these DL methods suffer from various limitations. Firstly, the distribution inconsistency between the simulation data and the real data hinders the generalization of these DL-based methods. Secondly, these DL-based methods could be unstable due to lack of kernel awareness. This paper addresses these issues by proposing an unrolling framework called Progressive Artifact Image Learning (PAIL) for limited data CT reconstruction. The proposed PAIL primarily consists of three key modules, i.e., a residual domain module (RDM), an image domain module (IDM), and a wavelet domain module (WDM). The RDM is designed to refine features from residual images and suppress the observable artifacts from the reconstructed images. This module could effectively alleviate the effects of distribution inconsistency among different data sets by transferring the optimization space from the original data domain to the residual data domain. The IDM is designed to suppress the unobservable artifacts in the image space. The RDM and IDM collaborate with each other during the iterative optimization process, progressively removing artifacts and reconstructing the underlying CT image. Furthermore, in order to avoid the potential hallucinations generated by the RDM and IDM, an additional WDM is incorporated into the network to enhance its stability. This is achieved by making the network kernel-aware via integrating wavelet-based compressed sensing.
The effectiveness of the proposed PAIL method has been consistently verified on two simulated CT data sets, a clinical cardiac data set and a sheep lung data set. Compared to other state-of-the-art methods, the proposed PAIL method achieves superior performance in various limited data CT reconstruction tasks, demonstrating its promising generalization and stability.

Abstract:
The balance between accuracy and computational efficiency is crucial for the applications of deep learning-based stereo matching algorithms in real-world scenarios. Since matching cost aggregation is usually the most computationally expensive component, a common practice is to construct cost volumes at a low resolution for aggregation and then directly regress a high-resolution disparity map. However, current solutions often suffer from limitations such as the loss of discriminative features caused by downsampling operations that treat all pixels equally, and spatial misalignment resulting from repeated downsampling and upsampling. To overcome these challenges, this paper presents two sampling strategies: the Adaptive Downsampling Module (ADM) and the Disparity Alignment Module (DAM), to prioritize real-time inference while ensuring accuracy. The ADM leverages local features to learn adaptive weights, enabling more effective downsampling while preserving crucial structure information. On the other hand, the DAM employs a learnable interpolation strategy to predict transformation offsets of pixels, thereby mitigating the spatial misalignment issue. Building upon these modules, we introduce ADStereo, a real-time yet accurate network that achieves highly competitive performance on multiple public benchmarks. Specifically, our ADStereo runs over 5× faster than the current state-of-the-art CREStereo (0.054 s vs. 0.29 s) on the same hardware while achieving comparable accuracy (1.82% vs. 1.69%) on the KITTI stereo 2015 benchmark. The code is available at: https://github.com/cocowy1/ADStereo.

Abstract:
Numerous representation-based classification (RC) methods have been developed for face recognition due to their decent model interpretability and robustness against noise. Most existing RC methods primarily characterize the gray-scale reconstruction error image (single-channel data) in two ways: the one-dimensional (1D) pixel-based error model and the two-dimensional (2D) gray-scale image-matrix-based error model. The former measures the reconstruction error pixel by pixel, while the latter leverages 2D structural information of the gray-scale error image, such as the low-rank property. However, when applying these methods to different color channels of a test color face image (multi-channel data) separately and independently, they neglect the three-dimensional (3D) structural correlations among distinct color channels. In real-world scenarios, face images are often contaminated with complex noise, including contiguous occlusion and random pixel corruption, which pose significant challenges to these approaches and can lead to a decline in performance. In this paper, we propose a Tensor Nuclear Norm based Robust Multi-channel Atomic Representation (TNN-RMAR) framework with application to color face recognition. The proposed method has the following three critical ingredients: 1) We propose a 3D color image-tensor-based error model, which can take full advantage of the 3D structural information of the color error image. 2) To leverage the 3D structural information of the color error image, we model it as a third-order tensor $\mathcal{E}$ and exploit its low-rank property with the tensor nuclear norm. Given that multiple color channels in a color image are generally corrupted at the same positions, we design a tube-wise tailored loss function to further leverage its tube-wise structure.
3) We devise the multi-channel atomic norm (MAN) regularization for the representation coefficient matrix, which allows us to jointly harness the correlation information of coefficients in different color channels. In addition, we devise an efficient algorithm to solve the TNN-RMAR framework based on the alternating direction method of multipliers (ADMM). By leveraging TNN-RMAR as a general platform, we also develop several novel robust multi-channel RC methods. Experimental results on benchmark real-world databases validate the effectiveness and robustness of the proposed framework for robust color face recognition.

Abstract:
Current scene parsers have effectively distilled abstract relationships among refined instances, while overlooking the discrepancies arising from variations in scene depth. Hence, their potential to imitate the intrinsic 3D perception ability of humans is constrained. In accordance with the principle of perspective, we advocate first grading the depth of the scenes into several slices, and then digging semantic correlations within a slice or between multiple slices. Two attention-based components, namely the Scene Depth Grading Module (SDGM) and the Edge-oriented Correlation Refining Module (EoCRM), comprise our framework, the Line-of-Sight Depth Network (LoSDN). SDGM grades scene into several slices by calculating depth attention tendencies based on parameters with explicit physical meanings, e.g., albedo, occlusion, specular embeddings. This process allocates numerous multi-scale instances to each scene slice based on their line-of-sight extension distance, establishing a solid groundwork for ordered association mining in EoCRM. Since the primary step in distinguishing distant faint targets is boundary delineation, EoCRM implements edge-wise saliency quantification and association digging. Quantitative and diagnostic experiments on Cityscapes, ADE20K, and PASCAL Context datasets reveal the competitiveness of LoSDN and the individual contribution of each highlight. Visualizations display that our strategy offers clear benefits in detecting distant, faint targets.

Abstract:
Open-Set Domain Adaptation (OSDA) aims at adapting a model trained on a labelled source domain, to an unlabeled target domain that is corrupted with unknown classes. The key challenge inherent to this open-set setting is therefore how best to avoid the negative transfer incurred by unknown classes during model adaptation. Most existing works tackle this challenge by simply pushing the entire unknown classes away. In this paper, we take a different stance – instead of addressing these unknown classes as a single entity, we “reserve” in-between spaces for their subsets in the learned embedding. Our key finding is that the inter-class relations learned off the source domain, can help to enforce class separations in the target domain – thereby reserving spaces for unknown classes. More specifically, we first prep the “reservation” by tightening the known-class representations while enlarging their inter-class margin. We then learn soft-label prototypes in the source domain to facilitate the discrimination of known and unknown samples in the target domain. It follows that these two steps are iterated at each epoch in a mutually beneficial manner – better discrimination of unknown samples helps with space reservation, and vice versa. We show state-of-the-art results on four standard OSDA datasets, Office-31, Office-Home, VisDA and ImageCLEF, and conduct further analysis to help understand our method. Codes are available at: https://github.com/PRIS-CV/Reserve_to_Adapt

Abstract:
For three-dimensional (3D) imaging based on fringe projection profilometry (FPP), maximum fringe frequency selection and fringe frequency allocation have a significant impact on the accuracy and robustness of 3D imaging. In this paper, we conduct a detailed analysis of the wrapped phase error, and analyze the phase unwrapping reliability in three-frequency temporal phase unwrapping (TPU). Since different measurement systems and scenes have different maximum sampling frequencies, we introduce a maximum frequency selection approach in this work. In order to ensure the overall phase unwrapping reliability, we introduce an optimal frequency allocation approach. Experimental results show the validity of the proposed approach. The research in this paper will help to improve the accuracy and robustness of FPP in practical 3D measurement.
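To make the role of frequency allocation concrete, the sketch below implements the standard hierarchical form of three-frequency temporal phase unwrapping in Python. It assumes the lowest frequency spans the field of view in a single period, so its phase needs no unwrapping, and it uses the usual fringe-order formula k = round((f_h / f_l · Φ_l − φ_h) / 2π). The function names are illustrative, and this is not the paper's frequency selection or allocation method.

```python
import math


def unwrap_with_reference(phi_high, phi_low_abs, f_high, f_low):
    """Unwrap phi_high using the absolute phase phi_low_abs of a lower frequency."""
    # The scaled low-frequency phase predicts the absolute high-frequency phase;
    # rounding the difference to a whole number of periods gives the fringe order k.
    k = round(((f_high / f_low) * phi_low_abs - phi_high) / (2.0 * math.pi))
    return phi_high + 2.0 * math.pi * k


def three_frequency_unwrap(phi1, phi2, phi3, f1, f2, f3):
    """Hierarchical unwrapping f1 -> f2 -> f3, assuming phi1 is already absolute."""
    phi2_abs = unwrap_with_reference(phi2, phi1, f2, f1)
    return unwrap_with_reference(phi3, phi2_abs, f3, f2)
```

The rounding step succeeds only while the scaled phase error stays below π, that is, while (f_h / f_l) times the low-frequency phase error is less than π. A larger frequency ratio amplifies noise in the fringe order, which is why the choice of intermediate and maximum frequencies affects unwrapping reliability.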

Abstract:
Who, What and Where (3W) are the three core elements of storytelling, and accurately identifying the 3W semantics is critical to understanding the story in a video. This paper studies the 3W composite-semantics video Instance Search (INS) problem, which aims to find video shots about a specific person doing a concrete action in a particular location. The popular Complete-Decomposition (CD) methods divide a composite-semantics query into multiple single-semantics queries, which are likely to yield inaccurate or incomplete retrieval results due to neglecting important semantic correlations. Recent Non-Decomposition (ND) methods utilize a Vision Language Model (VLM) to directly measure the similarity between textual query and video content. However, the accuracy is limited by the VLM’s immature capability to recognize fine-grained objects. To address the above challenges, we propose a video structure-aware Partial-Decomposition (PD) method. Its core idea is to partially decompose the 3W INS problem into three semantic-correlated 2W INS problems, i.e., person-action INS, action-location INS, and location-person INS. Thereafter, we respectively model the correlations between pairs of semantics at frames, shots and scenes of story videos. With the help of the spatial consistency and temporal continuity contained in the unique hierarchical structure of story videos, we can finally obtain identity-matching, logic-consistent, and content-coherent 3W INS results. To validate the effectiveness of the proposed method, we specifically build three large-scale 3W INS datasets based on three TV series, Eastenders, Friends and The Big Bang Theory, comprising in total over 670K video shots spanning 700 hours. Extensive experiments show that the proposed PD method surpasses the current state-of-the-art CD and ND methods for 3W INS in story videos.

Abstract:
Change detection (CD) is important for Earth observation, emergency response and time-series understanding. Recently, data availability in various modalities has increased rapidly, and multimodal change detection (MCD) is gaining prominence. Given the scarcity of datasets and labels for MCD, unsupervised approaches are more practical for MCD. However, previous methods typically either merely reduce the gap between multimodal data through transformation or feed the original multimodal data directly into the discriminant network for difference extraction. The former faces challenges in extracting precise difference features. The latter retains the pronounced intrinsic distinction between the original multimodal data; direct extraction and comparison of features usually introduce significant noise, thereby compromising the quality of the resultant difference image. In this article, we propose the MaCon framework to synergistically distill the common and discrepancy representations. The MaCon framework unifies the mask reconstruction (MR) and contrastive learning (CL) self-supervised paradigms, where MR serves the purpose of transformation while CL focuses on discrimination. Moreover, we present an optimal sampling strategy in the CL architecture, enabling the CL subnetwork to extract more distinguishable discrepancy representations. Furthermore, we develop an effective silent attention mechanism that not only enhances contrast in output representations but also stabilizes the training. Experimental results on both multimodal and monomodal datasets demonstrate that the MaCon framework effectively distills the intrinsic common representations between varied modalities and achieves state-of-the-art performance across both multimodal and monomodal CD. Such findings imply that MaCon has the potential to serve as a unified framework in CD and related fields. Source code will be publicly available once the article is accepted.

Abstract:
Underwater salient object detection (USOD) is an emerging research area that has great potential for various underwater visual tasks. However, USOD research is still in its early stage due to the lack of large-scale datasets within which salient objects are well-defined and pixel-wise annotated. To address this issue, this paper introduces a new dataset named USOD10K. It contains 10,255 underwater images, covering 70 categories of salient objects in 12 different underwater scenes. Moreover, the USOD10K provides salient object boundaries and depth maps of all images. The USOD10K is the first large-scale dataset in the USOD community, making a significant leap in diversity, complexity, and scalability. Secondly, a simple but strong baseline termed TC-USOD is proposed for the USOD10K. The TC-USOD adopts a hybrid architecture based on an encoder-decoder design that leverages transformer and convolution as the basic computational building block of the encoder and decoder, respectively. Thirdly, we make a comprehensive summarization of 35 state-of-the-art SOD/USOD methods and benchmark them on the existing USOD dataset and the USOD10K. The results show that our TC-USOD achieves superior performance on all datasets tested. Finally, several other use cases of the USOD10K are discussed, and future directions of USOD research are pointed out. This work will promote the development of the USOD research and facilitate further research on underwater visual tasks and visually-guided underwater robots. To pave the road in the USOD research field, the dataset, code, and benchmark results are publicly available: https://github.com/Underwater-Robotic-Lab/USOD10K.

Abstract:
We propose a complete system to enable progressive coding with quality scalability of the mesh geometry, in MPEG’s state-of-the-art Video-based Dynamic Mesh Coding (V-DMC) framework. In particular, we propose an alternative method for encoding the subdivision wavelet coefficients in V-DMC, using a zerotree coding approach that works directly in the native 3D mesh space. This allows us to identify parent-child relationships amongst the wavelet coefficients across different subdivision levels, which can be used to achieve an efficient and versatile coding mechanism. We demonstrate that, given a starting base mesh, a target subdivision surface and a desired maximum number of zerotree passes, our system produces an elegant and visually attractive lossy-to-lossless mesh geometry reconstruction with no further user intervention. Moreover, lossless coefficient encoding with our approach requires nearly the same bitrate as the default displacement coding methods in V-DMC. Yet, our approach provides several quality resolution levels embedded in the same bitstream, while the current V-DMC solutions encode a single quality level only. To the best of our knowledge, this is the first time that a zerotree-based method has been proposed and demonstrated to work for the compression of dynamic time-varying meshes, and the first time that an embedded quality-scalable approach has been used in the V-DMC framework.

Abstract:
Pre-trained text-image discriminative models, such as CLIP, have been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.

Abstract:
Semantic instance completion aims to recover the complete 3D shapes of foreground objects together with their labels from a partial 2.5D scan of a scene. Previous works have relied on full supervision, which requires ground-truth annotations, in the form of bounding boxes and complete 3D objects. This has greatly limited their real-world application because the acquisition of ground-truth data is very costly and time-consuming. To address this bottleneck, we propose a Weakly-Supervised Semantic Instance Completion Network (WSSIC-Net), which learns real-world partial point cloud object completion without requiring the ground truth of complete 3D objects. Instead, WSSIC-Net leverages 3D ground-truth bounding boxes, partial objects of a raw scene, and unpaired synthetic 3D point clouds. More specifically, a 3D detector is used to encode partial point clouds into proposal features, which are then fed into two branches. The first branch uses fully supervised box prediction based on proposal features. The second branch, hereinafter called instance completion, leverages the proposal features as partial object features to achieve weakly-supervised instance completion. A Generative Adversarial Network (GAN) completes the partial features of the 2.5D foreground objects of real-world scenes using only unpaired but semantically-consistent complete synthetic point clouds. In our experiments, we demonstrate that the fully-supervised 3D detection and the weakly-supervised instance completion complement one another. The qualitative and quantitative evaluations on the ScanNet v2 dataset demonstrate that the proposed “weakly-supervised” approach consistently achieves comparable performance to the state-of-the-art “fully supervised” methods.

Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China; Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information, Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Artificial Intelligence, Beijing Normal University, Beijing, China; School of Computer Science and Engineering, Nanyang Technological University, Jurong West, Singapore; School of Biological Science and Medical Engineering, Key Laboratory of Child Development and Learning Science of Ministry of Education, Nanjing, China; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Abstract:
Despite the impressive performance of current vision-based facial action unit (AU) detection approaches, they are heavily susceptible to the variations across different domains and the cross-domain AU detection methods are under-explored. In response to this challenge, we propose a decoupled doubly contrastive adaptation (D2CA) approach to learn a purified AU representation that is semantically aligned for the source and target domains. Specifically, we decompose latent representations into AU-relevant and AU-irrelevant components, with the objective of exclusively facilitating adaptation within the AU-relevant subspace. To achieve the feature decoupling, D2CA is trained to disentangle AU and domain factors by assessing the quality of synthesized faces in cross-domain scenarios when either AU or domain attributes are modified. To further strengthen feature decoupling, particularly in scenarios with limited AU data diversity, D2CA employs a doubly contrastive learning mechanism comprising image and feature-level contrastive learning to ensure the quality of synthesized faces and mitigate feature ambiguities. This new framework leads to an automatically learned, dedicated separation of AU-relevant and domain-relevant factors, and it enables intuitive, scale-specific control of the cross-domain facial image synthesis. Extensive experiments demonstrate the efficacy of D2CA in successfully decoupling AU and domain factors, yielding visually pleasing cross-domain synthesized facial images. Meanwhile, D2CA consistently outperforms state-of-the-art cross-domain AU detection approaches, achieving an average F1 score improvement of 6%-14% across various cross-domain scenarios.

Abstract:
Unsupervised person re-identification aims to retrieve a given pedestrian image from unlabeled data. For training on the unlabeled data, the method of clustering and assigning pseudo-labels has become mainstream, but the pseudo-labels themselves are noisy and will reduce the accuracy. To overcome this problem, several pseudo-label improvement methods have been proposed. But on the one hand, they only use target domain data for fine-tuning and do not make sufficient use of high-quality labeled data in the source domain. On the other hand, they ignore the critical fine-grained features of pedestrians and overfitting problems in the later training period. In this paper, we propose a novel unsupervised cross-domain person re-identification network (IDENet) based on an inter-domain equilibrium structure to improve the quality of pseudo-labels. Specifically, we make full use of both source domain and target domain information and construct a small learning network to equalize label allocation between the two domains. Based on it, we also develop a dynamic neural network with adaptive convolution kernels to generate adaptive residuals for adapting domain-agnostic deep fine-grained features. In addition, we design the network structure based on ordinary differential equations and embed modules to solve the problem of network overfitting. Extensive cross-domain experimental results on Market1501, PersonX, and MSMT17 prove that our proposed method outperforms the state-of-the-art methods.

Abstract:
Weakly supervised object localization (WSOL) learns to localize objects using only image-level labels. Recently, some studies have applied transformers to WSOL to capture the long-range feature dependency and alleviate the partial activation issue of CNN-based methods. However, existing transformer-based methods still face two challenges. The first challenge is the over-activation of backgrounds. Specifically, the object boundaries and background are often semantically similar, and localization models may misidentify the background as a part of objects. The second challenge is the incomplete activation of occluded objects, since the transformer architecture makes it difficult to capture local features across patches due to ignoring semantic and spatial coherence. To address these issues, in this paper, we propose LCA-MD, a novel transformer-based WSOL method using local cross-patch activation from multiple directions, which can capture more details of local features while inhibiting the background over-activation. In LCA-MD, first, combining contrastive learning with the transformer, we propose a token feature contrast module (TCM) that can maximize the difference between foregrounds and backgrounds and further separate them more accurately. Second, we propose a semantic-spatial fusion module (SFM), which leverages multi-directional perception to capture the local cross-patch features and diffuse activation across occlusions. Experimental results on the CUB-200-2011 and ILSVRC datasets demonstrate that our LCA-MD is significantly superior and achieves state-of-the-art results in WSOL. The project code is available at https://github.com/rjy-fighting/LCA-MD.

Abstract:
Cross-Domain Few-Shot Learning (CD-FSL) addresses the challenges of recognizing targets with out-of-domain data when only a few instances are available. Many current CD-FSL approaches primarily focus on enhancing the generalization capabilities of models in the spatial domain, neglecting the role of the frequency domain in domain generalization. To take advantage of the frequency domain in processing global information, we propose a Frequency-Spatial Complementation (FSC) model, which combines frequency domain information with spatial domain information to learn domain-invariant information from attacked data style. Specifically, we design a Frequency and Spatial Fusion (FusionFS) module to enhance the ability of the model to capture style-related information. Besides, we propose two attack strategies, i.e., the Gradient-guided Unified Style Attack (GUSA) strategy and the Channel-specific Attack Intensity Calculation (CAIC) strategy, which conduct targeted attacks on different channels to provide more diversified style data during the training phase, especially in single-source domain scenarios where the source domain data style is homogeneous. Extensive experiments across eight target domains demonstrate that our method significantly improves the model’s performance under various styles.

Abstract:
Tracking by natural language specification requires trackers to jointly perform grounding and tracking tasks. Existing methods either use separate models or a single shared network, failing to account for the link and diversity between tasks jointly. In this paper, we propose a novel framework that performs dynamic task switching to customize its network path routing for each task within a unified model. For this purpose, we design a task-switchable attention module, which enables the acquisition of modal relation patterns with different dominant modalities for each task via dynamic task switching. In addition, to alleviate the inconsistency between the static language description and the dynamic target appearance during tracking, we propose a language renovation mechanism that renovates the initial language online via visual-context-aware linguistic prompting. Extensive experimental results on five datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches for both grounding and tracking. Our project will be available at: https://github.com/mkg1204/SAKTrack.

Abstract:
Training Generative Adversarial Networks (GANs) with few-shot data has been a challenging task, which is prevalently solved by adapting a deep generative model pre-trained on large-scale data in a source domain to small target domains with limited training data. In practice, most of the existing methods focus on designing task-specific fine-tuning strategies or regularization terms to select and preserve compatible knowledge across the source and target domains. However, the compatible knowledge greatly depends on the target domain and is entangled with the incompatible knowledge. For the few-shot image generation task, without accurate compatible knowledge as a prior, the generated images will strongly overfit the scarce target images. From a different perspective, we propose a unified learning paradigm for better knowledge transfer, i.e., keep and extend (KAE). Specifically, we orthogonally decompose the latent space of GANs, where the resting direction that has an unnoticeable impact on the generated images is adopted to extend the new target latent subspace while the remaining directions are kept intact to reconstruct the source latent subspace. In this way, the whole source domain knowledge is included in the source latent subspace and the compatible knowledge will be automatically transferred to the target domain along the resting direction, rather than being manually selected. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method.

Abstract:
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich the feature space, and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS. The code is available at https://github.com/duan-song/SATNet

Abstract:
The well-known soft-focus effect, which relies on either special optical filters or manual post-production techniques, has been intriguing and powerful in photography for quite a while. Nonetheless, to the best of our knowledge, how to impose the soft-focus effect automatically using only image-processing (computational photography) algorithms has never been addressed in the literature. In this work, we make the first attempt to design an automatic, optical-filter-free approach to create the appropriate soft-focus effects desired by individual users. Our approach is first to investigate a physical optical filter, namely Kenko Black Mist No. 5, and estimate the corresponding kernel matrix (i.e., the system impulse response matrix) using our proposed novel irradiance-domain kernel-matrix estimation framework. Furthermore, we demonstrate that it is not feasible to find a kernel matrix that precisely characterizes the soft-focus effect by just using a pixel-value-domain image (a regular photo) in post-production. To combat the aforementioned problem, we establish a novel pixel-value-to-pseudo-irradiance map such that the pseudo irradiance-domain image can be obtained directly from any pixel-value-domain image. Finally, the soft-focus effect can be created from the two-dimensional convolution between the pseudo irradiance-domain image and the estimated kernel. To evaluate our proposed automatic scheme for the soft-focus effect, we compare the results from our proposed new scheme and the physical optical filter in terms of the DCT-KLD (Kullback-Leibler divergence of discrete cosine transform) and the conventional PSNR (peak signal-to-noise ratio). Experiments show that our proposed new scheme can achieve very small DCT-KLDs and very large PSNRs with respect to the ground truth, namely the results from the physical optical filter.
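
The pipeline described above (pixel values to pseudo-irradiance, two-dimensional convolution with a kernel, back to pixel values) can be sketched roughly as follows. This is only an illustration under stated assumptions: the power-law map with gamma 2.2 stands in for the paper's pixel-value-to-pseudo-irradiance map, and `HALO_KERNEL` is a hand-made placeholder, not the kernel estimated from the Kenko Black Mist No. 5 filter. All names are hypothetical.

```python
def to_pseudo_irradiance(pixel, gamma=2.2):
    """Map a pixel value in [0, 1] to a linear-light proxy (assumed power law)."""
    return pixel ** gamma


def to_pixel(irradiance, gamma=2.2):
    """Inverse of the assumed map, with clamping to [0, 1]."""
    return min(max(irradiance, 0.0), 1.0) ** (1.0 / gamma)


def convolve2d_same(image, kernel):
    """2D convolution with zero padding; the output has the size of the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Flipped indexing makes this a true convolution.
                    yy, xx = y + ch - i, x + cw - j
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Placeholder kernel: a sharp core plus a faint halo; the weights sum to 1.
HALO_KERNEL = [
    [0.02, 0.03, 0.02],
    [0.03, 0.80, 0.03],
    [0.02, 0.03, 0.02],
]


def soft_focus(image, kernel):
    """Apply the kernel in the pseudo-irradiance domain and map back to pixel values."""
    irradiance = [[to_pseudo_irradiance(p) for p in row] for row in image]
    blurred = convolve2d_same(irradiance, kernel)
    return [[to_pixel(v) for v in row] for row in blurred]
```

For example, `soft_focus(image, HALO_KERNEL)` takes a grayscale image given as a list of rows with values in [0, 1] and returns an image of the same size.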

Abstract:
3D keypoint detection is of great interest to researchers in computer vision and graphics because it is an integral part of realizing many tasks, such as object tracking, 3D reconstruction, and shape registration. However, it is challenging to detect 3D keypoints quickly and stably due to the ambiguity of the keypoints and the presence of noise, density changes, and geometric distortions in the 3D point cloud. This paper proposes a novel 3D keypoint detection method based on point cloud structural saliency (PCSS) to realize stable and efficient 3D keypoint detection. First, we propose an effective point cloud feature descriptor called local spatial geometric feature, which can effectively combine spatial and geometric information to improve feature distinguishability. Second, we define a point cloud structural saliency representation that effectively characterizes the structured information in the point cloud. Finally, we generate 3D keypoints based on point cloud structural saliency using a non-maximum suppression method. We evaluate our method on five 3D keypoint benchmark datasets, and the experimental results demonstrate that it achieves state-of-the-art performance in 3D keypoint detection. Comparing it with previous keypoint detection methods further demonstrates the effectiveness and superiority of our method.
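
The final step, selecting keypoints from a per-point saliency by non-maximum suppression, can be illustrated with a greedy radius-based variant. This is a minimal sketch under assumptions: the saliency values are arbitrary placeholders rather than the paper's structural saliency, and the function name and `radius` parameter are hypothetical.

```python
import math


def nms_keypoints(points, saliency, radius):
    """Greedy non-maximum suppression over a point cloud.

    Visits points from most to least salient and keeps a point only if no
    already kept point lies within `radius` of it. Returns kept indices.
    """
    order = sorted(range(len(points)), key=lambda i: saliency[i], reverse=True)
    kept = []
    for i in order:
        if all(math.dist(points[i], points[k]) > radius for k in kept):
            kept.append(i)
    return kept
```

A larger `radius` yields fewer, more widely spaced keypoints; this brute-force version compares every candidate against all kept points and is meant for illustration rather than for large point clouds.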

Abstract:
Semantic segmentation is an important branch of image processing and computer vision. With the popularity of deep learning, various convolutional neural networks have been proposed for pixel-level classification and segmentation tasks. In practical scenarios, however, imaging angles are often arbitrary, encompassing instances such as water body images from remote sensing and capillary and polyp images in the medical domain, where prior orientation information is typically unavailable to guide these networks to extract more effective features. In this case, learning features from objects with diverse orientation information poses a significant challenge, as the majority of CNN-based semantic segmentation networks lack rotation equivariance to resist the disturbance from orientation information. To address this challenge, this paper first constructs a universal convolution-group framework aimed at more fully utilizing orientation information and equipping the network with rotation equivariance. Subsequently, we mathematically design a padding-based rotation equivariant convolution mode (PreCM), which is not only applicable to multi-scale images and convolutional kernels but can also serve as a replacement component for various types of convolutions, such as dilated convolutions, transposed convolutions, and asymmetric convolutions. To quantitatively assess the impact of image rotation in semantic segmentation tasks, we also propose a new evaluation metric, Rotation Difference (RD). The replacement experiments on six existing semantic segmentation networks and three datasets (i.e., Satellite Images of Water Bodies, DRIVE, and Floodnet) show that the average Intersection Over Union (IOU) of their PreCM-based versions improves by 6.91%, 10.63%, 4.53%, 5.93%, 7.48%, and 8.33%, respectively, compared to their original versions in terms of random angle rotation, and the average RD values decrease by 3.58%, 4.56%, 3.47%, 3.66%, 3.47%, and 3.43%, respectively. The code can be downloaded from https://github.com/XinyuXu414

Abstract:
Exploring complementary information between RGB and thermal/depth modalities is crucial for bi-modal salient object detection (BSOD). However, the distinct characteristics of different modalities often lead to large differences in information distributions. Existing models, which rely on convolutional operations or plug-and-play attention mechanisms, struggle to address this issue. To overcome this challenge, we rethink the relationship between information complementarity and long-range relevance, and propose a uniform broad-view Twins Transformer Network (TwinsTNet) for accurate BSOD. Specifically, to efficiently fuse bi-modal information, we first design the Cross-Modal Federated Attention (CMFA), which mines complementary cues across modalities through element-wise global dependency. Second, to ensure accurate modality fusion, we propose the Semantic Consistency Attention Loss, which supervises the co-attention feature in CMFA using the ground-truth-generated attention map. Additionally, existing BSOD models lack the exploration of inter-layer interactions, for which we propose the Cross-Scale Retracing Attention (CSRA), which retrieves query-relevant information from stacked features of all previous layers, enabling flexible cross-layer interactions. The cooperation between CMFA and CSRA mitigates inductive bias in both modality and layer dimensions, enhancing TwinsTNet’s representational capability. Extensive experiments demonstrate that TwinsTNet outperforms twenty-two existing state-of-the-art models on ten BSOD benchmark datasets. The code is available at: https://github.com/JoshuaLPF/TwinsTNet.

Abstract:
Graph neural networks (GNNs) encounter challenges in establishing deep structures and managing a large number of parameters effectively to learn node features comprehensively. Consequently, in vision tasks, GNNs often struggle to achieve high classification accuracy compared to convolutional neural networks. Nonetheless, GNNs retain crucial advantages and potential, particularly in lightweight network scale and efficient, reliable decision-making. Thus, improving GNN performance in vision tasks remains a significant research endeavor, with numerous important works exploring the application of GNN models in such contexts, where the graph representation of images poses a key challenge. Existing methods often fall short in adaptively generating blocks of different sizes and their corresponding edges to form graph representations according to graph semantics. To address this issue, we propose a novel method to convert images into graphical forms using granular-ball computing. Our approach does not rely on manual annotation or other learning methods, yet it can dynamically generate block nodes of varying sizes and corresponding edges. Compared to other state-of-the-art methods, our approach better captures semantic information within the graph. Despite having fewer parameters, our method significantly enhances accuracy. Overall, our work holds substantial implications for improving the performance of graph neural networks in vision tasks.

Abstract:
Monocular 3D object detection has garnered significant attention for its outstanding cost effectiveness compared with multi-sensor systems. However, previous work mainly acquires object 3D properties in a heuristic way, with less emphasis on the cues between objects. Inspired by the mechanisms of monocular vision, we propose MoVis, an innovative 3D object detection framework that skillfully combines object hierarchy and color sequence cues. Specifically, a decoupled Spatial Relationship Encoder (SRE) is designed to effectively feed back the high-level encoding results with object hierarchical relationships to low-level features. This method not only effectively reduces the computational overhead of multi-scale coding, but also significantly improves the detection accuracy of occluded objects by incorporating the hierarchical relationship between objects into multi-scale features. Moreover, to obtain more precise object depth information, an Object-level Depth Modulator (ODM) based on the concept of conditional random fields is designed, which employs color sequences. Ultimately, the results of the SRE and ODM are efficiently fused by our Spatial Context Processor (SCP) to accurately perceive the 3D attributes of the objects. Extensive experiments on the KITTI and Rope3D benchmarks show that MoVis achieves state-of-the-art performance. Our MoVis represents a progressive approach that emulates how human monocular vision utilizes monocular cues to perceive 3D scenes.

Abstract:
Deep learning approaches for Image Aesthetics Assessment (IAA) have shown promising results in recent years, but the internal mechanisms of these models remain unclear. Previous studies have demonstrated that image aesthetics can be predicted using semantic features, such as pre-trained object classification features. However, these semantic features are learned implicitly, and therefore, previous works have not elucidated what the semantic features represent. In this work, we aim to create a more transparent deep learning framework for IAA by introducing explainable semantic features. To achieve this, we propose Tag-based Content Descriptors (TCDs), where each value in a TCD describes the relevance of an image to a human-readable tag that refers to a specific type of image content. This allows us to build IAA models from explicit descriptions of image contents. We first propose the explicit matching process to produce TCDs that adopt predefined tags to describe image contents. We show that a simple MLP-based IAA model with TCDs only based on predefined tags can achieve an SRCC of 0.767, which is comparable to most state-of-the-art methods. However, predefined tags may not be sufficient to describe all possible image contents that the model may encounter. Therefore, we further propose the implicit matching process to describe image contents that cannot be described by predefined tags. By integrating components obtained from the implicit matching process into TCDs, the IAA model further achieves an SRCC of 0.817, which significantly outperforms existing IAA methods. Both the explicit matching process and the implicit matching process are realized by the proposed TCD generator. To evaluate the performance of the proposed TCD generator in matching images with predefined tags, we also labeled 5101 images with photography-related tags to form a validation set. Experimental results show that the proposed TCD generator can meaningfully assign photography-related tags to images.

Abstract:
Image matching is a critical task in computer vision research, focusing on aligning two or more images with similar features. Feature detection and description constitute the core of image matching. Handcrafted detectors are capable of obtaining distinctive points, but these points may not be repeatable across image pairs, especially those with dramatic appearance changes. In contrast, learned detectors can extract a large number of repeatable points, but many of them tend to be ambiguous points with low distinctiveness. Moreover, in scenarios of dramatic appearance change, the contrastive or triplet losses commonly used to train descriptors employ a hard negative mining strategy, which may obtain overly challenging negative samples by global sampling, resulting in sluggish convergence or even overfitting. Those learned descriptors may not guarantee that the corresponding points enjoy larger similarities than unmatched ones, leading to inaccurate matches. To address those issues, we propose a hierarchically learned detector and descriptor (HLDD) for robust image matching, which contains three modules: a handcrafted-learned detector, a hierarchically learned descriptor, and a coarse-to-fine matching strategy. The handcrafted-learned detector integrates the advantages of handcrafted and learned detectors. It extracts distinctive feature points from a learned repeatability map robust to image changes and eliminates the ambiguous ones according to a learned distinctiveness map. The descriptor is trained by a proposed hierarchical triplet loss, which employs a dual window strategy. It can obtain the hardest negative samples in local windows, which are comparatively easier than those obtained by global sampling, ensuring the effective training of descriptors. The coarse-to-fine matching strategy performs global and local mutual nearest neighbor matching on the coarse and fine descriptor maps respectively to improve the matching accuracy progressively.
By comparing with other matching methods, experimental results demonstrate the superiority of the proposed method in the tasks of image matching, homography estimation, visual localization, and relative pose estimation. Moreover, ablation studies illustrate the effectiveness of the three proposed modules.
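As a rough illustration of the mutual nearest neighbor matching mentioned above, the sketch below keeps a pair only when each descriptor is the other's closest match. It is a generic, brute-force version in plain Python; the function name, the squared Euclidean distance, and the toy descriptors are assumptions for illustration, not the HLDD implementation.

```python
def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where desc_a[i] and desc_b[j] are each other's nearest neighbor."""

    def sq_dist(u, v):
        # Squared Euclidean distance (an assumed metric for this sketch).
        return sum((x - y) ** 2 for x, y in zip(u, v))

    if not desc_a or not desc_b:
        return []

    # Nearest neighbor in B for every descriptor in A.
    nn_ab = [min(range(len(desc_b)), key=lambda j: sq_dist(a, desc_b[j])) for a in desc_a]
    # Nearest neighbor in A for every descriptor in B.
    nn_ba = [min(range(len(desc_a)), key=lambda i: sq_dist(desc_a[i], b)) for b in desc_b]

    # Keep only pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]


# Toy descriptors: the last entry of desc_a competes with the first for desc_b[0] and loses.
desc_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.3, 0.0)]
desc_b = [(0.1, 0.0), (5.0, 5.1), (0.9, 1.2)]
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The mutual check discards one-sided matches such as `desc_a[3]`, whose nearest neighbor in `desc_b` prefers a different partner.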

Abstract:
Conventional unsupervised domain adaptation (UDA) requires access to source data and/or source model parameters, prohibiting its practical application in terms of privacy, security, and intellectual property. Recent black-box UDA (BDA) reduces such constraints by defining a pseudo label from a single encapsulated source application programming interface (API) prediction, which allows for self-training of the target model. Nonetheless, existing methods have limited consideration for multi-source settings, in which multiple source domain APIs are available to generate pseudo labels. In this work, we introduce a novel training framework for multi-source BDA (MSBDA), dubbed Label Space-Induced Pseudo Label Refinement (LPR). Specifically, LPR incorporates a Pseudo label Refinery Network (PRN) that learns the relationship among source domains conditioned on the target domain, using only the source APIs’ predictions. The target model is adapted by our dual-phase PRN. First, a warm-up phase aims to avoid failure due to noisy samples and to provide an initial pseudo-label; it is followed by a label refinement phase with domain relationship exploration. We provide theoretical support for the mechanism of the LPR. Experimental results on four benchmark datasets demonstrate that MSBDA using LPR achieves competitive performance compared to state-of-the-art approaches with different DA settings.

Abstract:
Semi-Supervised Few-Shot Learning (SSFSL) aims to address the data scarcity in few-shot learning by leveraging both a few labeled support data and abundant unlabeled data. In SSFSL, a classifier trained on scarce support data is often biased and thus assigns inaccurate pseudo-labels to the unlabeled data, which will mislead downstream learning tasks. To combat this issue, we introduce a novel method called Certainty-Aware Recursive Confidence Training (CARCT). CARCT hinges on the insight that selecting pseudo-labeled data based on confidence levels can yield more informative support data, which is crucial for retraining an unbiased classifier to achieve accurate pseudo-labeling, a process we term pseudo-labeling calibration. We observe that accurate pseudo-labels typically exhibit smaller certainty entropy, indicating higher-confidence pseudo-labeling than that of inaccurate pseudo-labels. Accordingly, CARCT constructs a joint double-Gaussian model to fit the certainty entropies collected across numerous SSFSL tasks. Thereby, a semi-supervised Prior Confidence Distribution (ssPCD) is learned to aid in distinguishing between high-confidence and low-confidence pseudo-labels. During an SSFSL task, ssPCD guides the selection of both high-confidence and low-confidence pseudo-labeled data to retrain the classifier that then assigns more accurate pseudo-labels to the low-confidence pseudo-labeled data. Such recursive confidence training continues until the low-confidence ones are exhausted, terminating the pseudo-labeling calibration. The unlabeled data all receive accurate pseudo-labels to expand the few support data to generalize the downstream learning task, which in return meta-refines the classifier, named self-training, to boost the pseudo-labeling in subsequent tasks.
Extensive experiments on basic and extended SSFSL setups showcase the superiority of CARCT versus state-of-the-art methods, and comprehensive ablation studies and visualizations justify our insight. The source code is available at https://github.com/Klein-JING/CARCT
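The observation that accurate pseudo-labels tend to have smaller certainty entropy can be illustrated with a small sketch. Here the entropy is taken as the Shannon entropy of the predicted class probabilities, and a fixed `threshold` stands in for the learned ssPCD; both the function names and the threshold are assumptions for illustration, not the CARCT code.

```python
import math


def certainty_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_confidence(predictions, threshold):
    """Split sample indices into high- and low-confidence groups by entropy.

    `threshold` is a hypothetical fixed cut-off; the abstract instead describes
    a learned prior distribution (ssPCD) for this decision.
    """
    high, low = [], []
    for idx, probs in enumerate(predictions):
        (high if certainty_entropy(probs) <= threshold else low).append(idx)
    return high, low


predictions = [
    [0.98, 0.01, 0.01],  # peaked prediction: low entropy
    [0.40, 0.30, 0.30],  # nearly uniform prediction: high entropy
]
high_conf, low_conf = split_by_confidence(predictions, threshold=0.5)
```

In a recursive scheme like the one described, the low-confidence group would be re-labeled after retraining and the split repeated until that group is empty.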

Abstract:
Image restoration involves recovering a clean image from its degraded counterpart. In recent years, we have witnessed a paradigm shift from convolutional neural networks to Transformers, which have quadratic complexity with respect to the input size. Instead of designing more complex modules based on recent techniques, this paper presents an efficient and effective mechanism for image restoration by exploring the potential of ubiquitous pooling techniques. We leverage different pooling operators as tools for implicit dual-domain representation learning. Specifically, the average and max pooling can be used as extractors for implicit low- and high-frequency signals, respectively. Then, we utilize lightweight learnable parameters to modulate the resulting frequency components. Furthermore, the intermediate high-frequency features can serve as attention maps to highlight the spatial edge information. Our pooling module is built by incorporating the aforementioned dual-domain modulation across multiple scales and various shapes. We demonstrate the effectiveness of our module in single-degradation, composite-degradation, and all-in-one image restoration tasks. Extensive experimental results show that the resulting network achieves state-of-the-art performance on 15 datasets for five single-degradation and two composite-degradation image restoration tasks by deploying our module. Moreover, our method can be extended to all-in-one scenarios and performs favorably against state-of-the-art all-in-one algorithms under two settings. The code is available at https://github.com/c-yn/PoolNet
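To make the pooling idea concrete, the sketch below applies average and max pooling to a 1D signal. Average pooling acts as a smoothing (low-frequency) estimate, while the gap between the max-pooled and average-pooled values is zero on flat regions and nonzero near edges. Using that gap as the high-frequency cue is an assumption made for this illustration; the paper's module may combine the operators differently.

```python
def pool1d(signal, size, reducer):
    """Sliding-window pooling with stride 1; edges are replicated so the length is preserved.

    Assumes an odd window `size`.
    """
    half = size // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [reducer(padded[i:i + size]) for i in range(len(signal))]


def mean(window):
    return sum(window) / len(window)


def split_frequencies(signal, size=3):
    """Return (low, high): average-pooled signal and the max-minus-average residual."""
    low = pool1d(signal, size, mean)   # average pooling: smooth, low-frequency estimate
    peak = pool1d(signal, size, max)   # max pooling: local extrema
    high = [p - l for p, l in zip(peak, low)]  # zero wherever the window is flat
    return low, high


flat_low, flat_high = split_frequencies([2, 2, 2, 2])
step_low, step_high = split_frequencies([0, 0, 0, 1, 1, 1])
```

On the step signal, `step_high` is nonzero only in the two windows that straddle the jump, which is the sense in which such a map can highlight edges.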

Abstract:
Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of “environmental robustness”, which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at https://github.com/guoqianyu-alberta/Conditional-Representation-Learning

Abstract:
Transformer-based trackers have achieved promising success and become the dominant tracking paradigm because of their accuracy and efficiency. Despite the substantial progress, most of the existing approaches handle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been largely overlooked, which hampers trackers’ ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer-based tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack uses a transformer encoder to perform feature interactions between the template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then, the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, which makes the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code is available at https://github.com/ManOfStory/UncTrack

Abstract:
High Dynamic Range (HDR) environment lighting is essential for augmented reality and visual editing applications, enabling realistic object relighting and seamless scene composition. However, the acquisition of accurate HDR environment maps remains resource-intensive, often requiring specialized devices such as light probes or 360° capture systems, and necessitating stitching during postprocessing. Existing deep learning-based methods attempt to estimate global illumination from partial-view images but often struggle with complex lighting conditions, particularly in indoor environments with diverse lighting variations. To address this challenge, we propose a novel method for estimating indoor HDR environment maps from single standard images, leveraging Anisotropic Spherical Gaussians (ASG) to model intricate lighting distributions as priors. Unlike traditional Spherical Gaussian (SG) representations, ASG can better capture anisotropic lighting properties, including complex shape, rotation, and spatial extent. Our approach introduces a transformer-based network with a two-stage training scheme to predict ASG parameters effectively. To leverage these predicted lighting priors for environment map generation, we introduce a novel generative projector that synthesizes environment maps with high-frequency textures. To train the generative projector, we propose a parameter-efficient adaptation method that transfers knowledge from SG-based guidance to ASG, enabling the model to preserve the generalizability of SG (e.g., spatial distribution and dominance of light sources) while enhancing its capacity to capture fine-grained anisotropic lighting characteristics. Experimental results demonstrate that our method yields environment maps with more precise lighting conditions and environment textures, facilitating the realistic rendering of lighting effects. The implementation code for ASG extraction can be found at https://github.com/junhong-jennifer-zhao/ASG-lighting

Abstract:
Continual Semantic Segmentation (CSS) primarily aims to continually learn new semantic segmentation categories while avoiding catastrophic forgetting. In semantic segmentation tasks, images can comprise both familiar old categories and novel unseen categories and they are treated as background in the incremental stage. Therefore, it is necessary to utilize the old model to generate pseudo-labels. However, the quality of these pseudo-labels significantly influences the model’s forgetting of the old categories. Erroneous pseudo-labels can introduce harmful gradients, thus exacerbating model forgetting. In addition, the issue of class imbalance poses a significant challenge within the realm of CSS. Although traditional methods frequently diminish the emphasis placed on new classes to address this imbalance, we discover that the imbalance extends beyond the distinction between old and new classes. In this paper, we specifically address two previously overlooked problems in CSS: the impact of erroneous pseudo-labels on model forgetting and the confusion induced by class imbalance. We propose an Uncertainty and Class Balance Re-weighting approach (UCB) that assigns higher weights to pixels with pseudo-labels exhibiting lower uncertainty and to categories with smaller proportions during the training process. Our proposed approach enhances the impact of essential pixels during the continual learning process, thereby reducing model forgetting and dynamically balancing category weights based on the dataset. Our method is simple yet effective and can be applied to any method that uses pseudo-labels. Extensive experiments on the Pascal-VOC and ADE20K datasets demonstrate the efficacy of our approach in improving model performance across three state-of-the-art methods. The code will be available at https://github.com/JACK-Chen-2019/UCB
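The re-weighting idea, higher weights for pseudo-labels with lower uncertainty and for categories with smaller proportions, can be sketched as follows. The specific formula (normalized entropy for uncertainty, inverse class frequency for balance, and their product) is an assumption chosen for illustration and is not taken from the UCB paper.

```python
import math
from collections import Counter


def pixel_weights(pseudo_probs, pseudo_labels):
    """Per-pixel weights: larger for low-entropy pseudo-labels and for rare classes.

    Assumes at least two classes so the maximum entropy is nonzero.
    """
    num_classes = len(pseudo_probs[0])
    max_entropy = math.log(num_classes)
    counts = Counter(pseudo_labels)
    total = len(pseudo_labels)

    weights = []
    for probs, label in zip(pseudo_probs, pseudo_labels):
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        certainty = 1.0 - entropy / max_entropy          # 1 = certain, 0 = uniform prediction
        rarity = total / (len(counts) * counts[label])   # inverse class frequency
        weights.append(certainty * rarity)
    return weights


# Four pixels, two classes; class 1 appears only once.
probs = [[0.9, 0.1], [0.6, 0.4], [0.9, 0.1], [0.1, 0.9]]
labels = [0, 0, 0, 1]
w = pixel_weights(probs, labels)
```

Within class 0, the confident pixel outweighs the uncertain one, and the equally confident pixel of the rare class 1 receives the largest weight.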

Abstract:
Existing underwater salient object detection (USOD) methods design fusion strategies to integrate multimodal information, but lack exploration of modal characteristics. To address this, we separately leverage the RGB and depth branches to learn disentangled representations, formulating the heterogeneous experts and hierarchical perception network (HEHP). Specifically, to reduce modal discrepancies, we propose the hierarchical prototype guided interaction (HPI), which achieves fine-grained alignment guided by the semantic prototypes, and then refines with complementary modalities. We further design the mixture of frequency experts (MoFE), where experts focus on modeling high- and low-frequency respectively, collaborating to explicitly obtain hierarchical representations. To efficiently integrate diverse spatial and frequency information, we formulate the four-way fusion experts (FFE), which dynamically selects optimal experts for fusion while being sensitive to scale and orientation. Since depth maps with poor quality inevitably introduce noises, we design the uncertainty injection (UI) to explore high uncertainty regions by establishing pixel-level probability distributions. We further formulate the holistic prototype contrastive (HPC) loss based on semantics and patches to learn compact and general representations across modalities and images. Finally, we employ varying supervision based on branch distinctions to implicitly construct difference modeling. Extensive experiments on two USOD datasets and four relevant underwater scene benchmarks validate the effect of the proposed method, surpassing state-of-the-art binary detection models. Impressive results on seven natural scene benchmarks further demonstrate the scalability.

Abstract:
Hyperspectral images (HSIs) offer great potential for computational pathology. However, limited by the lack of adequate annotated data and the high spectral redundancy of HSIs, traditional supervised learning techniques are usually bottlenecked. To exploit the structural properties of HSIs and learn representations with good transferability, we propose Separated Self-Supervised Spectral Regression (S4R). Concretely, we find one spectral band can be represented by a linear combination of the remaining bands. Regressing the distribution of the linear coefficients learns the inherent properties of HSIs and pathological information about the tissue. Besides, reconstructing the missing band, especially the tissue boundaries, makes the model learn pathology details that are critical to downstream tasks. Coupling these two pretext tasks makes the self-supervised model understand spectral structures of HSIs w.r.t. pathological semantics and spatial micro details. Furthermore, we design two brand-new architectures to avoid the interference of extraneous signals based on S4R: S4R-CLS and S4R-SEG for HSI classification and segmentation, respectively. Two downstream tasks are incorporated into a unified framework, which first encodes different bands from HSIs via a depthwise separable encoder, and then selectively aggregates band features to generate final predictions. In S4R-SEG, we propose to pick the best matching bands with the guidance of a classification paradigm. Extensive experiments show S4R performs much better than competitors on both tasks. Theoretical analysis and clinical discussion also indicate the great potential for further medical applications. The code and pre-trained checkpoints are available at https://github.com/DeepMed-Lab-ECNU/S4R

Abstract:
Recent advancements in virtual reality (VR) and augmented reality (AR) have popularised the emerging panoramic content for the immersive visual experience. The difficulty in acquisition and display of 360° format further highlights the necessity of unconditional panoramic image generation. Existing methods essentially generate planar images mapped from panoramic images, and fail to address the deformation and closed-loop characteristics when inverted back to the panoramic images, thus leading to the generation of pseudo-panoramic content. This paper aims to directly generate spherical content in a patch-by-patch style; besides being computationally friendly, this promises continuity everywhere on the panoramic image and proper accommodation of panoramic deformation. More specifically, we first propose a novel spherical patch convolution (SPConv) that operates on the local spherical patch, which naturally addresses the deformation of panoramic content. We then propose our spherical patch generative adversarial net (SP-GAN) that consists of spherical local embedding (SLE) and spherical content synthesiser (SCS) modules, which seamlessly incorporate our SPConv so as to generate continuous panoramic patches. To the best of our knowledge, the proposed SP-GAN is the first successful attempt to accommodate the spherical distortion for closed-loop panoramic image generation in a patch-by-patch manner. The experimental results, with human-rated evaluations, have verified the consistently superior performances for unconditional panoramic image generation, from the perspectives of generation quality, computational memory, and generalisation to various resolutions. Codes are publicly available at https://github.com/chronos123/SP-GAN

Abstract:
Salient Object Detection (SOD) aims to identify the most attention-grabbing regions in an image and focuses on distinguishing salient objects from their backgrounds. Current SOD methods primarily use a discriminative approach, which works well for clear images but struggles in complex scenes with similar colors and textures between objects and backgrounds. To address these limitations, we introduce the diffusion-based salient object detection model (DiffSOD), which leverages a noise-to-image denoising process within a diffusion framework, enhancing saliency detection in both RGB and RGB-D images. Unlike conventional fusion-based SOD methods that directly merge RGB and depth information, we treat RGB and depth as distinct conditions, i.e., the appearance condition and the structure condition, respectively. These conditions serve as controls within the diffusion UNet architecture, guiding the denoising process. To facilitate this guidance, we employ two specialized control adapters: the appearance control adapter and the structure control adapter. Moreover, conventional denoising UNet models may struggle when handling low-quality depth maps, potentially introducing detrimental cues into the denoising process. To mitigate the impact of low-quality depth maps, we introduce a quality-aware filter. This filter selectively processes only high-quality depth data, ensuring that the denoising process is based on reliable information. Comparative evaluations on benchmark datasets have shown that DiffSOD substantially surpasses existing RGB and RGB-D saliency detection methods, improving average performance by 1.5% and 1.2% respectively, thus setting a new benchmark for diffusion-based dense prediction models in visual saliency detection.

Abstract:
3D room layout estimation aims to reconstruct the holistic 3D structure from an indoor RGB image. For most of the deep learning-based methods, layout inference is guided by a kind of learned 2D mid-level representation such as pixel-wise surface labels. However, learning such a high-resolution 2D representation might suffer from information redundancy and high memory consumption, and will increase the runtime of estimation and the deployment cost for practical applications. In this paper, we attempt to learn a compact high-level representation with only 29 real numbers for estimating the 3D layout using general regression networks. The learned compact high-level representation contains three components: instance-wise plane parameters, camera intrinsic parameters, and plane location indicators. With the learned representation, the inverse depth map of each plane can be calculated to reconstruct the 3D layout. We further design a set of order-agnostic loss functions to restrict the produced inverse depth maps, with which the model can be trained with either weak 2D layout labels or full 3D layout supervision. Moreover, by jointly learning the plane parameters and locations, the model benefits from 3D reasoning. Experimental results show that our method is much faster than the existing layout estimation methods and obtains competitive performance on benchmark datasets, showing its potential for real-time applications.
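For readers wondering how an inverse depth map follows from plane and camera parameters, the standard pinhole-camera relation is sketched below. The symbols (unit normal n, plane offset d, intrinsic matrix K, homogeneous pixel p) are notation assumed here for illustration; the paper's own parametrization may differ.

```latex
% Plane in camera coordinates: n^T X = d, with unit normal n and offset d > 0.
% Back-projection of the homogeneous pixel p = (u, v, 1)^T at depth Z: X = Z K^{-1} p.
\begin{align}
  \mathbf{n}^{\top} \left( Z \, K^{-1} \mathbf{p} \right) &= d \\
  \frac{1}{Z} &= \frac{\mathbf{n}^{\top} K^{-1} \mathbf{p}}{d}
\end{align}
```

Under these assumptions the inverse depth of a plane is an affine function of the pixel coordinates (u, v), so a handful of plane and intrinsic parameters determines the whole per-plane map.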

Abstract:
Removing unwanted reflections from images is a fundamental yet challenging problem in low-level computer vision. Recent deep learning-based Single Image Reflection Removal (SIRR) methods have made significant progress. However, separating reflections from transmission content remains difficult, particularly in complex scenes where the two exhibit high visual similarity. Upon careful analysis, we find that reflections predominantly reside in the high-frequency components of an image. These reflections tend to distort fine details in the high-frequency range, while the low-frequency information remains relatively less affected. This observation motivates us to explore a frequency-aware approach for SIRR by leveraging the Discrete Wavelet Transform (DWT). The wavelet decomposition enables us to distinguish and isolate reflective artifacts in the frequency domain while preserving the transmission information. Building on this insight, we propose a novel Wavelet-guided Deep Unfolding Network (WDUNet) that leverages the strengths of wavelet decomposition and deep unfolding techniques to improve interpretability and generalization in SIRR. Specifically, we formulate an optimization-based reflection removal model using DWT and convolutional dictionaries. The proposed model is optimized via a proximal gradient algorithm and then unfolded into a neural network architecture, where all parameters are learned end-to-end during training. By combining wavelet domain analysis with deep unfolding, WDUNet enhances both the interpretability and generalization of SIRR methods. Additionally, we design and integrate the Low-frequency Parameter Estimation Module (LPEM) and the High-frequency Parameter Estimation Module (HPEM) into WDUNet, allowing the network to automatically learn and optimize the model’s hyperparameters.
Extensive experiments conducted on four benchmark datasets demonstrate that WDUNet consistently outperforms existing state-of-the-art methods in both objective evaluation metrics and subjective visual quality.
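The separation into low- and high-frequency components that the DWT provides can be illustrated with a one-level 2D Haar transform. The abstract does not say which wavelet is used, so the Haar choice, the function names, and the subband names below are assumptions for this sketch only.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform; expects even height and width."""
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)  # low-frequency approximation
            rows[1].append((a - b + d - e) / 2)  # left-minus-right detail
            rows[2].append((a + b - d - e) / 2)  # top-minus-bottom detail
            rows[3].append((a - b - d + e) / 2)  # diagonal detail
        for band, row in zip((approx, horiz, vert, diag), rows):
            band.append(row)
    return approx, horiz, vert, diag


def haar_idwt2(approx, horiz, vert, diag):
    """Inverse of haar_dwt2 (exact up to floating-point rounding)."""
    image = []
    for r in range(len(approx)):
        top, bottom = [], []
        for c in range(len(approx[0])):
            s, h, v, g = approx[r][c], horiz[r][c], vert[r][c], diag[r][c]
            top += [(s + h + v + g) / 2, (s - h + v - g) / 2]
            bottom += [(s + h - v - g) / 2, (s - h - v + g) / 2]
        image += [top, bottom]
    return image


img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(img)
restored = haar_idwt2(*bands)
flat_bands = haar_dwt2([[3, 3], [3, 3]])
```

A flat patch puts all of its energy in the approximation band and none in the detail bands, while the transform as a whole is invertible, so processing the high-frequency bands separately does not by itself lose the low-frequency content.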

Abstract:
Gait recognition, which aims at identifying individuals by their walking patterns, has achieved great success based on silhouette. The binary silhouette sequence encodes the walking pattern within the sparse boundary representation. Therefore, most pixels in the silhouette are under-sensitive to the walking pattern since the sparse boundary lacks dense spatial-temporal information, which is suitable to be represented with dense texture. To enhance the sensitivity to the walking pattern while maintaining the robustness of recognition, we present a Complementary Learning with neural Architecture SearcH (CLASH) framework, consisting of walking pattern sensitive gait descriptor named dense spatial-temporal field (DSTF) and neural architecture search based complementary learning (NCL). Specifically, DSTF transforms the representation from the sparse binary boundary into the dense distance-based texture, which is sensitive to the walking pattern at the pixel level. Further, NCL presents a task-specific search space for complementary learning, which mutually complements the sensitivity of DSTF and the robustness of the silhouette to represent the walking pattern effectively. Extensive experiments demonstrate the effectiveness of the proposed methods under both in-the-lab and in-the-wild scenarios. On CASIA-B, we achieve rank-1 accuracy of 98.8%, 96.5%, and 89.3% under three conditions. On OU-MVLP, we achieve rank-1 accuracy of 91.9%. Under the latest in-the-wild datasets, we outperform the latest silhouette-based methods by 16.3% and 19.7% on Gait3D and GREW, respectively.
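The contrast between a binary silhouette and a dense distance-based texture can be shown with a brute-force distance field: every pixel stores its distance to the nearest pixel of the opposite value, so interior pixels that look identical in the silhouette become distinguishable. This is a generic distance transform written for illustration; the exact definition of DSTF in the paper is not given in the abstract and may differ.

```python
import math


def distance_field(silhouette):
    """Distance from each pixel to the nearest pixel with the opposite binary value.

    Brute force, intended for small examples only. Pixels stay at 0.0 if the
    image contains a single value.
    """
    coords = [(r, c) for r, row in enumerate(silhouette) for c, _ in enumerate(row)]
    field = [[0.0] * len(row) for row in silhouette]
    for r, c in coords:
        value = silhouette[r][c]
        others = [(rr, cc) for rr, cc in coords if silhouette[rr][cc] != value]
        if others:
            field[r][c] = min(math.hypot(r - rr, c - cc) for rr, cc in others)
    return field


# 5x5 silhouette with a 3x3 foreground block in the center.
silhouette = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
field = distance_field(silhouette)
```

In the silhouette, the center pixel and its foreground neighbors all equal 1; in the field they take different values depending on how far they lie from the boundary.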

Abstract:
Unsupervised domain adaptation is mainly focused on the tasks of transferring knowledge from a fully-labeled source domain to an unlabeled target domain. However, in some scenarios, the labeled data are expensive to collect, which causes an insufficient label issue in the source domain. To tackle this issue, some works have focused on few-shot unsupervised domain adaptation (FUDA), which transfers predictive models to an unlabeled target domain through a source domain that only contains a few labeled samples. Yet the relationship between labeled and unlabeled source domains is not well exploited in generating pseudo-labels. Additionally, the few-shot setting further hinders the transfer tasks, as an excessive domain gap is introduced between the source and target domains. To address these issues, we propose an adaptive dispersal and collaborative clustering (ADCC) method for FUDA. Specifically, to address the shortage of labeled source data, a collaborative clustering algorithm is constructed that expands the labeled source data to obtain more distribution information. Furthermore, to alleviate the negative impact of domain-irrelevant information, we construct an adaptive dispersal strategy that introduces an intermediate domain and pushes both the source and target domains to this intermediate domain. Extensive experiments on the Office31, Office-Home, miniDomainNet, and VisDA-2017 datasets showcase the superior performance of ADCC compared to the state-of-the-art FUDA methods.

Affiliations: School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an, Shaanxi, China; Institute of Artificial Intelligence (TeleAI), China Telecom, Shanghai, China; State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, CAS, Xi’an, Shaanxi, China; Marine Optical Technology Laboratory, Xi’an Institute of Optics and Precision Mechanics, CAS, Xi’an, Shaanxi, China; Key Laboratory of Spectral Imaging Technology, Xi’an Institute of Optics and Precision Mechanics, CAS, Xi’an, Shaanxi, China

Abstract:
In this paper, we introduce StreakNet-Arch, a real-time, end-to-end binary-classification framework based on our self-developed Underwater Carrier LiDAR-Radar (UCLR) that embeds Self-Attention and our novel Double Branch Cross Attention (DBC-Attention) to enhance scatter suppression. Under controlled water tank validation conditions, StreakNet-Arch with Self-Attention or DBC-Attention outperforms traditional bandpass filtering and achieves higher F_1 scores than learning-based MP networks and CNNs at comparable model size and complexity. Real-time benchmarks on an NVIDIA RTX 3060 show a constant Average Imaging Time (54 to 84 ms) regardless of frame count, versus a linear increase (58 to 1,257 ms) for conventional methods. To facilitate further research, we contribute a publicly available streak-tube camera image dataset containing 2,695,168 real-world underwater 3D point cloud data. More importantly, we validate our UCLR system in a South China Sea trial, reaching an error of 46 mm for a 3D target at 1,000 m depth and 20 m range. Source code and data are available at https://github.com/BestAnHongjun/StreakNet

Abstract:
Deep learning-based palmprint recognition methods take performance to the next level. However, most current methods rely on samples with clean labels. Noisy labels are difficult to avoid in practical applications and may affect the reliability of models, which poses a big challenge. In this paper, we propose a novel Multi-stage Noisy Label Selection and Correction (MNLSC) framework to address this issue. Three stages are proposed to improve the robustness of palmprint recognition. Clean simple samples are first selected based on self-supervised learning. A Fourier-based module is constructed to select clean hard samples. A prototype-based module is further introduced for selecting noisy labels from the remaining samples and correcting them. Finally, the model is trained by using clean and corrected labels to improve the performance. Experiments are conducted on several constrained and unconstrained palmprint databases. The results demonstrate the superiority of our method over other methods in dealing with different noise rates. Compared with the baseline method, the accuracy can be improved by up to 33.45% when there are 60% noisy labels.

Abstract:
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. To address this challenging task, existing leading methods all resort to density map regression, which renders them impractical for downstream tasks that require object locations and restricts their ability to well explore the scale information of exemplars for supervision. Meanwhile, they generally model the interaction between the input image and the exemplars in an exemplar-by-exemplar way, which is inefficient and may not fully synthesize information from all exemplars. To address these limitations, we propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size. Specifically, during the query stage, rich discriminative representations of the target class are acquired by the Hierarchical Exemplars Collaborative Enhancement (HECE) module from the few exemplars through multi-scale exemplar cooperation with equifrequent size prompt embedding. These representations are then fed into the Exemplars-Unified Query Correlation (EUQC) module to interact with the query features in a unified manner and produce the correlated query tensor. In the localization stage, the Scale-aware Multi-head Localization (SAML) module utilizes the query tensor to predict the confidence, location, and size of each potential object. Moreover, a scale-aware localization loss is introduced, which exploits flexible location associations and exemplar scales for supervision to optimize the model performance. 
Extensive experiments demonstrate that SQLNet outperforms state-of-the-art methods on popular CAC benchmarks, achieving excellent performance not only in counting accuracy but also in localization and bounding box generation.

Abstract:
Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on seven benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo

Abstract:
Implicit degradation modeling-based blind super-resolution (SR) has attracted increasing attention in the community due to its excellent generalization to complex degradation scenarios and wide application range. How to extract more discriminative degradation representations and fully adapt them to specific image features is the key to this task. In this paper, we propose a new Content-decoupled Contrastive Learning-based blind image super-resolution (CdCL) framework following the typical blind SR pipeline. This framework introduces a negative-free contrastive learning technique for the first time to model the implicit degradation representation, in which a new cyclic shift sampling strategy is designed to ensure decoupling between content features and degradation features from the data perspective, thereby improving the purity and discriminability of the learned implicit degradation space. In addition, we propose a detail-aware implicit degradation adapting module that can better adapt degradation representations to specific LR features by enhancing the basic adaptation unit’s perception of image details, significantly reducing the overall SR model complexity. Extensive experiments on synthetic and real data show that our method achieves highly competitive quantitative and qualitative results in various degradation settings while markedly reducing parameters and computational costs, validating the feasibility of designing practical and lightweight blind SR tools. Codes and models will be available at https://github.com/Fieldhunter/CdCL.

Abstract:
Data augmentation (DA) is widely employed to improve the generalization performance of deep models. However, most existing DA methods employ augmentation operations with fixed or random magnitudes throughout the training process. While this fosters data diversity, it can also inevitably introduce uncontrolled variability in augmented data, which could potentially cause misalignment with the evolving training status of the target models. Both theoretical and empirical findings suggest that this misalignment increases the risks of both underfitting and overfitting. To address these limitations, we propose AdaAugment, an innovative and tuning-free adaptive augmentation method that leverages reinforcement learning to dynamically and adaptively adjust augmentation magnitudes for individual training samples based on real-time feedback from the target network. Specifically, AdaAugment features a dual-model architecture consisting of a policy network and a target network, which are jointly optimized to adapt augmentation magnitudes in accordance with the model’s training progress effectively. The policy network optimizes the variability within the augmented data, while the target network utilizes the adaptively augmented samples for training. These two networks are jointly optimized and mutually reinforce each other. Extensive experiments across benchmark datasets and deep architectures demonstrate that AdaAugment consistently outperforms other state-of-the-art DA methods in effectiveness while maintaining remarkable efficiency. Code is available at https://github.com/Jackbrocp/AdaAugment.

Abstract:
Learned lossless compression methods for volumetric biomedical images have achieved significant performance improvements compared with traditional ones. However, they often perform poorly when applied to unseen domains due to domain gap issues. To address this problem, we propose a multi-source domain generalization method to handle two main sources of domain gap issues: modality and structure differences. To address modality differences, we develop an adaptive modality transfer (AMT) module, which predicts a set of modality-specific parameters from the original image and embeds them into the bit stream. These parameters control the weights of a mixture of experts to create a dynamic convolution, which is then used for entropy coding to facilitate modality transfer. To address structure differences, we design an adaptive structure transfer (AST) module, which decomposes the high dynamic range biomedical images into least significant bits (LSB) and most significant bits (MSB) in the wavelet domain. The MSB information, which is unique to the test image, is then used to predict an additional set of dynamic convolutions to enable structure transfer. Experimental results show that our approach reduces performance degradation caused by the domain gap to within 3% across various volumetric biomedical modalities. This paves the way for practical end-to-end biomedical image compression.
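
As a rough illustration of the MSB/LSB decomposition mentioned in this abstract, the sketch below splits non-negative integer samples into high-order and low-order bits and merges them back losslessly. It is only a minimal sketch: it operates directly on integer values rather than on wavelet coefficients, and the function names and the 8-bit split point (`lsb_bits=8`) are assumptions made for illustration, not details taken from the paper.

```python
def split_bits(value, lsb_bits=8):
    """Split a non-negative integer into (msb, lsb) parts.

    lsb_bits is a hypothetical split point chosen for illustration.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    msb = value >> lsb_bits               # high-order bits
    lsb = value & ((1 << lsb_bits) - 1)   # low-order bits
    return msb, lsb


def merge_bits(msb, lsb, lsb_bits=8):
    """Inverse of split_bits: rebuild the original integer."""
    return (msb << lsb_bits) | lsb


# A 16-bit sample, as might occur in a high dynamic range scan.
sample = 0x1234
msb, lsb = split_bits(sample)
print(msb, lsb, merge_bits(msb, lsb) == sample)
```

Because the split is exactly invertible, the two parts can be coded separately without losing information, which is the property a lossless codec needs.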

Abstract:
Near-Infrared (NIR) hyperspectral imaging opens up numerous possibilities for wide applications. Despite Compressive Spectral Imaging (CSI) being a promising technique, which enables the acquisition of three-dimensional (3D) spatio-spectral information from dynamic scenes, applying it to the NIR spectrum remains challenging. The bottleneck lies in the high cost and limited resolution of InGaAs Focal Plane Arrays (FPAs), which further degrade the high-frequency information of the compressed measurements. Here we demonstrate a novel Effective Prior Image-guided Spectral imager, termed EpiSpec, towards high-resolution spectral imaging in the NIR. Our key observation is that the tail response of low-cost silicon-based sensors tends to capture a similar image, offering high spatial resolution guidance for retrieving details. Hence, the degraded measurement of the hyperspectral scene, guided by the prior image, is capable of yielding high-quality reconstructions. Since the prior image integrates only a partial spectrum of the target scene, introducing content-aware chromatic errors, we propose the Prior Image Guided Deep Unfolding Framework (PIUF) for high-fidelity spectral reconstruction. This framework implicitly models the underlying non-linear relationship between the degraded measurements and the Non-Panchromatic (NPA) prior image. We also introduce a new NIR Spectral Images Dataset (NISID), which features a broad selection of real-world NIR spectral scenes of interest. Based on this dataset, we evaluate the sparse structure of such spectra, which can serve as a guide for the design of efficient CSI sensing matrices. Extensive evaluations on representative CSI systems demonstrate the effectiveness of the proposed EpiSpec framework. Subsequently, lab prototypes are built for real-world imaging validation, further supporting the viability of high-resolution spectral imaging in the NIR.

Abstract:
With the rapid proliferation of digital image content and advancements in image editing technologies, the protection of digital image authorship has become an increasingly important issue. Traditional methods for authorship protection include registering authorship through a certification organization, utilizing image metadata such as Exchangeable Image File Format (EXIF) data, and employing watermarking techniques to prove ownership. In recent years, blockchain-based technologies have also been introduced to enhance authorship protection further. However, these approaches face challenges in balancing four key attributes: strong legal validity, high security, low cost, and high usability. Authorship registration is often cumbersome, EXIF metadata can be easily extracted and tampered with, watermarking techniques are vulnerable to various forms of attack, and blockchain technology is complex to implement and requires long-term maintenance. In response to these challenges, this paper introduces a new framework, Hard EXIF, designed to balance these multiple attributes while delivering improved performance. The proposed method integrates metadata with physically unclonable functions (PUFs) for the first time, creating unique device fingerprints and embedding them into images using watermarking techniques. By leveraging the security and simplicity of hash functions and PUFs, this method enhances EXIF security while minimizing costs. Experimental results demonstrate that the Hard EXIF framework achieves an average peak signal-to-noise ratio (PSNR) of 42.89 dB, with a similarity of 99.46% between the original and watermarked images, and the extraction error rate is only 0.0017. These results show that the Hard EXIF framework balances legal validity, security, cost, and usability, promising authorship protection with great potential for wider application.
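
For readers unfamiliar with the PSNR figure quoted in this abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), applied to a cover image and its watermarked version. The tiny 3x3 pixel grids and the 8-bit peak value of 255 are assumptions chosen for illustration; they are not data from the paper.

```python
import math


def psnr(original, watermarked, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as lists of rows of pixel intensities.
    """
    flat_a = [p for row in original for p in row]
    flat_b = [p for row in watermarked for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit image and a slightly perturbed copy.
cover = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
marked = [[53, 55, 60], [59, 80, 61], [76, 61, 59]]
print(round(psnr(cover, marked), 2))
```

Higher values mean the watermarked image is closer to the original; identical images give an infinite PSNR.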

Abstract:
Glass largely blurs the boundary between the real world and its reflection. Its special transmittance and reflectance properties confuse semantic tasks in machine vision. Therefore, clarifying the boundary created by glass, and avoiding the over-capture of features as false-positive information in deep structures, is important for constraining the segmentation of reflective surfaces and see-through glass. We propose the Fourier Boundary Features Network with Wider Catchers (FBWC), which might represent the first attempt to utilize sufficiently wide horizontal shallow branches without vertical deepening to guide fine-grained segmentation boundaries using primary glass semantic information. Specifically, we design the Wider Coarse-Catchers (WCC) for anchoring large-area segmentation and reducing excessive extraction from a structural perspective. We embed fine-grained features by Cross Transpose Attention (CTA), which is introduced to avoid the incomplete area within the boundary caused by reflection noise. To excavate glass features and balance the context of high and low layers, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method is validated on three different public glass segmentation datasets. Experimental results reveal that the proposed method yields better segmentation performance compared with the state-of-the-art (SOTA) methods in glass image segmentation.

Abstract:
Multi-modal object Re-ID aims to leverage the complementary information provided by multiple modalities to overcome challenging conditions and achieve high-quality object matching. However, existing multi-modal methods typically rely on various modality interaction modules for information fusion, which can reduce the efficiency of real-time monitoring systems. Additionally, practical challenges such as low-quality multi-modal data or missing modalities further complicate the application of object Re-ID. To address these issues, we propose the Complementary Data Enhancement and Modal-Aware Soft Alignment Network (DESANet), which is designed to be independent of interactive networks and adaptable to scenarios with missing modalities. This approach yields simple yet effective and efficient multi-modal object Re-ID. DESANet consists of three key components. First, the Dual-Color Space Data Enhancement (DCDE) module enhances multi-modal data by performing patch rotation in the RGB space and improving image quality in the HSV space. Second, the Salient Feature ReConstruction (SFRC) module addresses the issue of missing modalities by reconstructing features from one modality using the other two. Third, the Modal-Aware Soft Alignment (MASA) module integrates multi-source data to avoid the blind fusion of features and prevents the propagation of noise from reconstructed modalities. Our approach achieves state-of-the-art performance on both person and vehicle datasets. Source code is available at https://github.com/DWJ11/DESANet

Abstract:
Generalized zero-shot learning (GZSL) aims at training a model that can generalize to unseen class data by only using auxiliary information. One of the main challenges in GZSL is a biased model prediction toward seen classes caused by overfitting on only available seen class data during training. To overcome this issue, we propose a two-stream autoencoder-based gating model for GZSL. Our gating model predicts whether the query data is from seen classes or unseen classes, and utilizes separate seen and unseen experts to predict the class independently from each other. This framework avoids comparing the biased prediction scores for seen classes with the prediction scores for unseen classes. In particular, we measure the distance between visual and attribute representations in the latent space and the cross-reconstruction space of the autoencoder. These distances are utilized as complementary features to characterize unseen classes at different levels of data abstraction. Also, the two-stream autoencoder works as a unified framework for the gating model and the unseen expert, which makes the proposed method computationally efficient. We validate our proposed method on four benchmark image recognition datasets. In comparison with other state-of-the-art methods, we achieve the best harmonic mean accuracy on SUN and AWA2, and the second best on CUB and AWA1. Furthermore, our base model requires at least 20% fewer model parameters than state-of-the-art methods relying on generative models.
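
The harmonic mean accuracy reported in this abstract is commonly computed in GZSL as H = 2SU / (S + U), where S and U are the accuracies on seen and unseen classes. The sketch below shows that standard formula; the function name and the example accuracies are assumptions for illustration, not values from the paper.

```python
def harmonic_mean_accuracy(seen_acc, unseen_acc):
    """Harmonic mean H = 2*S*U / (S + U) of seen and unseen accuracy."""
    if seen_acc + unseen_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# A model biased toward seen classes scores poorly on H even though
# its arithmetic mean looks acceptable (hypothetical numbers).
biased = harmonic_mean_accuracy(0.80, 0.20)
balanced = harmonic_mean_accuracy(0.50, 0.50)
print(biased, balanced)
```

Both hypothetical models have the same arithmetic mean, but the harmonic mean penalises the one that neglects unseen classes, which is why it suits the bias problem described above.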

Abstract:
Nighttime handheld photography is often simultaneously affected by low light and blur degradations due to object motion and camera shake. Previous methods typically design specific modules to restore the degradations in the spatial domain independently. However, the interdependence of low light and blur degradations in the spatial domain makes it difficult for these approaches to effectively decouple the degradations, limiting the performance of the designed modules. In this paper, we observe that in the Fourier domain, low light and blur degradations can be represented independently in the amplitude and phase of the image. Through an in-depth analysis of the underlying physical degradation process, we discover that low light degradation exhibits distinct characteristics across different frequency bands in amplitude, while blur degradation is characterized by phase correlation. Leveraging these insights, we mathematically derive a frequency attention mechanism and a filtering mechanism for learning decoupled representations of these degradations, proposing a Fourier-based Decoupling Network for joint low-light image enhancement and deblurring. Experimental results demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets and exhibits significantly sharper edges. Code is available at https://github.com/Jabruson/FDN-TIP2025
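
The amplitude/phase split referred to in this abstract can be illustrated with two textbook properties of the discrete Fourier transform: uniformly dimming an image rescales the amplitude and leaves the phase untouched, while circularly shifting it changes the phase and leaves the amplitude untouched. The sketch below checks both on a tiny grid. It is a deliberate simplification, not the proposed network: uniform dimming and a one-pixel circular shift are assumed stand-ins for low light and motion, and the degradation model analysed in the paper is richer.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a small image (list of rows)."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            out_row.append(total)
        spectrum.append(out_row)
    return spectrum


def flatten(spectrum):
    return [c for row in spectrum for c in row]


# Hypothetical 4x4 image, a dimmed copy, and a circularly shifted copy.
image = [[10, 20, 30, 40], [15, 25, 35, 45], [5, 60, 20, 80], [70, 10, 90, 30]]
dim = [[0.25 * p for p in row] for row in image]
shifted = [row[-1:] + row[:-1] for row in image]  # shift right by one column

spec, spec_dim, spec_shift = flatten(dft2(image)), flatten(dft2(dim)), flatten(dft2(shifted))

# Dimming: amplitude scales by 0.25, unit phasors (phase) are unchanged.
dimming_scales_amplitude = all(
    abs(abs(d) - 0.25 * abs(s)) < 1e-9 for s, d in zip(spec, spec_dim))
dimming_keeps_phase = all(
    abs(d / abs(d) - s / abs(s)) < 1e-9
    for s, d in zip(spec, spec_dim) if abs(s) > 1e-9)

# Circular shift: amplitude is unchanged, only the phase moves.
shift_keeps_amplitude = all(
    abs(abs(t) - abs(s)) < 1e-9 for s, t in zip(spec, spec_shift))

print(dimming_scales_amplitude, dimming_keeps_phase, shift_keeps_amplitude)
```

Phase is compared through unit phasors rather than raw angles so that the check is not disturbed by wrap-around at ±π.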

Abstract:
Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary and thus lack holistic consideration. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing. Indeed, the coherence of video summaries is crucial to improve the quality and user viewing experience. However, the coherence between video segments is hard to measure and optimize from a pure vision perspective. To this end, we propose a Language-guided Segment Coherence-Aware Network (LS-CAN), which integrates overall coherence considerations into key segment recognition. The main idea of LS-CAN is to exploit the coherence of the corresponding text modality to improve the overall coherence of the video summary, leveraging the natural property of language that contextual coherence is easy to measure. In terms of text coherence measures, specifically, we propose the multi-graph correlated neural network module (MGCNN), which constructs a graph for each sentence based on three key components, i.e., subject, attribute, and action words. For each sentence pair, the node features are then discriminatively learned by incorporating neighbors of its own graph and information of its dual graph, reducing the error of synonyms or reference relationships in measuring the correlation between sentences, as well as the error caused by considering each component separately. In doing so, MGCNN utilizes subject agreement, attribute coherence, and action succession to measure text coherence. Besides, with the help of large language models, we augment the original text coherence annotations, improving the ability of MGCNN to judge coherence. 
Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, especially improving the latest records by +3.8%, +14.2% and +12% w.r.t. F1 scores, τ and ρ metrics on the BLiSS dataset.
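
Assuming that τ and ρ here denote Kendall's τ and Spearman's ρ, the rank correlation coefficients commonly used to compare predicted importance scores with human annotations in video summarization, the sketch below shows how both can be computed. The abstract itself does not spell out the definitions, so this reading is an assumption, as are the example scores. The implementation handles only the tie-free case.

```python
def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(a, b):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical per-segment importance scores.
predicted = [0.9, 0.1, 0.5, 0.7]
human = [0.8, 0.2, 0.4, 0.9]
print(kendall_tau(predicted, human), spearman_rho(predicted, human))
```

Both coefficients range from -1 (reversed ranking) to 1 (identical ranking), so they reward a summarizer for ordering segments the way annotators do rather than for matching exact score values.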

Abstract:
In recent years, deep learning has shown immense promise in advancing medical hyperspectral imaging diagnostics at the microscopic level. Despite this progress, most existing research models remain constrained to single-task or single-scene applications, lacking robust collaborative interpretation of microscopic hyperspectral features and spatial information, thereby failing to fully explore the clinical value of hyperspectral data. In this paper, we propose a microscopic hyperspectral universal feature perception framework (UFPF), which extracts high-quality spatial-spectral features of hyperspectral data, providing a robust feature foundation for downstream tasks. Specifically, this innovative framework captures different sequential spatial nearest-neighbor relationships through a hierarchical corner-to-center mamba structure. It incorporates the concept of “progressive focus towards the center”, starting by emphasizing edge information and gradually refining attention from the edges towards the center. This approach effectively integrates richer spatial-spectral information, boosting the model’s feature extraction capability. On this basis, a dual-path spatial-spectral joint perception module is developed to achieve the complementarity of spatial and spectral information and fully explore the potential patterns in the data. In addition, a Mamba-attention Mix-alignment is designed to enhance the optimized alignment of deep semantic features. The experimental results on multiple datasets have shown that this framework significantly improves classification and segmentation performance, supporting the clinical application of medical hyperspectral data. The code is available at: https://github.com/Qugeryolo/UFPF

Abstract:
Model customization mitigates the issues of inadequate performance, resource wastage, and privacy risks associated with using general-purpose models in specialized domains and well-defined tasks. However, achieving customization at a low annotation cost still poses a challenge. Existing domain adaptation research has addressed cases where all customized classes are present in the labeled database, yet scenarios involving customer-specific classes are still unresolved. Therefore, this paper proposes a novel Class-Customized Domain Adaptation (CCDA) method, addressing the latter scenario with just one additional annotation for each customer-specific class. CCDA adopts the classic adaptation training framework and comprises two innovative techniques. First, to ensure the shared class knowledge from the database and the private class knowledge from additional annotations are transferred and propagated to the correct regions within the target domain, we design the partial-feature alignment strategy, based on the mechanical properties of feature alignment. Second, we propose soft-balanced sampling to tackle the long-tail distribution problem in labeled data, preventing the model from overfitting to the labeled samples of customer-specific classes. The effectiveness of CCDA has been validated across 48 tasks simulated on domain adaptation benchmarks and two real-world customization scenarios, consistently showing excellent performance. Additionally, extensive analytical experiments illustrate the contributions of the two innovative techniques. The code is available at https://github.com/CHEN-kx/ClassCustomizedDA

Abstract:
Cross-resolution person re-identification (CR-ReID) aims to match low-resolution (LR) and high-resolution (HR) images of the same individual. To reduce the cost of manual annotation, existing unsupervised CR-ReID methods typically rely on cross-resolution fusion to obtain pseudo-labels and resolution-invariant features. However, the fusion process requires two encoders and a fusion module, which significantly increases computational complexity and reduces efficiency. To address this issue, we propose a robust labeling and invariance modeling (RLIM) framework, which utilizes a single encoder to tackle the unsupervised CR-ReID problem. To obtain pseudo-labels robust to resolution gaps, we develop cross-resolution robust labeling (CRL), which utilizes two clustering criteria to encourage cross-resolution positive pairs to cluster together and exploit the reliable relationships between images. We also introduce random texture augmentation (TexA) to enhance the model’s robustness to noisy textures related to artifacts and backgrounds by randomly adjusting texture strength. During the optimization process, we introduce the resolution-cluster consistency loss, which promotes resolution-invariant feature learning by aligning inter-resolution distances with intra-cluster distances. Experimental results on multiple datasets demonstrate that RLIM not only surpasses existing unsupervised methods, but also achieves performance close to some supervised CR-ReID methods. Code is available at https://github.com/zqpang/RLIM

Abstract:
CNNs have demonstrated superior performance in medical image segmentation. To overcome the limitation of only using a local receptive field, previous work has attempted to integrate Transformers into convolutional network components such as encoders, decoders, or skip connections. However, these methods can only establish long-distance dependencies for some specific patterns and usually neglect the loss of fine-grained details during downsampling in multi-scale feature extraction. To address these issues, we present a novel hybrid Transformer network called FocalTransNet. Specifically, we construct a focal-enhanced (FE) Transformer module by introducing dense cross-connections into a CNN-Transformer dual-path structure and deploy the FE Transformer throughout the entire encoder. Different from existing hybrid networks that employ embedding or stacking strategies, the proposed model allows for a comprehensive extraction and deep fusion of both local and global features at different scales. Besides, we propose a symmetric patch merging (SPM) module for downsampling, which can retain the fine-grained details by establishing a specific information compensation mechanism. We evaluated the proposed method on four different medical image segmentation benchmarks. The proposed method outperforms previous state-of-the-art convolutional networks, Transformers, and hybrid networks. The code for FocalTransNet is publicly available at https://github.com/nemanjajoe/FocalTransNet

Abstract:
Phase unwrapping is a critical step in fringe projection profilometry, essential for achieving accurate and efficient three-dimensional (3D) imaging. Temporal phase unwrapping is the most widely used approach for improving robustness and reconstruction quality. Unfortunately, due to abrupt phase discontinuities at boundaries, misalignment between the wrapped phases, and unreliable shadow regions, fringe order errors may occur. To address these challenges, this study presents a generalized multi-feature-guided progressive order correction algorithm (GMP-OCA) for high-quality 3D imaging. The algorithm integrates global coarse detection, incremental line-wise optimization, and regional precision scanning to progressively correct fringe orders. Static and dynamic experimental results demonstrate that GMP-OCA effectively eliminates the systematic errors inherent in various phase unwrapping methods, producing high-quality 3D imaging results.

Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; Wangxuan Institute of Computer Technology, State Key Laboratory of Multimedia Information Processing, Peking University, Beijing, China

Abstract:
Knowledge is an abstraction of factual principles of the physical world. Large foundation models encapsulate extensive multimodal knowledge into their parameters and thus invoke machine intelligence on various tasks. How to invoke the knowledge in these models to facilitate image compression has not been explored in depth. In this work, we aim to harness multimodal knowledge for ultra-low bitrate compression and propose Multimodal Knowledge-aware Image Compression (MKIC). Our key insight is that in the context of ultra-low bitrate compression, where the encoded representation is too sparse to represent enough information of the input signal, knowledge from the physical world must be incorporated into the compression. Thus, more shared patterns can be stored in the model, while sparse unique features are embedded into the bitstream. In light of two kinds of knowledge, namely natural visual knowledge and human language knowledge, we propose a novel Alternating Rate-Distortion Optimization to enhance the accuracy and compactness of global semantic text representation extraction, extract the local feature map that captures visual details, and integrate these multimodal representations into a large generative foundation model to achieve high-quality reconstruction. The proposed method relights the path of learned image coding, leveraging decoupled knowledge from large foundation models. Extensive experiments show that our proposed method achieves superior comprehensive performance compared to various methods and shows great potential for ultra-low bitrate image compression.

Abstract:
Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.

Abstract:
Recently, cross-domain few-shot facial expression recognition (CF-FER), which identifies novel compound expressions with a few images in the target domain by using the model trained only on basic expressions in the source domain, has attracted increasing attention. Generally, existing CF-FER methods leverage the multi-dataset to increase the diversity of the source domain and alleviate the discrepancy between the source and target domains. However, these methods learn feature embeddings in the Euclidean space without considering imbalanced expression categories and imbalanced sample difficulty in the multi-dataset. This makes it difficult for the model to capture hierarchical relationships of facial expressions, resulting in inferior transferable representations. To address these issues, we propose a hyperbolic self-paced multi-expert network (HSM-Net), which contains multiple mixture-of-experts (MoE) layers located in the hyperbolic space, for CF-FER. Specifically, HSM-Net collaboratively trains multiple experts in a self-distillation manner, where each expert focuses on learning a subset of expression categories from the multi-dataset. Based on this, we introduce a hyperbolic self-paced learning (HSL) strategy that exploits sample difficulty to adaptively train the model from easy to hard samples, greatly reducing the influence of imbalanced expression categories and imbalanced sample difficulty. Our HSM-Net can effectively model rich hierarchical relationships of facial expressions and obtain a highly transferable feature space. Extensive experiments on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method over several state-of-the-art methods. Code will be released at https://github.com/cxtjl/HSM-Net

Abstract:
Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen categories. However, existing methods often struggle with the persistent semantic gap caused by limited semantic descriptors and rigid visual feature modeling. In particular, modeling pre-defined class-level attribute descriptions as ground truth hinders effective semantic-to-visual alignment to some extent. To mitigate these issues, we propose the Bilateral-guided Prototype Refinement Network (BPRN), a novel ZSL framework designed to refine dual prototypes across meta-domains of varying scales. Specifically, we first disentangle the relationships among class-level semantics and use them to generate corresponding pseudo-visual prototypes. Then, by leveraging distribution information across dual prototypes in different meta-domains, BPRN achieves bidirectional calibration between visual-to-semantic and semantic-to-visual modalities. Finally, a synthesized class-level representation derived from the refined dual prototypes is employed for inference, instead of relying on a single prototype. Extensive experiments conducted on five widely-used ZSL benchmark datasets demonstrate that BPRN consistently achieves competitive or even superior performance. Specifically, in the GZSL scenario, BPRN shows improvements of 2.1%, 7.3%, 6.1%, and 4.8% on AWA1, AWA2, SUN, and aPY, respectively, compared to existing embedding-based ZSL methods. Ablation studies and visualization analyses further validate the effectiveness of the proposed components.

Abstract:
Orthogonal Moment-based Robust Reversible Watermarking (OM-RRW) is crucial for intellectual property protection, providing the dual benefits of robustness and reversibility. However, OM-RRW embeds watermarks into visually sensitive global low-frequency features, which easily leads to ring-like distortions that expose watermark locations, making them vulnerable to removal through image inpainting. To address this issue, this paper makes the first attempt to introduce an innovative strategy to eliminate these visible distortions, thereby overcoming OM-RRW’s inherent limitations. The strategy innovates on two fronts: first, it customizes varying embedding step sizes based on the stability differences of moment values to minimize distortion; second, it designs a texture-aware adaptive basis function fine-tuning strategy. This strategy adjusts the representation capability of the basis functions in different regions based on the human eye’s sensitivity to various texture areas, helping to avoid visible ring-like distortions. The performance of the proposed method is evaluated using Polar Harmonic Transform (PHT) moments, comprising three moments that exhibit remarkable performance in existing OM-RRW methods. Extensive experiments show that the proposed method can embed 128-bit watermarks with no visible distortions while minimizing the loss of robustness. In addition, this paper finds that OM-RRW demonstrates satisfactory robustness against VAE watermark removal attacks.

Abstract:
Infrared-visible image fusion methods aim to generate fused images with good visual quality and to facilitate the performance of high-level tasks. Existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation, termed MAFS. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights for the two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/
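The max-min fairness allocation principle mentioned above can be illustrated with the generic progressive-filling procedure. This is only a sketch of the general principle, not the dynamic factor used in MAFS: the abstract does not say how task losses are mapped to weights, so the function name `max_min_fair_share`, the "demands", and the "capacity" below are all hypothetical.

```python
def max_min_fair_share(demands, capacity):
    """Generic max-min fair allocation via progressive filling.

    Hypothetical illustration: smaller demands are fully satisfied first,
    and whatever capacity remains is split evenly among the rest.
    """
    allocation = [0.0] * len(demands)
    remaining = float(capacity)
    # Visit participants from the smallest demand to the largest.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    active = len(order)
    for i in order:
        fair_share = remaining / active      # equal split of what is left
        give = min(demands[i], fair_share)   # never give more than requested
        allocation[i] = give
        remaining -= give
        active -= 1
    return allocation


# Two "tasks" with assumed demands 2 and 8 sharing a budget of 6.
shares = max_min_fair_share([2.0, 8.0], 6.0)
total = sum(shares)
weights = [s / total for s in shares]  # normalised to sum to 1
```

In this toy setting the smaller demand is met in full and the larger one receives the remainder, which is the defining property of a max-min fair allocation.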

Abstract:
Incremental multilingual text recognition (IMLTR) aims to advance continual learning by retaining knowledge from previously learned languages while adapting to new ones. Existing methods typically perform under a constrained assumption that each text instance originates from a specific single-language domain. However, this assumption is inaccurate in multilingual scenarios, as it overlooks the inherent cross-lingual knowledge, i.e., the incremental sharing problem. To address this issue, we propose a novel self-supervised cross-lingual knowledge discovery framework, CrossKnow, tailored for IMLTR tasks. Specifically, an innovative shared knowledge discovery strategy is developed to identify potential shared knowledge by leveraging prediction consistency across multiple recognizers, thus eliminating the reliance on language labels of all characters. Building upon this shared knowledge, we further design a multi-granularity, multi-task language domain discriminator to capture dependency relationships among incremental languages, which could adequately guide the hierarchical sequence decoding. By mining shared knowledge, CrossKnow can not only mitigate the forgetting of old knowledge but also efficiently achieve cross-lingual knowledge transfer, thereby promoting the continual learning of incremental multilingual text recognition models. Experiments on two widely used datasets, MLT17 and MLT19, demonstrate the superiority of CrossKnow. Compared to methods that leverage additional language supervision of characters, CrossKnow achieves competitive performance while eliminating storage overhead and improving computation efficiency.

Abstract:
Palmprint recognition offers a promising solution for convenient and private authentication. However, the scarcity of large-scale palmprint datasets constrains its development and application. Recent approaches have sought to mitigate this issue by synthesizing palmprints based on Bézier curves. Due to the lack of paired data between curves and palmprints, it is difficult to generate curve-driven palmprints with precise identity. To address this challenge, we propose a novel Pixel and Feature Identity Guidance (PFIG) framework to synthesize realistic palmprints, whose IDs are strictly governed by the Bézier curves. In order to establish ID mapping, an ID Injection (IDI) module is constructed to synthesize pseudo-paired data. Two cross-domain ID consistency losses at pixel and feature levels are further proposed to strictly preserve the semantic information of the input ID curves. Experimental results demonstrate that our ID-guided approach can synthesize more realistic palmprints with controllable identities. Based on only 80,000 synthesized palmprints for pre-training, the recognition accuracy can be improved by more than 18% in terms of TAR@1e-6. When trained exclusively on synthetic data, our method achieves superior performance to existing synthetic approaches. The source code is available at https://github.com/YuchenZou/PFIG-Palm

Abstract:
Image-to-image translation has achieved great success, but still faces the significant challenge of limited paired data, particularly in translating Synthetic Aperture Radar (SAR) images to optical images. Furthermore, most existing semi-supervised methods place limited emphasis on leveraging the data distribution. To address those challenges, we propose a Semi-Supervised SAR-to-Optical Image Translation (S3OIL) method that achieves high-quality image generation using minimal paired data and extensive unpaired data while strategically exploiting the data distribution. To this end, we first introduce a Cross-Set Alignment Matching (CAM) mechanism to create local correspondences between the generated results of paired and unpaired data, ensuring cross-set consistency. In addition, for unpaired data, we apply weak and strong perturbations and establish intra-set Multi-Scale Matching (MSM) constraints. For paired data, intra-modal semantic consistency (ISC) is presented to ensure alignment with the ground truth. Finally, we propose local and global cross-modal semantic consistency (CSC) to boost structural identity during translation. We conduct extensive experiments on SAR-to-optical datasets and another sketch-to-anime task, demonstrating that S3OIL delivers competitive performance compared to state-of-the-art unsupervised, supervised, and semi-supervised methods, both quantitatively and qualitatively. Ablation studies further reveal that S3OIL can ensure the preservation of both semantic content and structural integrity of the generated images. Our code is available at: https://github.com/XduShi/SOIL

Abstract:
Window-based Transformers have demonstrated outstanding performance in super-resolution due to their adaptive modeling capabilities through local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of the Transformers is derived from their adaptive spatial aggregation and advanced structural design, while their high latency results from the computational costs and memory layout transformations. To address these limitations and simulate the aggregation approach, we propose an efficient convolution-based Focal Separable Attention (FSA) mechanism that enables long-range dynamic modeling with linear computational complexity. Additionally, we introduce a dual-branch structure integrated with an ultra-lightweight Information Exchange Module (IEM) to enhance information aggregation within the token mixing process. Finally, we modify the existing spatial-gate-based feedforward neural networks by incorporating a self-gate mechanism to preserve high-dimensional channel information, thereby enabling the modeling of more complex relationships. This modification is referred to as the Dual-Gated Feed-Forward Network (DGFN). With these advancements, we construct a convolution-based Transformer framework named the Linear Adaptive Mixer Network (LAMNet). Extensive experiments demonstrate that LAMNet performs better than existing Transformer-based methods while maintaining the computational efficiency of convolutional neural networks, achieving a 3× speedup in inference time. The code will be publicly available at: https://github.com/zononhzy/LAMNet

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely retrieve specific moments from a video, which require fine-grained spatial and temporal understanding of a video. To overcome this, we propose the Caption Assisted MLLM from Coarse to finE (CALCE), a novel two-stage framework designed for enhanced moment retrieval. Our pipeline begins with a first stage where captions extracted from the audio are utilized to assist the MLLM to provide a robust foundation for precise moment retrieval. To efficiently manage memory consumption from this additional data, a clustering algorithm is applied to the sparsely sampled video frames, categorizing them into key frames and non-key frames. The second stage focuses on recalling missed moments and achieving more fine-grained moment boundaries by adopting a higher sampling rate. In this process, predictions from the first stage cast votes for their correlated densely sampled frames, thereby filtering out less relevant frames. By repeating the process of the first stage with these selected frames, CALCE progressively retrieves video moments from coarse to precise. Experiments on QVHighlights and Charades-STA demonstrate the effectiveness of CALCE, which outperforms existing state-of-the-art methods. The code is available at https://github.com/tjhd1475/CALCE

Abstract:
Perspective-n-point (PnP) is a fundamental problem in multi-view geometry, yet two critical challenges persist: 1) High outlier rates and near-degenerate cases have a substantial impact on the robustness of existing PnP methods. In the worst case, where both issues are present, existing methods tend to either produce erroneous results or become computationally prohibitive. 2) Conventionally, the hypothetical pose with the maximum inlier set is assumed to be correct. However, it remains unclear whether this assumption holds when the outlier rate approaches ultra-high levels and, along this line, what maximum amount of outliers can be robustly handled. To address these challenges, this paper proposes a novel Hough voting based 2-point RANSAC solution. To our knowledge, it is the first PnP solution capable of accurately and efficiently handling high outlier rates in near-degenerate cases. Extensive empirical evaluations have been conducted using the proposed approach, with a particular focus on a systematic examination under ultra-high outlier rates. The results show that, on random synthetic data, our approach works robustly even when dealing with up to 99% outliers. Meanwhile, on real-world datasets, the maximum inlier-set assumption often fails when the outlier rate exceeds 97%, as incorrect hypothetical poses may yield more inliers than the ground truth. Our dataset and source code are to be made available at https://github.com/xuchi7/RPnP_plusplus

Affiliations: Sanhang Science and Technology Building, Shenzhen Research Institute of Northwestern Polytechnical University, Nanshan, Shenzhen, China; School of Mathematics and Statistics and the Key Laboratory of Intelligent Networks and Network Security, Ministry of Education, Xi’an Jiaotong University, Xi’an, Shaanxi, China; School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China; School of Electronic and Information Engineering and the Key Laboratory of Intelligent Networks and Network Security, Ministry of Education, Xi’an Jiaotong University, Xi’an, China

Abstract:
Remote sensing image restoration, which aims to reconstruct corrupted or missing regions, heavily relies on low-rank models. A recent trend in this field is to jointly model low-rank and local smoothness priors using a single regularization term, in order to better recover fine textures. However, due to the entanglement of low- and high-frequency components in an image, existing methods often struggle to simultaneously capture both coarse-grained structures and fine-grained textures, while also suffering from high computational complexity. To address these issues, this paper proposes a novel regularization, the Haar Nuclear Norm (HNN), for efficient and effective remote sensing image restoration. HNN transforms images into wavelet coefficients that separate low-frequency (coarse-grained) and high-frequency (fine-grained) components, and enforces low-rankness via nuclear norms on the mode-3 unfolding matrices of these wavelet coefficients. Experimental evaluations conducted on hyperspectral image inpainting, multi-temporal image cloud removal, and hyperspectral image denoising have revealed the HNN’s potential. Typically, HNN achieves a performance improvement of 1-4 dB and a speedup of 10-28x compared to some state-of-the-art methods (e.g., tensor correlated total variation, and fully-connected tensor network) for inpainting tasks. The code is available at https://github.com/isyuchang/HNN.

Abstract:
In real-world scenarios, peculiar remote sensing categories are difficult to collect on account of high cost and technical requirements. Moreover, a domain distribution gap exists among different datasets. Existing methods leverage inter-class and intra-class relations to enhance feature representation. Since remote sensing images are captured from an overhead view, there is little difference between classes. Thus, such distance constraints only form decision boundaries between different classes. This paper proposes a triplet relation-aware metric for cross-domain few-shot remote sensing object classification, where the triplet relation-aware metric adjusts the distances among three kinds of inter-instance relations (i.e., same-instance, same-class, and different-class relations) to obtain a precise and effective feature representation. In particular, the distance of the same instance is regarded as a distance coordinate origin to guide distance metric learning. In this way, we constitute richer feature relations to promote representation learning in the source domain. Concretely, this procedure is optimized under the supervision of the designed relation-aware soft label based on the distance coordinate origin. Then, we align the triplet relation-aware metric between the source domain and a pseudo domain generated by the proposed episode style adversarial attack, thereby obtaining a domain-invariant feature representation. Extensive experiments on five widely-used remote sensing datasets demonstrate the superior performance of the proposed method compared with the state of the art. Code is available at: https://github.com/jackhdpbl/TRAM

Abstract:
In surveillance environments, detecting anomalies requires understanding the contextual dynamics of the environment, human behaviors, and movements within a scene. Effective anomaly detection must address both the where and what of events, but existing approaches such as unimodal action-based methods or LLM-integrated multimodal frameworks have limitations. These methods either rely on implicit scene information, making it difficult to localize where anomalies occur, or fail to adapt to surveillance specific challenges such as view changes, subtle actions, low light conditions, and crowded scenes. As a result, these challenges hinder accurate detection of what occurs. To overcome these limitations, our system takes advantage of features from a lightweight scene classification model to discern where an event occurs, acquiring explicit location-based context. To identify what events occur, it focuses on atomic actions, which remain underexplored in this field and are better suited to interpreting intricate abnormal behaviors than conventional abstract action features. To achieve robust anomaly detection, the proposed Temporal-Semantic Relationship Network (TSRN) models spatio-temporal relationships among multimodal features and employs a Segment-selective Focal Margin loss (SFML) to effectively address class imbalance, outperforming conventional MIL-based methods. Experimental results on public datasets demonstrate that the proposed system effectively reduces false alarms while maintaining robustness and practicality for real-world surveillance applications.

Affiliations: College of Communication Engineering, Jilin University, Changchun, China; Department of Electrical Engineering, University of Science and Technology Bannu, Bannu, Pakistan; School of Electrical and Electronic Engineering, Nanyang Technological University, Jurong West, Singapore; School of Engineering, Westlake University, Hangzhou, Zhejiang, China; Department of Computer and Information Science, State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
Recent advancements have suggested that neural radiance fields (NeRFs) show great potential in 3D style transfer. However, most existing NeRF-based style transfer methods still face considerable challenges in generating stylized images that simultaneously preserve clear scene textures and maintain strong cross-view consistency. To address these limitations, in this paper, we propose a novel transformer-guided approach for 3D scene style transfer. Specifically, we first design a transformer-based style transfer network to capture long-range dependencies and generate 2D stylized images with initial consistency, which serve as supervision for the 3D stylized generation. To enable fine-grained control over style, we propose a latent style vector as a conditional feature and design a style network that projects this style information into the 3D space. We further develop a merge network that integrates style features with scene geometry to render 3D stylized images that are both visually coherent and stylistically consistent. In addition, we propose a texture consistency loss to preserve scene structure and enhance texture fidelity across views. Extensive quantitative and qualitative experimental results demonstrate that our proposed approach outperforms many state-of-the-art methods in terms of visual perception, image quality and multi-view consistency. Our code and more results are available at: https://github.com/PaiDii/TGTC-Style.git

Abstract:
With the rapid development of Generative Adversarial Networks (GANs), facial sketch generation and age transformation have advanced considerably. These technologies show great potential in digital media, entertainment, and forensic applications, particularly in helping law enforcement reconstruct the appearance of long-term fugitives. However, current methodologies exhibit notable limitations: existing approaches typically specialize in either facial sketch generation or age progression independently, lacking an effective integration for cross-domain synthesis. Moreover, preserving identity information while ensuring high-quality image generation remains a challenge. This paper proposes Multi-Scale Feature Extraction Networks (MSFE), an image-to-image translation framework that enables continuous age transformation while maintaining the stylistic characteristics of sketch domains. The core of the MSFE network uses a Dual Conditional Normalization Attention (DCNA) architecture to extract sketch features and encode facial images into the latent space of a pre-trained StyleGAN based on the desired age change. Experimental results on public datasets demonstrate that our approach outperforms existing methods, achieving superior facial photo-sketch synthesis with enhanced realism, identity preservation, and age accuracy.

Abstract:
Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic (e.g., collision risks) and visual saliency. Unlike SOD in natural scene images (NSI-SOD), which prioritizes visually distinctive regions, TSOD emphasizes the objects that demand immediate driver attention due to their semantic impact, even with low visual contrast. This dual criterion, i.e., bridging perception and contextual risk, re-defines saliency for autonomous and assisted driving systems. To address the lack of task-specific benchmarks, we collect the first large-scale TSOD dataset with pixel-wise saliency annotations, named TSOD10K. TSOD10K covers the diverse object categories in various real-world traffic scenes under various challenging weather/illumination variations (e.g., fog, snowstorms, low-contrast, and low-light). Methodologically, we propose a Mamba-based TSOD model, termed Tramba. Considering the challenge of distinguishing inconspicuous visual information from complex traffic backgrounds, Tramba introduces a novel Dual-Frequency Visual State Space module equipped with shifted window partitioning and dilated scanning to enhance the perception of fine details and global structure by hierarchically decomposing high/low-frequency components. To emphasize critical regions in traffic scenes, we propose a traffic-oriented Helix 2D-Selective-Scan (Helix-SS2D) mechanism that injects driving attention priors while effectively capturing global multi-direction spatial dependencies. We establish a comprehensive benchmark by evaluating Tramba and 25 existing NSI-SOD models on TSOD10K, demonstrating Tramba’s superiority. Our research establishes the first foundation for safety-aware saliency analysis in intelligent transportation systems. The dataset and code will be made publicly available at https://github.com/mj129/Tramba.

Abstract:
In the realm of skeleton-based human action recognition, traditional methods that rely on coarse body keypoints fall short of capturing subtle human actions. In this work, we propose Expressive Keypoints, which incorporate hand and foot details to form a fine-grained skeletal representation and improve the ability of existing models to discern intricate human actions. However, the increased computational cost from processing nearly three times more joints becomes a new challenge. To address this, we present the Progressive Skeleton Evolution strategy, which significantly improves efficiency while preserving the benefits of fine-grained keypoints. The core idea involves utilizing learnable mapping matrices, semantically initialized to progressively downsample keypoints and prioritize prominent joints by allocating importance weights. Additionally, a plug-and-play Instance Pooling module is exploited to extend our approach to multi-person scenarios without a surge in computation cost. Extensive experimental results over seven datasets demonstrate the superiority of our method compared to the state of the art for skeleton-based human action recognition. Code has been made available at https://github.com/YijieYang23/PSE-GCN

Abstract:
Semi-supervised object detection (SSOD) aims to solve the data annotation challenge in object detection and has achieved remarkable progress in natural scenes; however, it remains unexplored in horizontal bounding box (HBB)-based remote sensing imagery, where annotation tasks pose greater challenges. In remote sensing scenarios, objects exhibit arbitrary orientations, small scales, and dense distributions, leading to pseudoboxes with fuzzy boundaries and class imbalance issues. Therefore, we propose UNCertainty quantification (UNC) for SSOD in remote sensing images. UNC uses uncertainty to guide the network from both regression and classification perspectives. Semantic alignment SAM calibration (SASC) uses pseudoboxes as box prompts for the input of the segment anything model (SAM), achieving more precise boundaries. Subsequently, boundaries with lower regression uncertainty are selected as the final pseudoboxes, ensuring better alignment between the pseudoboxes and the ground truth. Dynamic uncertainty weighting (DUW) calculates class uncertainty and determines its correlation with the availability of instances per class. High uncertainty implies limited availability of instances, necessitating greater emphasis on instances of that class. Furthermore, we set a percentage uncertainty threshold to avoid overemphasis caused by individual classes. Extensive experiments conducted on the DIOR and DOTA HBB-based datasets demonstrate the effectiveness of our method in leveraging unlabeled image information. Specifically, compared with the supervised baseline method, the UNC method improves mAP by 12.4% and 8.6% when 5% and 10% of the DIOR data are labeled, respectively.

Abstract:
Prompt tuning achieves superior performance across a wide range of tasks, including multi-label zero-shot classification. Existing approaches employ multiple prompts to acquire comprehensive knowledge from categories, demonstrating state-of-the-art performance and significant computational efficiency. However, two main challenges still exist in these methods that impede the full potential of generalization. First, the class imbalance is not carefully addressed. Despite some efforts to adopt re-weighted loss functions to alleviate the positive-negative imbalance, such strategies tend to exacerbate the class imbalance by over-suppression of labels with fewer samples and overfitting to dominant classes. Second, the multi-prompt methods neglect the interactions between prompts during parameter optimization, underestimating the potential of prompts and leading to suboptimal performance. To address these issues, we present a novel framework named Dynamic Regulation in Prompt Tuning (DAR-Prompt). DAR-Prompt introduces three dynamic components: semantic regulator and debiased regulator to address the class imbalance, along with contrastive gradient regularization to enhance feature separation through prompt interactions during the backward pass. Specifically, the semantic regulator generates class-adaptive thresholds to compensate for tail classes and mitigate over-suppression, while the debiased regulator focuses on learning biased classes by rectifying overconfident predictions. Moreover, we apply dynamic regularization to the gradient update directions of prompts to promote orthogonality, thereby enhancing feature distinctiveness. Extensive experiments on several benchmarks show that our method can achieve state-of-the-art performance, well demonstrating its effectiveness and superiority. Code is available at https://github.com/Evelyn1ywliang/DAR-Prompt.

Abstract:
Synthetic aperture radar (SAR) image simulation has attracted much attention due to its great potential to supplement the scarce training data for deep learning algorithms. Consequently, evaluating the quality of the simulated SAR image is crucial for practical applications. The current literature primarily uses image quality assessment (IQA) techniques for evaluation that rely on human observers’ perceptions. However, because of the unique imaging mechanism of SAR, these techniques may produce evaluation results that are not entirely valid. The distribution inconsistency between real and simulated data is the main obstacle that influences the utility of simulated SAR images. To this end, we propose a novel trustworthy utility evaluation framework with a counterfactual explanation for simulated SAR images for the first time, denoted as X-Fake. It unifies a probabilistic evaluator and a causal explainer to achieve a trustworthy utility assessment. We construct the evaluator using a probabilistic Bayesian deep model to learn the posterior distribution, conditioned on real data. Quantitatively, the predicted uncertainty of simulated data can reflect the distribution discrepancy. We build the causal explainer with an introspective variational auto-encoder (IntroVAE) to generate high-resolution counterfactuals. The latent code of IntroVAE is finally optimized with evaluation indicators and prior information to generate the counterfactual explanation, thus revealing the inauthentic details of simulated data explicitly. The proposed framework is validated on four simulated SAR image datasets obtained from electromagnetic models and generative artificial intelligence approaches. The results demonstrate the proposed X-Fake framework outperforms other IQA methods in terms of utility. Furthermore, the results illustrate that the generated counterfactual explanations are trustworthy, and can further improve the data utility in applications.

Abstract:
Deep neural networks pre-trained on ImageNet have demonstrated remarkable transferability for developing effective full-reference image quality assessment (FR-IQA) models. However, existing approaches typically demand pixel-level alignment between reference and distorted images—a requirement that poses significant challenges in practical scenarios involving natural photography and texture similarity evaluation. To address this limitation, we propose a novel FR-IQA model leveraging deep statistical similarity derived from pre-trained features without relying on spatial co-location of these features or requiring fine-tuning with mean opinion scores. Specifically, we employ distance correlation, a potent yet relatively underexplored statistical measure, to quantify similarity between reference and distorted images within a deep feature space. The distance correlation is computed via the ratio of the distance covariance to the product of their respective distance standard deviations, for which we derive a closed-form solution using the inner product of deep double-centered distance matrices. Extensive experimental evaluations across diverse IQA benchmarks demonstrate the superiority and robustness of the proposed model. Furthermore, we demonstrate the utility of our model for optimizing texture synthesis and neural style transfer tasks, achieving state-of-the-art performance in both quantitative measures and qualitative assessments. The implementation is publicly available at https://github.com/h4nwei/DeepDC
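The distance correlation described above can be sketched with the standard sample estimator built from double-centered distance matrices. This is a minimal pure-Python illustration on toy vectors, not the DeepDC implementation: the function names and the toy inputs are assumptions, and in the paper the samples would be deep features from a pre-trained network.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_inner_product(a, b):
    """Average elementwise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between paired samples x and y."""
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    dcov_sq = mean_inner_product(a, b)    # squared distance covariance
    dvar_x_sq = mean_inner_product(a, a)  # squared distance variance of x
    dvar_y_sq = mean_inner_product(b, b)  # squared distance variance of y
    denom = math.sqrt(dvar_x_sq * dvar_y_sq)
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov_sq, 0.0) / denom)


# Toy usage: y is an exact linear function of x.
x = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [(1.0,), (3.0,), (5.0,), (7.0,)]
score = distance_correlation(x, y)
```

Because the estimator only needs the two samples to be paired and of equal size, x and y may have different dimensionalities, and no element-wise correspondence between their coordinates is required.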

Abstract:
Visual navigation is fundamental for embodied agents operating in expansive workspaces. The cognitive abilities of these agents form the essential basis for creating intelligent behavioral patterns. Memory and reasoning are vital components among these abilities. The former enhances decision-making by preserving a wide array of episodic spatio-temporal perception cues, while the latter allows proactive and advanced probabilistic inference of task distributions based on long-term experiences. Despite individual studies on these two cognitive modalities, their integration for enhanced decision-making presents a considerable challenge due to their substantial differences in representation and behavioral characteristics. In this paper, we introduce Semantic-based Multi-modal Cognitive Graph (SMCG) for intelligent visual navigation. This framework is distinguished by its unified semantic-level representation of both memory and reasoning capabilities. Specifically, SMCG, rather than directly memorizing perceptual features as per previous methods, records observed object sequences. Simultaneously, reasoning is based on a semantic relation graph that represents correlations among objects. We additionally develop a hierarchical cognition extraction (HCE) pipeline and employ it to decode cognitive cues within SMCG and situation-aware subgraphs, thereby enhancing intelligent navigation behavior. Experimental results in image-goal navigation show pronounced performance improvements, credited to the effective induction and rational application of heterogeneous cognitive modalities.

Abstract:
Data-driven deep neural networks (DNNs) are currently the dominant state-of-the-art framework for image dehazing. Their core driving force is learning to simulate the underlying hazy-to-clear mapping conveyed by massive paired training data. However, in real scenarios, it is extremely hard to guarantee that all collected ground-truth (GT) haze-free data are fully qualified, because natural weather can hardly be controlled and many conditions lie in a chaotic state between foggy and fog-free. Thus, unlike most supervised learning problems, image dehazing inherently suffers from partially faulty haze-free GT samples. Therefore, fully trusting the training data and solely pursuing a more powerful data-fitting model may not be a wise solution. To cope with this challenge, instead of pursuing greater fitting capacity, we intentionally reduce fitting flexibility to achieve greater robustness. The result is LPATR-Net, a novel dehazing framework equipped with a fitting-power suppression mechanism to resist intrinsically faulty GT. This solution does not involve any extra manual labeling. Specifically, the LPATR-Net architecture is built around a carefully designed, fitting-restrained learnable piecewise affine transformation regression. Since such a low-order linear regression structure can inherently fit only the majority of the data, the interference from the minority of unqualified GT samples is expected to be effectively suppressed. By further coupling it with a highly customized, multi-concern, high-accuracy dehazing fitting companion component, All-Mattering, the proposed LPATR-Net seamlessly integrates traditional majority-determined fixed-form regression with modern fully flexible data-driven deep learning. Extensive experiments have been conducted on five commonly used public datasets to verify its effectiveness. In addition, the wide-range transferability of the proposed core regression structure has also been experimentally confirmed. Source code is available at https://github.com/FeiChen829/LPATR-Net

Abstract:
Existing attention-based, label-free weakly supervised group activity recognition methods can automatically learn tokens related to the actors. However, they have difficulty generating sufficiently diverse token embeddings. To address this issue, we automatically obtain the grayscale motion mask of all moving objects based on the motion direction rather than the motion amplitude. A Motion-Guided Mask Generator (MGMG) module is proposed to estimate the attention region mask under the supervision of the grayscale motion mask. MGMG involves four parts. A correlation layer measures the relative displacement between two adjacent feature maps. A cosine attention mechanism is designed to reduce the module’s sensitivity to feature amplitude changes. A mask generator is built to generate the attention region mask. Finally, a specifically designed activation function is used to refine the attention region mask and to enhance its focus on actor motion regions. We also customize a normalized relative error loss function for the MGMG module. This loss addresses the value range mismatch between the estimated attention mask and the grayscale motion mask. Furthermore, a Motion Attention-Guided Relational Reasoning (MAGRR) framework is presented for the weakly supervised condition. It uses the MGMG module to estimate the attention region automatically, and a Spatial-temporal Aggregation Stack (SAS) module to activate the attention regions of the features at the spatial level and then transform them into multiple tokens, whose temporal dependencies and interrelationships are further captured by the attention mechanism. MAGRR achieves state-of-the-art performance on the Collective Activity dataset and the Collective Activity Extension dataset, and competitive performance on the Volleyball and NBA datasets.

Abstract:
With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce the Enhanced Depicted image Quality Assessment model (EDQA). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, and full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named EDQA-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that helps filter out low-quality responses. Experimental results demonstrate that EDQA significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications, including assessing web-downloaded images and ranking model-processed images. Code, datasets, and model weights have been released at https://depictqa.github.io/

Abstract:
Video restoration from low-resolution and low-frame-rate blurry sources remains challenging due to insufficient data priors. In this paper, we propose BVSR-EvD, leveraging event cameras and diffusion models to boost blurry video space-time super-resolution. Specifically, we identify three distinct data priors from event-video dual modalities: motion prior from events, content prior from videos, and physical prior from their integration, contributing to temporal stability, content preservation, and detail enhancement respectively. To effectively utilize these data priors, BVSR-EvD creates the Trident Diffusion Model (Trident-DM), which decomposes each denoising step into trident decoupling and adaptive self-composition stages. The former employs single-modal and dual-modal meta-networks to extract the three unique data priors, while the latter dynamically integrates them through learned prior-aware weight maps. BVSR-EvD achieves up to ×8 spatial super-resolution and ×64 temporal super-resolution from blurry videos, surpassing existing methods on public video datasets.

Abstract:
Video-based visible-infrared person re-identification (VVI-ReID) aims to match target pedestrians between visible and infrared videos and is widely applied in 24-hour surveillance systems. The key to VVI-ReID is learning modality-invariant and spatio-temporally invariant sequence-level representations to address challenges such as modality differences, spatio-temporal misalignment, and domain shift noise. However, existing methods predominantly emphasize reducing modality discrepancy while relatively neglecting temporal misalignment and domain shift noise reduction. To this end, this paper proposes a VVI-ReID framework called Feature Alignment Network (FA-Net) from the perspective of feature alignment, aiming to mitigate temporal misalignment. FA-Net comprises two main alignment modules: the Spatial-Temporal Alignment Module (STAM) and the Modality Distribution Constraint (MDC). STAM integrates global and local features to ensure alignment of individuals’ spatial representations. STAM also establishes temporal relationships by exploring inter-frame features to address cross-frame person feature matching. MDC utilizes a symmetric distribution loss to align the distributions of features from different modalities. In addition, the SAM Guidance Augmentation (SAM-GA) strategy is designed to transform the image space of RGB and IR frames to provide more informative and less noisy frame information. Extensive experimental results demonstrate the effectiveness of the proposed method, which surpasses existing state-of-the-art methods. Our code will be available at: https://github.com/code/FANet

Abstract:
Artifacts remain a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifacts lack clear boundaries, distinct shapes, and semantic consistency, and because no dedicated dataset for HDR artifacts exists, progress in direct artifact detection and recovery is impeded. To bridge the gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. First, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Second, we develop an effective HDR artifact detector (HADetector), a robust artifact detection model capable of accurately localizing HDR reconstruction artifacts. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a non-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: https://github.com/xinyueliii/hdr-artifact-detect-optimize

Affiliations: Hangzhou Institute of Technology, Xidian University, Hangzhou, China; School of Computer Science and Engineering, National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China; School of Computer Science and Technology, Xidian University, Xi’an, China; School of Computing and Mathematical Sciences, University of Leicester, Leicester, U.K.; Department of Mechanical Engineering, Faculty of Science and Engineering, Swansea University, Swansea, U.K.; School of Artificial Intelligence, Xidian University, Xi’an, China; Rapid-Rich Object Search (ROSE) Laboratory and the NTU-PKU Joint Research Institute, Nanyang Technological University, Singapore, Singapore

Abstract:
Current deep learning-based methods for remote sensing image dehazing have developed rapidly, yet they still commonly struggle to simultaneously preserve fine texture details and restore accurate colors. The fundamental reason lies in the insufficient modeling of high-frequency information that captures structural details, as well as the lack of effective constraints for color restoration. To address the insufficient modeling of global high-frequency information, we first develop an omni-directional high-frequency feature inpainting mechanism that leverages the wavelet transform to extract multi-directional high-frequency components. While maintaining the advantage of linear complexity, it models global long-range texture dependencies through cross-frequency perception. Then, to further strengthen local high-frequency representation, we design a high-frequency prompt attention module that dynamically injects wavelet-domain optimized high-frequency features as cross-level guidance signals, significantly enhancing the model’s capability in edge sharpness restoration and texture detail reconstruction. Further, to alleviate the problem of inaccurate color restoration, we propose a color contrast loss function based on the HSV color space, which explicitly models the statistical distribution differences of brightness and saturation in hazy regions, guiding the model to generate dehazed images with consistent colors and natural visual appearance. Finally, extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing approaches in both texture detail restoration and color consistency. Further results and code are available at: https://github.com/fyxnl/C4RSD
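
As a rough illustration of what comparing brightness and saturation statistics in HSV space can look like, the following minimal sketch matches the mean and standard deviation of the S and V channels between a predicted and a reference image. It is not the loss from the paper: the function names `hsv_stats` and `hsv_statistics_loss`, the choice of mean and standard deviation, and the absolute-difference penalty are all assumptions made for illustration, and a training loss would need a differentiable tensor implementation.

```python
import colorsys
from statistics import mean, pstdev


def hsv_stats(pixels):
    """Return ((mean, std) of saturation, (mean, std) of value) for RGB pixels in [0, 1]."""
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    sat = [s for _, s, _ in hsv]
    val = [v for _, _, v in hsv]
    return (mean(sat), pstdev(sat)), (mean(val), pstdev(val))


def hsv_statistics_loss(predicted, reference):
    """Hypothetical statistic-matching term: sum of absolute gaps in S/V mean and std."""
    (ps_m, ps_s), (pv_m, pv_s) = hsv_stats(predicted)
    (rs_m, rs_s), (rv_m, rv_s) = hsv_stats(reference)
    return (abs(ps_m - rs_m) + abs(ps_s - rs_s)
            + abs(pv_m - rv_m) + abs(pv_s - rv_s))


# Toy example: a washed-out (hazy-looking) image versus a saturated reference.
clear = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
hazy = [(0.8, 0.6, 0.6), (0.6, 0.6, 0.8)]
loss = hsv_statistics_loss(hazy, clear)
```

In this toy case the washed-out pixels have lower saturation and brightness than the reference, so the term is positive, and it is zero when both images share the same S and V statistics.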

Abstract:
Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to addressing missing modalities predominantly involve training separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be directly applied to missing-modality semantic segmentation tasks with minimal parameter updates. Central to HARM3 is a text-instructed missing modality prompt module, which learns multimodal semantic knowledge by utilizing available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. Besides, to further enhance the model’s robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing modality scenarios.

Abstract:
In the domain of 3D Human Pose Estimation, which finds widespread daily applications, the requirement for convenient acquisition equipment continues to grow. To satisfy this demand, we focus on a short-baseline binocular setup that offers both portability and a geometric measurement capability that significantly reduces depth ambiguity. However, as the binocular baseline shortens, two serious challenges emerge: first, the robustness of 3D reconstruction against 2D errors deteriorates; second, occlusion reoccurs frequently due to the limited visual differences between two views. To address the first challenge, we propose the Stereo Co-Keypoints Estimation module to improve the view consistency of 2D keypoints and enhance the 3D robustness. In this module, the disparity is utilized to represent the correspondence of binocular 2D points, and the Stereo Volume Feature (SVF) is introduced to contain binocular features across different disparities. Through the regression of SVF, two-view 2D keypoints are simultaneously estimated in a collaborative way which restricts their view consistency. Furthermore, to deal with occlusions, a Pre-trained Pose Transformer module is introduced. Through this module, 3D poses are refined by perceiving pose coherence, a representation of joint correlations. This perception is injected by the Pose Transformer network and learned through a pre-training task that recovers iterative masked joints. Comprehensive experiments on H36M and MHAD datasets validate the effectiveness of our approach in the short-baseline binocular 3D Human Pose Estimation and occlusion handling.

Abstract:
Controllable 3D-aware scene synthesis seeks to disentangle the various latent codes in the implicit space, enabling the generation network to create highly realistic images with 3D consistency. Recent approaches often integrate Neural Radiance Fields with the upsampling method of StyleGAN2, employing convolutions with style modulation to transform spatial coordinates into frequency domain representations. Our analysis indicates that this approach can give rise to a bubble phenomenon in StyleNeRF. We argue that the style modulation introduces extraneous information into the implicit space, disrupting 3D implicit modeling and degrading image quality. We introduce HomuGAN, incorporating two key improvements. First, we disentangle the style modulation applied to implicit modeling from that utilized for super-resolution, thus alleviating the bubble phenomenon. Second, we introduce Cylindrical Spatial-Constrained Sampling and Parabolic Sampling. The latter, as an alternative to the former, specifically contributes to the performance of foreground modeling of vehicles. We evaluate HomuGAN on publicly available datasets, comparing its performance to existing methods. Empirical results demonstrate that our model achieves the best performance and exhibits strong disentanglement capability. Moreover, HomuGAN addresses the training instability problem observed in StyleNeRF and reduces the bubble phenomenon.

Abstract:
With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a topic in high demand. However, all existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrate. In recent years, the rapid development of the Large Multimodal Model (LMM) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder that extracts the semantic information of the image, a map encoder that locates the region corresponding to each semantic, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image based on the above information. Experimental results show that our proposed MISC is suitable for compressing both traditional Natural Sense Images (NSIs) and emerging AI-Generated Images (AIGIs) content. It can achieve optimal consistency and perception results while saving 50% bitrate, which gives it strong potential for applications in the next generation of storage and communication. The code will be released at https://github.com/lcysyzxdxc/MISC.

Abstract:
Safe reinforcement learning aims to ensure optimal performance while minimizing potential risks. In real-world applications, especially in scenarios that rely on visual inputs, a key challenge lies in extracting the features essential for safe decision-making while maintaining sample efficiency. To address this issue, we propose constrained visual representation learning with bisimulation metrics for safe reinforcement learning (CVRL-BM). CVRL-BM constructs a sequential conditional variational inference model to compress high-dimensional visual observations into low-dimensional state representations. Additionally, safety bisimulation metrics are introduced to quantify the behavioral similarity between states, and our objective is to make the distance between any two latent state representations as close as possible to the safety bisimulation metric between their corresponding states. By integrating these two components, CVRL-BM is able to learn compact and information-rich visual state representations while satisfying predefined safety constraints. Experiments on Safety Gym show that CVRL-BM outperforms existing vision-based safe reinforcement learning methods in safety and efficacy. In particular, CVRL-BM surpasses the state-of-the-art Safe SLAC method by achieving a 19.748% higher reward return, a 41.772% lower cost return, and a 5.027% decrease in cost regret. These results highlight the effectiveness of our proposed CVRL-BM.
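
One way to write down the representation objective described in this abstract is sketched below. The notation is assumed rather than taken from the paper: $\phi$ denotes the encoder that maps a state to its latent representation, $d_{\text{safe}}$ denotes the safety bisimulation metric, and the choice of norm and the squared penalty are illustrative.

```latex
\mathcal{L}_{\text{bisim}}(\phi)
  = \mathbb{E}_{(s_i, s_j)}
    \Big[ \big( \lVert \phi(s_i) - \phi(s_j) \rVert
          - d_{\text{safe}}(s_i, s_j) \big)^{2} \Big]
```

Minimizing such a term pulls the latent distance between two states toward their safety bisimulation distance, so states that behave similarly with respect to reward and cost end up close in the latent space.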

Abstract:
Recent advances in deep learning algorithms have shown impressive progress in image copy-move forgery detection (CMFD). However, these algorithms lack generalizability in practical scenarios where the copied regions are not present in the training images, or the cloned regions are part of the background. Additionally, these algorithms utilize convolution operations to distinguish source and target regions, leading to unsatisfactory results when the target regions blend well with the background. To address these limitations, this study proposes a novel end-to-end CMFD framework that integrates the strengths of conventional and deep learning methods. Specifically, the study develops a deep cross-scale PatchMatch (PM) method that is customized for CMFD to locate copy-move regions. Unlike existing deep models, our approach utilizes features extracted from high-resolution scales to seek explicit and reliable point-to-point matching between source and target regions. Furthermore, we propose a novel pairwise rank learning framework to separate source and target regions. By leveraging the strong prior of point-to-point matches, the framework can identify subtle differences and effectively discriminate between source and target regions, even when the target regions blend well with the background. Our framework is fully differentiable and can be trained end-to-end. Comprehensive experimental results highlight the remarkable generalizability of our scheme across various copy-move scenarios, significantly outperforming existing methods.

Abstract:
Visible-depth-thermal (VDT) salient object detection (SOD) aims to highlight the most visually attractive object by utilizing triple-modal cues. However, existing models do not sufficiently explore multi-modal correlations and differentiation, which leads to unsatisfactory detection performance. In this paper, we propose an interaction, fusion, and enhancement network (IFENet) for the VDT SOD task, which contains three key steps: multi-modal interaction, multi-modal fusion, and spatial enhancement. Specifically, building on a Transformer backbone, our IFENet acquires multi-scale multi-modal features. First, the inter-modal and intra-modal graph-based interaction (IIGI) module is deployed to explore inter-modal channel correlation and intra-modal long-term spatial dependency. Second, the gated attention-based fusion (GAF) module is employed to purify and aggregate the triple-modal features, where multi-modal features are filtered along the spatial, channel, and modality dimensions, respectively. Lastly, the frequency split-based enhancement (FSE) module separates the fused feature into high-frequency and low-frequency components to enhance the spatial information (i.e., boundary details and object location) of the salient object. Extensive experiments are performed on the VDT-2048 dataset, and the results show that our saliency model consistently outperforms 13 state-of-the-art models. Our code and results are available at https://github.com/Lx-Bao/IFENet.

Abstract:
As a challenging computer vision task, Scene Graph Generation (SGG) finds the latent semantic relationships among objects in a given image, and may be limited by the datasets and real-world scenarios. In this paper, we consider a novel incremental learning task called Relationship-Incremental Scene Graph Generation (RISGG) that learns the semantic relationships among objects in an incremental way. Compared with the classic Class-Incremental Learning (CIL) problem, RISGG suffers from its own special issues: 1) Old class shift – the relationship-labeled object pair may have different labels during different learning sessions; 2) Background shift – the relationship-unlabeled object pair may not be a real unlabeled one. In this work, we address the above issues from the following aspects. First, we present a Divide-and-Conquer (DaC) pipeline to deal with the old class shift by decoupling the recognition of relationship classes and recognizing relationships individually. In this way, label confusion and interaction among different relationships are eliminated during training. Second, we propose a Feature Adapter (FA) to bridge the feature space gap between the current session and the previous one and use our extra supervision to mine old relationship information in the current session. Our proposed network for RISGG, abbreviated DaCFA-Net, combines DaC and FA. Experimental results on the benchmark dataset demonstrate the significant performance gain of DaCFA-Net in RISGG. It gains about 20% improvement over the SGG baselines on the popular VG dataset.

Abstract:
Few-shot learning (FSL) has developed rapidly in hyperspectral image (HSI) classification, potentially eliminating time-consuming and costly labeled data acquisition requirements. Effective feature embedding is empirically significant in FSL methods, but remains challenging for HSIs with rich spectral-spatial information. In addition, compared with inductive FSL, transductive models typically perform better as they explicitly leverage the statistics in the query set. To this end, we devise a transductive FSL framework with enhanced spectral-spatial embedding (TEFSL) to fully exploit the limited prior information available. First, to improve the informative features and suppress the redundant ones contained in the HSI, we devise an attentive feature embedding network (AFEN) comprising a channel calibration module (CCM). Next, a meta-feature interaction module (MFIM) is designed to optimize the support and query features by learning adaptive co-attention using convolutional filters. During inference, we propose an iterative graph-based prototype refinement scheme (iGPRS) to achieve test-time adaptation, making the class centers more representative in a transductive learning manner. Extensive experimental results on four standard benchmarks demonstrate the superiority of our model with only a handful (i.e., from 1 to 5) of labeled samples. The code will be available online at https://github.com/B-Xi/TIP_2025_TEFSL.

Abstract:
Although self-supervised learning approaches have demonstrated tremendous potential in multi-frame depth estimation scenarios, existing methods struggle to perform well in cases involving dynamic targets and static ego-camera conditions. To address this issue, we propose a self-supervised monocular depth estimation method featuring dual-path encoders and learnable offset interpolation (LOI). First, we construct a dual-path encoding scheme that utilizes residual and transformer blocks to extract both single- and multi-frame features from the input frames. We design a contrastive learning strategy to effectively decouple single- and multi-frame features, enabling weighted fusion guided by a confidence map. Next, we explore two distinct decoding heads for simultaneously generating low-resolution predictions and offset fields. We then design an LOI module to directly upsample a low-resolution depth map to a full-resolution map. This one-step decoding framework enables accurate and efficient depth prediction. Finally, we evaluate our proposed method on the KITTI and Cityscapes benchmarks, conducting a comprehensive comparison with state-of-the-art approaches. The experimental results demonstrate that our DualDepth method achieves competitive performance in terms of both estimation accuracy and efficiency.

Affiliations: Information Science and Technology College, Dalian Maritime University, Dalian, Liaoning, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology, Dalian, Liaoning, China; Department of Electrical and Computer Engineering, National University of Singapore, Queenstown, Singapore; Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong (Shenzhen), Shenzhen, Guangdong, China

Abstract:
Brain-inspired Spiking Neural Networks (SNNs) work in an event-driven manner and have an implicit recurrence in the neuronal membrane potential to memorize information over time, making them inherently suitable for handling temporal event-based streams. Despite their temporal nature and recent methodological advances, SNNs have predominantly been assessed on event-based classification tasks. In this paper, we explore the utility of SNNs for event-based tracking tasks. Specifically, we propose a brain-inspired adaptive Leaky Integrate-and-Fire neuron (BA-LIF) that can adaptively adjust the membrane time constant according to the inputs, thereby accelerating the leakage of meaningless noise features and reducing the decay of valuable information. SNNs composed of our proposed BA-LIF neurons can achieve high performance without a careful and time-consuming trial-and-error initialization of the membrane time constant. The adaptive capability of our network is further improved by introducing an extra temporal feature aggregator (TFA) that assigns attention weights over the temporal dimension. Extensive experiments on various event-based tracking datasets validate the effectiveness of our proposed method. We further validate the generalization capability of our method by applying it to other event-classification tasks.
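
The idea of a leaky integrate-and-fire neuron whose leak depends on its input can be sketched as follows. This is a simplified illustration, not the BA-LIF formulation from the paper: the sigmoid gate, the parameters `w_tau` and `b_tau`, the hard reset, and the premise that larger inputs are the more valuable ones are all assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_lif(inputs, w_tau=1.0, b_tau=0.0, v_threshold=1.0):
    """Simulate one LIF neuron whose leak factor depends on the current input.

    Hypothetical gating: decay = sigmoid(w_tau * x + b_tau), so a larger
    input retains more of the previous membrane potential (slower leak).
    """
    v = 0.0
    spikes = []
    for x in inputs:
        decay = sigmoid(w_tau * x + b_tau)  # input-dependent leak in (0, 1)
        v = decay * v + x                   # leaky integration of the input
        if v >= v_threshold:
            spikes.append(1)
            v = 0.0                         # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


# Weak inputs leak away without firing; a strong input triggers a spike.
spike_train = adaptive_lif([0.6, 0.6, 0.0, 0.0, 1.2])
```

With a fixed time constant, `decay` would be a hand-tuned constant; making it a function of the input is what removes the need to initialize it by trial and error in this sketch.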

Abstract:
Conventional spectral image demosaicing algorithms rely on pixels’ spatial or spectral correlations for reconstruction. Due to the missing data in the multispectral filter array (MSFA), the estimation of spatial or spectral correlations is inaccurate, leading to poor reconstruction results, and these algorithms are time-consuming. Deep learning-based spectral image demosaicing methods directly learn the nonlinear mapping between 2D spectral mosaic images and 3D multispectral images. However, these learning-based methods focus only on learning the mapping in the spatial domain and neglect valuable image information in the frequency domain, resulting in limited reconstruction quality. To address the above issues, this paper proposes a novel lightweight spectral image demosaicing method based on joint spatial and frequency domain information learning. First, a novel parameter-free spectral image initialization strategy based on the Fourier transform is proposed, which yields better-initialized spectral images and eases the difficulty of subsequent spectral image reconstruction. Furthermore, an efficient spatial-frequency transformer network is proposed, which jointly learns the spatial correlations and the frequency domain characteristics. Compared to existing learning-based spectral image demosaicing methods, the proposed method significantly reduces the number of model parameters and the computational complexity. Extensive experiments on simulated and real-world data show that the proposed method notably outperforms existing spectral image demosaicing methods.

Abstract:
This paper aims to restore original background images in watermarked videos, overcoming challenges posed by traditional approaches that fail to handle the temporal dynamics and diverse watermark characteristics effectively. Our method introduces a unique framework that first “decouples” the extraction of prior knowledge—such as common-sense knowledge and residual background details—from the temporal modeling process, allowing for independent handling of background restoration and temporal consistency. Subsequently, it “couples” these extracted features by integrating them into the temporal modeling backbone of a video inpainting (VI) framework. This integration is facilitated by a specialized module, which includes an intrinsic background image prediction sub-module and a dual-branch frame embedding module, designed to reduce watermark interference and enhance the application of prior knowledge. Moreover, a frame-adaptive feature selection module dynamically adjusts the extraction of prior features based on the corruption level of each frame, ensuring their effective incorporation into the temporal processing. Extensive experiments on YouTube-VOS and DAVIS datasets validate our method’s efficiency in watermark removal and background restoration, showing significant improvement over state-of-the-art techniques in visible image watermark removal, video restoration, and video inpainting.

Abstract:
Domain adaptation aims to transfer abundant label information from a source domain to an unlabeled target domain whose distribution differs from that of the source. Existing methods usually rely on a classifier to generate high-quality pseudo-labels for the target domain, facilitating the learning of discriminative features. Label propagation (LP), as an effective classifier, propagates labels from the source domain to the target domain by designing a smooth function over a similarity graph, which represents structural relationships among data points in feature space. However, LP has not been thoroughly explored in deep neural network-based domain adaptation approaches. Additionally, the probability labels generated by LP have low confidence, and LP is sensitive to class imbalance. To address these problems, we propose a novel approach for domain adaptation named deep label propagation with nuclear norm maximization (DLP-NNM). Specifically, we employ the constraint of nuclear norm maximization to enhance both label confidence and class diversity in LP and propose an efficient algorithm to solve the corresponding optimization problem. Subsequently, we utilize the proposed LP to guide the classifier layer in a deep discriminative adaptation network using the cross-entropy loss. As such, the network can produce more reliable predictions for the target domain, thereby facilitating more effective discriminative feature learning. Extensive experimental results on three cross-domain benchmark datasets demonstrate that the proposed DLP-NNM surpasses existing state-of-the-art domain adaptation approaches.
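
For readers unfamiliar with label propagation, the classic iterative scheme over a similarity graph can be sketched as below. This shows only the standard baseline that the abstract builds on, not DLP-NNM: the nuclear norm maximization constraint is deliberately omitted, and the function name, the `alpha` value, and the toy graph are assumptions for illustration.

```python
def label_propagation(weights, labels, alpha=0.8, iterations=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y on a similarity graph.

    weights: symmetric n x n non-negative similarity matrix, zero diagonal.
    labels:  n x c one-hot rows for labeled (source) points and all-zero
             rows for unlabeled (target) points.
    """
    n, c = len(weights), len(labels[0])
    degree = [sum(row) for row in weights]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    s = [[weights[i][j] / (degree[i] * degree[j]) ** 0.5 if weights[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    f = [row[:] for row in labels]
    for _ in range(iterations):
        f = [[alpha * sum(s[i][k] * f[k][j] for k in range(n))
              + (1 - alpha) * labels[i][j] for j in range(c)]
             for i in range(n)]
    return f


# Toy graph: nodes 0 and 3 are labeled (classes 0 and 1); nodes 1 and 2 are not.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
scores = label_propagation(W, Y)
pseudo_labels = [row.index(max(row)) for row in scores]
```

Each unlabeled node takes the label of the labeled node it is most strongly connected to. The soft scores of unlabeled nodes can be far from one-hot, which is the low-confidence issue the abstract sets out to address.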

Abstract:
The fast-growing application of omnidirectional images calls for effective approaches for omnidirectional image quality assessment (OIQA). Existing OIQA methods have been developed and tested on homogeneously distorted omnidirectional images, but it is hard to transfer their success directly to heterogeneously distorted omnidirectional images. In this paper, we conduct the largest study so far on OIQA, where we establish a large-scale database called OIQ-10K containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions. A comprehensive psychophysical study is conducted to collect human opinions for each omnidirectional image, together with the spatial distributions (within local regions or globally) of distortions, and the head and eye movements of the subjects. Furthermore, we propose a novel multitask-derived adaptive feature-tailoring OIQA model named IQCaption360, which is capable of generating a quality caption for an omnidirectional image in the form of a textual template. Extensive experiments demonstrate the effectiveness of IQCaption360, which outperforms state-of-the-art methods by a significant margin on the proposed OIQ-10K database. The OIQ-10K database and the related source code are available at https://github.com/WenJuing/IQCaption360.

Abstract:
Collaborative learning has gained significant traction for training deep learning models without sharing the original data of participants, particularly when dealing with sensitive data such as facial images. However, gradient inversion attacks, which progressively reconstruct private data from gradients, have proven successful in extracting private training data. Nonetheless, our observations reveal that these methods exhibit suboptimal performance in face reconstruction and result in the loss of numerous facial details. In this paper, we propose DFLeak, an effective approach that boosts face leakage from gradients using residual optimization and undermines the privacy of facial applications in collaborative learning. In particular, we first introduce a superior initialization method to stabilize the inversion process. Second, we propose to integrate prior-free face restoration (PFFR) results into the gradient inversion optimization process in a residual manner, which enriches facial details. We further design a pixel update schedule to mitigate the adverse effects of image regularization terms and preserve fine facial details. Comprehensive experimentation demonstrates the effectiveness of our approach in achieving more realistic and higher-quality facial image reconstructions, surpassing the performance of state-of-the-art gradient inversion attacks.

Abstract:
Estimating the 6D pose of an object from a single RGB image is a critical task that becomes additionally challenging when dealing with symmetric objects. Recent approaches typically establish one-to-one correspondences between image pixels and 3D object surface vertices. However, the utilization of one-to-one correspondences introduces ambiguity for symmetric objects. To address this, we propose SymCode, a symmetry-aware surface encoding that encodes the object surface vertices based on one-to-many correspondences, eliminating the problem of one-to-one correspondence ambiguity. We also introduce SymNet, a fast end-to-end network that directly regresses the 6D pose parameters without solving a PnP problem. We demonstrate faster runtime and comparable accuracy achieved by our method on the T-LESS and IC-BIN benchmarks of mostly symmetric objects. The code is available at https://github.com/lyltc1/SymNet.

Abstract:
Single-image 3D shape reconstruction has attracted significant attention with the advance of generative models. Recent studies have utilized diffusion models to achieve unprecedented shape reconstruction quality. However, these methods, in each sampling step, perform denoising in a single forward pass, leading to cumulative errors that severely impact the geometric consistency of the generated shapes with the input targets and face difficulties in reconstructing rich details of complex 3D shapes. Moreover, the performance of current works suffers significant degradation due to limited information when only a single image is used as input during testing, further affecting the quality of 3D shape generation. In this paper, we present a recurrent diffusion framework, aiming to improve generation performance during single image-to-shape inference. Diverging from denoising in a single forward pass, we recursively refine the noise prediction in a self-rectified manner with the explicit guidance of the input target, thereby markedly suppressing cumulative errors and improving detail modeling. To enhance the geometric perception ability of the network during single-image inference, we further introduce a multi-view training scheme equipped with a view-robust conditional generation mechanism, which effectively promotes generation quality even when only a single image is available during inference. The effectiveness of our method is demonstrated through extensive evaluations on two public 3D shape datasets, where it surpasses state-of-the-art methods both qualitatively and quantitatively.

Abstract:
Composite images (CIs) have experienced unprecedented growth, especially with the prosperity of a large number of generative AI technologies. They are usually created by combining multiple visual elements from different sources to form a single cohesive composition, and they have an increasing impact on a variety of vision applications. However, transmission of CIs can degrade their visual quality, especially when they undergo lossy compression to reduce bandwidth and storage. To facilitate the development of objective measurements for CIs and investigate the influence of compression distortions on their perception, we establish a compression-oriented image quality assessment (CIQA) database for CIs (called ciCIQA) with 30 typical encoding distortions. Using six representative codecs, we have carried out a large-scale subjective experiment that delivered 3,000 encoded CIs with labeled quality scores, making ciCIQA one of the earliest CI databases with the most compression types. ciCIQA enables us to explore the encoding effects on visual quality from the first five just noticeable difference (JND) points, offering insights for perceptual CI compression and related tasks. Moreover, we have proposed a new multi-masked no-reference CIQA method (called mmCIQA), including a multi-masked quality representation module, a self-supervised quality alignment module, and a multi-masked attentive fusion module. Experimental results demonstrate the outstanding performance of our mmCIQA in assessing the quality of CIs, outperforming 17 competitive approaches. The proposed method and database as well as the collected objective metrics are made publicly available at https://charwill.github.io/mmciqa.html.

Affiliations: Department of Electrical and Electronic Engineering, The University of Hong Kong, Kowloon Tong, Hong Kong; Chinese Academy of Sciences, MAIS, Institute of Automation, Beijing, China; Department of Computer Science, Northwestern University, Evanston, IL, USA; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China

Abstract:
Local feature detection and description, which aim to detect and describe keypoints in any scene for any downstream task, play an important role in many computer vision tasks. Data-driven local feature learning methods need to rely on pixel-level correspondence for training. However, a vast number of existing approaches ignore the semantic information on which humans rely to describe image pixels. In addition, it is not feasible to enhance generic scene keypoint detection and description simply by using traditional semantic segmentation models because they can only recognize a limited number of coarse-grained object classes. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a foundation model trained on 11 million images, as a teacher to guide local feature learning. SAMFeat learns additional semantic information brought by SAM and thus achieves higher performance even with limited training samples. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which adaptively distills feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat’s performance on various tasks, such as image matching on HPatches and long-term visual localization on Aachen Day-Night, showcases its superiority over previous local features. The released code is available at https://github.com/vignywang/SAMFeat.

Affiliations: Academy for Engineering and Technology, Fudan University, Shanghai, China; College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, China; Division of Natural and Applied Sciences, Duke Kunshan University, Suzhou, Jiangsu, China; DataLab: Data Science and Informatics, University of California at Davis, Davis, CA, USA; Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada; China Mobile Chengdu Institute of Research and Development, Chengdu, Sichuan, China

Abstract:
Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Although such methods have made significant progress benefiting from the development of deep learning, they attempt to model the statistical dependency between observable videos and semantic labels, which is a crude description of normality and lacks a systematic exploration of its underlying causal relationships. Previous studies have shown that existing unsupervised VAD models are incapable of handling label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variables in unsupervised video normality learning. Specifically, building on structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

Abstract:
In this paper, we propose a new approach to train deep learning models using game theory concepts, including Generative Adversarial Networks (GANs) and Adversarial Training (AT), where we deploy a double-oracle framework using best response oracles. A GAN is essentially a two-player zero-sum game between the generator and the discriminator. The same concept can be applied to AT with the attacker and the classifier as players. Training these models is challenging, as a pure Nash equilibrium may not exist and even finding the mixed Nash equilibrium is difficult because training algorithms for both GAN and AT have a large-scale strategy space. Extending our preliminary model DO-GAN, we propose methods to apply the double oracle framework concept to Adversarial Neural Architecture Search (NAS for GAN) and Adversarial Training (NAS for AT) algorithms. We first generalize the players’ strategies as the trained models of generator and discriminator from the best response oracles. We then compute the meta-strategies using a linear program. For scalability of the framework, where multiple network models of best responses are stored in memory, we prune the weakly-dominated players’ strategies to keep the oracles from becoming intractable. Finally, we conduct experiments on MNIST, CIFAR-10 and TinyImageNet for DONAS-GAN. We also evaluate the robustness under FGSM and PGD attacks on CIFAR-10, SVHN and TinyImageNet for DONAS-AT. We show that all our variants have significant improvements in both subjective qualitative evaluation and quantitative metrics, compared with their respective base architectures.
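The pruning step mentioned above can be pictured with the textbook definition of weak dominance in a matrix game. The abstract does not state the exact pruning criterion, so the following is only a sketch under that assumption; the function name and the payoff matrix are hypothetical.

```python
def prune_weakly_dominated_rows(payoff):
    """Return indices of row strategies that are not weakly dominated.

    payoff[i][j] is the row player's payoff when row strategy i meets
    column strategy j. Row i is weakly dominated by row k if k is at least
    as good against every column strategy and strictly better against one.
    """
    def dominates(k, i):
        at_least = all(pk >= pi for pk, pi in zip(payoff[k], payoff[i]))
        strictly = any(pk > pi for pk, pi in zip(payoff[k], payoff[i]))
        return at_least and strictly

    n = len(payoff)
    return [i for i in range(n)
            if not any(dominates(k, i) for k in range(n) if k != i)]

# Hypothetical payoffs for three stored row strategies against two column strategies.
payoff = [
    [3, 1],
    [2, 1],  # never better than row 0, and worse against column 0
    [0, 4],
]
kept = prune_weakly_dominated_rows(payoff)
print(kept)  # [0, 2]
```

Dropping such strategies shrinks the restricted game that the meta-strategy solver has to handle, which is the scalability motivation given in the abstract.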

Abstract:
Webly-supervised fine-grained visual classification (WSL-FGVC) aims to learn similar sub-classes from cheap web images, which suffers from two major issues: label noise in web images and subtle differences among fine-grained classes. However, existing methods for WSL-FGVC only focus on suppressing noise at the image level, but neglect to mine cues at the pixel level to distinguish the subtle differences among fine-grained classes. In this paper, we propose a bag-level top-down attention framework, which tackles label noise and mines subtle cues simultaneously and integrally. Specifically, our method first extracts high-level semantic information from a bag of images belonging to the same class, and then uses the bag-level information to mine discriminative regions at various scales of each image. Besides, we propose to derive attention weights from attention maps to weight the bag-level fusion for robust supervision. We also propose an attention loss on self-bag attention and cross-bag attention to facilitate the learning of valid attention. Extensive experiments on four WSL-FGVC datasets, i.e., Web-Aircraft, Web-Bird, Web-Car, and WebiNat-5089, demonstrate the effectiveness of our method against the state-of-the-art methods.

Affiliations: Key Laboratory of Optoelectronic Science and Technology for Medicine of Ministry of Education, Fujian Provincial Key Laboratory of Photonics Technology, Fujian Provincial Engineering Technology Research Center of Photoelectric Sensing Application, College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, China; Department of Pathology, The First Affiliated Hospital of Fujian Medical University, Fuzhou, China; Department of Pathology, The Fuzhou First Hospital, Fuzhou, China; Department of Anatomical and Cellular Pathology, State Key Laboratory of Translational Oncology, Prince of Wales Hospital, The Chinese University of Hong Kong, Hong Kong, China

Abstract:
Tumor-stroma ratio (TSR), which is the area ratio between two components within tumor beds, namely tumor cells and tumor stroma, has been suggested as a promising prognostic feature in breast cancers. However, due to imperfect datasets and the similarity between tumor stroma and non-tumor stroma, previous algorithms struggle to delineate tumor beds, especially those of histomorphologies with a fibrotic focus. To overcome these limitations, we propose a novel ray-aided quadruple affiliation network (RQA-Net) for calculating TSRs in breast cancers. RQA-Net uses quadruple branches to segment tumor cells and tumor beds simultaneously, where a crisscross task subtraction module (CTS-Module) is designed to locate tumor stroma, grounded on its affiliation relationships with tumor beds. Moreover, we propose an affiliation loss (Aff-Loss) to force identified tumor beds to incorporate tumor cells to enhance their affiliation relationships. Furthermore, we propose a ray-based hypothesis testing (RH-Testing) to obtain line segments from ray equations in tumor beds that can decorate identified tumor beds by overlapping. In summary, RQA-Net precisely predicts tumor cells and tumor beds, and thus supports the calculation of TSRs. We also create a cancerous dataset (CrD-Set) containing 100 slides with an average resolution of 50,000 × 50,000 pixels from real breast cancer cases, which is the first dataset with pixel-wise tumor bed annotations. Experimental results on existing datasets and CrD-Set demonstrate that, compared with previous methods, RQA-Net better calculates breast cancer TSRs by precisely identifying tumor cells and tumor beds. The created CrD-Set and code in this work will be available online at https://github.com/Kunpingyang1992/Breast-Cancer-TSR-Calculation
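As a worked illustration of how a TSR could be derived once tumor beds and tumor cells are segmented, the sketch below operates on two toy binary masks. It rests on two assumptions that the abstract does not spell out: tumor stroma is taken as the tumor bed minus the tumor cells, and the ratio is reported as stroma area over tumor bed area. The function name and masks are hypothetical and this is not RQA-Net itself.

```python
def tumor_stroma_ratio(tumor_bed, tumor_cells):
    """Toy TSR from two binary masks given as equally sized lists of rows.

    Assumption: stroma = pixels inside the tumor bed that are not tumor
    cells, and TSR = stroma area / tumor bed area.
    """
    bed_area = 0
    stroma_area = 0
    for bed_row, cell_row in zip(tumor_bed, tumor_cells):
        for in_bed, is_cell in zip(bed_row, cell_row):
            if in_bed:
                bed_area += 1
                if not is_cell:
                    stroma_area += 1
    if bed_area == 0:
        raise ValueError("tumor bed mask is empty")
    return stroma_area / bed_area

# Hypothetical 3 x 4 masks: 8 tumor-bed pixels, 3 of which are tumor cells.
tumor_bed = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
tumor_cells = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(tumor_stroma_ratio(tumor_bed, tumor_cells))  # 0.625
```

Because the stroma is obtained by subtraction, any error in the tumor bed boundary propagates directly into the ratio, which is why the abstract stresses precise tumor bed delineation.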

Abstract:
Source-free cross-modal knowledge transfer is a crucial yet challenging task, which aims to transfer knowledge from one source modality (e.g., RGB) to the target modality (e.g., depth or infrared) with no access to the task-relevant (TR) source data due to memory and privacy concerns. A recent attempt leverages the paired task-irrelevant (TI) data and directly matches the features from them to eliminate the modality gap. However, it ignores a pivotal clue that the paired TI data could be utilized to effectively estimate the source data distribution and better facilitate knowledge transfer to the target modality. To this end, we propose a novel yet concise framework to unlock the potential of paired TI data for enhancing source-free cross-modal knowledge transfer. Our work is buttressed by two key technical components. Firstly, to better estimate the source data distribution, we introduce a Task-irrelevant data-Guided Modality Bridging (TGMB) module. It translates the target modality data into the source-like images based on paired TI data and the guidance of the available source model to alleviate two key gaps: 1) inter-modality gap between the paired TI data; 2) intra-modality gap between TI and TR target data. We then propose a Task-irrelevant data-Guided Knowledge Transfer (TGKT) module that transfers knowledge from the source model to the target model by leveraging the paired TI data. Notably, due to the unavailability of labels for the TR target data and its less reliable prediction from the source model, our TGKT model incorporates a self-supervised pseudo-labeling approach to enable the target model to learn from its predictions. Extensive experiments show that our method achieves state-of-the-art performance on three datasets (RGB-to-depth and RGB-to-infrared).

Abstract:
Camouflage poses notable challenges in distinguishing a static target, as it usually blends seamlessly with the background. However, any movement by the target can disrupt this disguise, making it detectable. Existing video camouflaged object detection (VCOD) approaches take noisy motion estimation as input or model motion implicitly, restricting detection performance in complex dynamic scenes. In this paper, we propose a novel Explicit Motion handling and Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP is characterized by a two-stream architecture for simultaneously conducting camouflaged segmentation and optical flow estimation. Interactions across the dual streams are realized in an interactive prompting way that is inspired by emerging visual prompt learning. Two learnable modules, i.e., the camouflaged feeder and motion collector, are designed to incorporate segmentation-to-motion and motion-to-segmentation prompts, respectively, and enhance the outputs of both streams. The prompt fed to the motion stream is learned by supervising optical flow in a self-supervised manner. Furthermore, we show that long-term historical information can also be incorporated as a prompt into EMIP to achieve more robust results with temporal consistency. By leveraging prompting techniques based on EMIP, the proposed long-term model EMIP† incurs lower training cost with only 8.5M trainable parameters (less than 8% of the total model parameters). Experimental results demonstrate that both EMIP and EMIP† set new state-of-the-art records on popular VCOD benchmarks. Additionally, comparative evaluations against other video segmentation models on a wider range of video segmentation tasks demonstrate the robustness and superior generalization capabilities of EMIP. Our code is made publicly available at https://github.com/zhangxin06/EMIP

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine, Jinan, China; Shandong Mental Health Center, Shandong University, Jinan, China; School of Computer Science, Guangdong University of Technology, Guangzhou, China; Department of Endocrinology, The Fifth Clinical College, Guangzhou University of Chinese Medicine, Guangzhou, China

Abstract:
Facial Expression Recognition (FER) is a critical method for evaluating the emotional states of patients with mental disorders, playing a significant role in treatment monitoring. However, due to privacy constraints, facial expression data from patients with mental disorders is severely limited. Additionally, the more complex inter-class and intra-class similarities compared to healthy individuals make accurate recognition of facial expressions challenging. Therefore, we propose a Voluntary Facial Expression Mimicry (VFEM) experiment, which collected facial expression data from patients with schizophrenia, depression, and anxiety. This experiment establishes the first dataset designed for facial expression recognition tasks that is exclusively composed of patients with mental disorders. Simultaneously, based on VFEM, we propose a Vision Transformer FER model tailored for Complex mental disorder patients (CmdVIT). CmdVIT integrates crucial facial expression features through both explicit and implicit mechanisms, including explicit visual center positional encoding and an implicit sparse attention center loss function. These two key components enhance positional information and minimize the facial feature space distance between conventional attention and critical attention, effectively suppressing inter-class and intra-class similarities. In various FER tasks for different mental disorders in VFEM, CmdVIT achieves more competitive performance compared to contemporary benchmark models. Our work is available at https://github.com/yjy-97/CmdVIT.

Abstract:
Point clouds are unordered sets of coordinates in 3D with no functional relation imposed on them. The Rigid Transformation Universal Manifold Embedding (RTUME) is a mapping of volumetric or surface measurements on a 3D object to matrices, such that when two observations on the same object are related by a rigid transformation, this relation is preserved between their corresponding RTUME matrices, thus providing a linear and robust solution to the registration and detection problems. To make the RTUME framework of 3D object detection and registration applicable for processing point cloud observations, there is a need to define a function that assigns each point in the cloud a value (feature vector) that is invariant to the action of the transformation group. Since existing feature extraction functions do not achieve the desired level of invariance to rigid transformations, to the variability of sampling patterns, and to model mismatches, we present a novel approach for designing dense feature extraction functions, compatible with the requirements of the RTUME framework. One possible implementation of the approach is to adapt existing feature extraction functions, whether learned or analytic, designed for the estimation of point correspondences, to the RTUME framework. The novel feature extraction function design employs integration over SO(3) to marginalize the pose dependency of extracted features, followed by projecting features between point clouds using nearest neighbor projection to overcome other sources of model mismatch. In addition, the non-linear functions that define the RTUME mapping are optimized using an MLP model, trained to minimize the RTUME registration errors. The overall RTUME registration performance is evaluated using standard registration benchmarks, and is shown to outperform existing SOTA methods.

Abstract:
Taking photographs through windows is an inevitable scenario in the real world, but glass windows are not ideally clean in most cases. Although there exist various raindrop removal methods, the occlusion caused by dirt, as another dirty window case, has not received adequate attention. The main reasons include i) the limitation of the optical imaging model proposed in previous methods, and ii) the shortage of a practical dataset covering sufficient types of dirty glass windows. To fill this research gap, in this paper, we first propose a general optical imaging model that fits widely encountered dirty window cases. Following this, training and testing synthetic datasets are generated, and real-world dirty window data are collected to evaluate the effectiveness of our imaging model and synthetic data. For the methodology part, we propose an optics-guided Transformer network to solve this special image restoration problem, i.e., dirt removal for images taken through a dirty window. Experimental results demonstrate that our imaging model is effective and robust. Our proposed network leads to higher performance than existing methods on both synthetic and real-world dirty window images. Code and data are available at https://github.com/Zongliang-Wu/ReDNet

Abstract:
Point cloud rigid registration is a fundamental problem in 3D computer vision. In the multiview case, we aim to find a set of 6D poses to align a set of objects. Methods based on pairwise registration rely on a subsequent synchronization algorithm, which makes them poorly scalable with the number of views. Generative approaches overcome this limitation, but are based on Gaussian Mixture Models and use an Expectation-Maximization algorithm. Hence, they are not well suited to handle large transformations. Moreover, most existing methods cannot handle high levels of degradations. In this paper, we introduce POLAR (POint cloud LAtent Registration), a multiview registration method able to efficiently deal with a large number of views, while being robust to a high level of degradations and large initial angles. To achieve this, we transpose the registration problem into the latent space of a pretrained autoencoder, design a loss taking degradations into account, and develop an efficient multistart optimization strategy. Our proposed method significantly outperforms state-of-the-art approaches on synthetic and real data. POLAR is available at github.com/pypolar/polar or as a standalone package which can be installed with pip install polaregistration.

Affiliations: Institute of Robotics, School of Biomedical Engineering, and the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China; First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Affiliated Cancer Hospital of Zhengzhou University, Zhengzhou, Henan, China; Cooperative Medianet Innovation Center and Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Transarterial Chemoembolization (TACE) is a widely applied alternative treatment for patients with hepatocellular carcinoma who are not eligible for liver resection or transplantation. However, the clinical outcomes after TACE are highly heterogeneous. There remains an urgent need for effective and efficient strategies to accurately assess tumor response and predict long-term outcomes using longitudinal and multi-center datasets. To address this challenge, we introduce RECISTSurv, a novel response-driven Transformer model that integrates multi-task learning with a response-driven co-attention mechanism to simultaneously perform liver and tumor segmentation, predict tumor response to TACE, and estimate overall survival based on longitudinal Computed Tomography (CT) imaging. The proposed Response-driven Co-attention layer models the interactions between pre-TACE and post-TACE features guided by the treatment response embedding. This design enables the model to capture complex relationships between imaging features, treatment response, and survival outcomes, thereby enhancing both prediction accuracy and interpretability. In a multi-center validation study, RECISTSurv-predicted prognosis has demonstrated higher precision than state-of-the-art methods, with C-indexes ranging from 0.595 to 0.780. Furthermore, when integrated with multi-modal data, RECISTSurv has emerged as an independent prognostic factor in all three validation cohorts, with hazard ratios (HRs) ranging from 1.693 to 20.7 (P = 0.001-0.042). Our results highlight the potential of RECISTSurv as a powerful tool for personalized treatment planning and outcome prediction in hepatocellular carcinoma patients undergoing TACE. The experimental code is made publicly available at https://github.com/rushier/RECISTSurv
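For readers unfamiliar with the C-index quoted above, the following is a minimal sketch of Harrell's concordance index on right-censored survival data. The abstract does not say which C-index variant the authors use, so the standard definition is assumed here; the function name and the four toy patients are hypothetical and unrelated to the study's cohorts.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times:  observed follow-up times
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk scores, where a higher score means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the third patient is censored at t = 12.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of survival times by predicted risk, which gives a scale for reading the reported range.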

Abstract:
Building an effective object detector usually depends on a large set of well-annotated training samples. However, annotating such a dataset is extremely laborious and costly, since box-level supervision, which contains both an accurate classification category and localization coordinates, is required. Compared to box-level supervised annotation, weakly supervised learning approaches (e.g., category, point, and scribble) incur a relatively lower annotation cost and provide a feasible way to mitigate the reliance on such datasets. Because of the lack of sufficient supervised information, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, the Segment Anything Model (SAM) has appeared as a task-agnostic foundation model and shown promising performance improvements in many related works due to its powerful generalization and data processing abilities. The properties of SAM inspire us to adopt this foundation model in the weakly supervised object detection field to compensate for the deficiencies in supervised information. However, directly deploying SAM on the weakly supervised object detection task raises two issues. First, SAM needs meticulously designed prompts, and such expert-level prompts restrict its applicability and practicality. Second, SAM is a category-agnostic model and cannot assign category labels to the generated predictions. To solve these issues, we propose WS-SAM, which generalizes SAM to weakly supervised object detection with category labels. Specifically, we design an adaptive prompt generator to take full advantage of the spatial and semantic information from the prompt. It operates in a self-prompting manner, taking the output of SAM from the previous iteration as the prompt input to guide the next iteration, where the prompts can be adaptively generated based on the classification activation map. We also develop a segmentation mask refinement module and formulate the label assignment process as a shortest path optimization problem by considering the similarity between each location and the prompts. Furthermore, a bidirectional adapter is implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate the effectiveness of our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experimental results show that our proposed method achieves clear improvement over state-of-the-art methods.

Abstract:
Inconsistent responses of X-ray detector elements lead to stripe artifacts within the sinogram data, which subsequently manifest as ring artifacts in the reconstructed computed tomography (CT) images, severely degrading image quality. This paper presents a novel method for correcting stripe artifacts in the sinogram data by separating the sinogram into an Ideal Sinogram (IS) and Stripe Artifacts (SA), with both components parameterized through Implicit Neural Representations (INR). The proposed method leverages INR to correct defective pixel response values using implicit continuous functions while simultaneously learning stripe features in the angular direction of the sinogram data. These two components, IS and SA, are combined within an optimization constraint framework, achieving unsupervised iterative correction of stripe artifacts in the projection domain. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art techniques in effectively removing ring artifacts while maintaining the clarity and fidelity of CT images, thereby enhancing the overall diagnostic quality of CT imaging.

Affiliations: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China; College of Electronics and Information Engineering, Tongji University, Shanghai, China; College of Electronics and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai, China

Abstract:
There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A key challenge in this field is achieving robust and accurate depth estimation in regions with weak textures or where dynamic objects are present. This study makes three major contributions by delving deeply into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novel contribution is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences based on estimated ego-motion to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novel contribution arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is therefore designed to refine depth estimation with a specific emphasis on local variations. The third novel contribution is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid and optical flows, encouraging the former towards more accurate correspondence and making the latter more adaptable across various scenarios under the static scene hypothesis. DCPI-Depth, a framework that incorporates all these innovative components and couples two bidirectional and collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming existing prior art. Specifically, it demonstrates accurate depth estimation in texture-less and dynamic regions, and shows more reasonable smoothness. Our source code is publicly available at https://mias.group/DCPI-Depth.

Abstract:
The advent of neuromorphic spike cameras has garnered significant attention for their ability to capture continuous motion with unparalleled temporal resolution. However, this imaging attribute necessitates considerable resources for binary spike data storage and transmission. With both compression and spike-driven intelligent applications in mind, we present the notion of Spike Coding for Intelligence (SCI), wherein spike sequences are compressed and optimized for both bit-rate and task performance. Drawing inspiration from the mammalian vision system, we propose a dual-pathway architecture for separate processing of spatial semantics and motion information, which are then merged to produce features for compression. A refinement scheme is also introduced to ensure consistency between decoded features and motion vectors. We further propose a temporal regression approach that integrates various motion dynamics, capitalizing on the advancements in warping and deformation simultaneously. Comprehensive experiments demonstrate that our scheme achieves state-of-the-art (SOTA) performance for spike compression and analysis. We achieve an average 17.25% BD-rate reduction compared to SOTA codecs and a 4.3% accuracy improvement over SpiReco for spike-based classification, with 88.26% complexity reduction and 42.41% inference time saving on the encoding side.

Abstract:
Despite substantial efforts dedicated to the design of heuristic models for omnidirectional (i.e., 360°) image quality assessment (OIQA), a conspicuous gap remains due to the lack of consideration for the diversity of viewing behaviors that leads to the varying perceptual quality of 360° images. Two critical aspects underlie this oversight: the neglect of viewing conditions that significantly sway user gaze patterns and the overreliance on a single viewport sequence from the 360° image for quality inference. To address these issues, we introduce a unique generative scanpath representation (GSR) for effective quality inference of 360° images, which aggregates varied perceptual experiences of multi-hypothesis users under a predefined viewing condition. More specifically, given a viewing condition characterized by the starting point of viewing and exploration time, a set of scanpaths consisting of dynamic visual fixations can be produced using an apt scanpath generator. In this vein, we use the scanpaths to convert the 360° image into the unique GSR, which provides a global overview of gaze-focused content derived from scanpaths. As such, the quality inference of the 360° image is transformed into that of the GSR. We then propose an efficient OIQA computational framework by learning the quality maps of the GSR. Comprehensive experimental results validate that the predictions of the proposed framework are highly consistent with human perception in the spatiotemporal domain, especially in the challenging context of locally distorted 360° images under varied viewing conditions. The code will be released at https://github.com/xiangjieSui/GSR

Abstract:
Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the usage of task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can adversely affect the sharing of task-relevant information. In this paper, we propose a novel VPT approach, SVPT. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, SVPT introduces an attentive enhancement (AE) mechanism that automatically identifies salient image tokens and refines them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantages of the proposed SVPT, compared to the state-of-the-art counterparts.

Abstract:
Cross-domain image segmentation plays a crucial role in the field of remote sensing. Current approaches often rely on a mean-teacher model, aggregated from successive student models, to guide the training of the student model itself. However, the feature space of the mean-teacher model exhibits significant domain discrepancy and considerable class overlap, which results in suboptimal performance. Motivated by the idea of learning from stronger teachers, we introduce a robust domain adaptation method called LFMDA. This novel approach is the first to explicitly enhance cross-domain semantic segmentation performance by leveraging vision foundation models (VFMs) within remote sensing applications. Specifically, we propose a prototypical contrastive knowledge distillation loss (PCD) that enables the student model to produce domain-invariant yet category-discriminative features by distilling knowledge from a domain-generalized VFM teacher. Additionally, we introduce a local region homogenization strategy (LRH) to generate high-quality and high-quantity pseudo-labels by incorporating a Segment Anything Model (SAM). Extensive empirical evaluations demonstrate that our method outperforms existing approaches, setting a new state of the art (SOTA) in domain-adaptive remote sensing image segmentation. The code is available at https://github.com/StuLiu/LFMDA
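For context on the baseline this abstract criticizes: in the standard mean-teacher formulation, the teacher's weights are an exponential moving average (EMA) of the student's weights. The sketch below illustrates that baseline only, not the LFMDA method. The momentum value and the flat dictionary of scalar weights are assumptions made for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return updated teacher weights as an EMA of the student weights.

    teacher, student : dicts mapping parameter names to float values
    momentum         : assumed smoothing factor (not given in the abstract)
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# The teacher drifts slowly towards the student over successive steps.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

Because the teacher is only a smoothed copy of the student, it inherits the student's feature space, which is consistent with the abstract's motivation for distilling from a stronger, independently trained VFM teacher instead.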

Abstract:
Joint multi-modal image registration and fusion (JMIRF) typically follows a register-first, fuse-later paradigm. It has a registration module to align parallax images and a fusion module to fuse registered images. Existing research typically focuses on the mutual enhancement between the two modules, but this is essentially a straightforward combination rather than an efficient, unified network. Moreover, executing the two modules separately may cause inefficiency, as the total runtime is merely the sum of both steps without investigating potential shared structures. In this paper, we propose an Adaptive Unified Network (AU-Net) following a novel end-to-end paradigm called Feature-Level Joint Training (FLJT). Firstly, AU-Net learns registration and fusion within a unified network through shared structure and hierarchical semantic interaction. A multi-level dynamic fusion module is designed to adaptively fuse input features from different scales and modalities. Secondly, image-to-image translation based on Denoising Diffusion Probabilistic Models (DDPMs) is introduced to train AU-Net using simple and reliable single-modal metrics. Unlike previous unidirectional translation, we explore bidirectional translation to provide additional implicit branch supervision. Furthermore, a cache-like scheme is proposed to elegantly circumvent the additional computational overhead caused by the iterative denoising of DDPMs. Finally, our method was validated on two publicly available datasets, demonstrating advantages over state-of-the-art methods in terms of qualitative evaluation, quantitative evaluation, and computational complexity analysis. The code will be publicly available at https://github.com/luming1314/AU-Net

Affiliations: Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, School of Information Science and Technology, Beijing University of Technology, Beijing, China; Beijing Institute of Artificial Intelligence, School of Computer Science, Beijing University of Technology, Beijing, China; MoE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, China; Department of Computer Science and Engineering, Santa Clara University, Santa Clara, CA, USA

Abstract:
Emerging virtual reality (VR) applications bring significant challenges to spherical image compression. Spherical images are first converted into planar images using projections like the equirectangular projection (ERP) to facilitate compression. Methods based on deep neural networks (DNNs) have achieved optimal rate-distortion (R-D) performance in planar image compression. However, the non-uniform sampling of ERP makes the R-D optimization process inefficient when using DNN-based planar compression methods. To address this problem, we propose spherical DNNs for learning-based spherical image compression using the uniform sampling and ordered rooted tree based index of the Spherical Measure-Based Spherical Image Representation (SMSIR). Specifically, we first define basic spherical operations under the ordered rooted tree based index, including spherical convolution and window transformer, to exploit local and non-local correlations on the sphere, respectively. We then construct a spherical convolution and self-attention integrated transformer module named SMixFormer, which simultaneously considers both the enlargement of the receptive fields of local windows and the capture of local and non-local correlations. Furthermore, we introduce a spherical transformer context model with an ordering following the ordered rooted tree based index to enhance the accuracy of the entropy model. To optimize our model, we collect a high-resolution and high-quality spherical image dataset from the Internet. Experimental results demonstrate that our approach outperforms traditional image compression standards, including JPEG, JPEG2000, and BPG. Compared to the learning-based hyperprior planar image compression model, our method achieves a bitrate reduction of over 16%.

Abstract:
Template generation is a critical step in groupwise image registration, which involves aligning a group of subjects into a common space. While existing methods can generate high-quality template images, they often incur substantial time costs or are limited by fixed group scales. In this paper, we present InstantGroup, an efficient groupwise template generation framework based on variational autoencoder (VAE) models that leverage latent representations’ arithmetic properties, enabling scalability to groups of any size. InstantGroup features a Dual VAE backbone with shared-weight twin networks to handle pairs of inputs and incorporates a Displacement Inversion Module (DIM) to maintain template unbiasedness and a Subject-Template Alignment Module (STAM) to improve template quality and registration accuracy. Experiments on 3D brain MRI scans from the OASIS and ADNI datasets reveal that InstantGroup dramatically reduces runtime, generating templates within seconds for various group sizes while maintaining superior performance compared to state-of-the-art baselines on quantitative metrics, including unbiasedness and registration accuracy.

Abstract:
In this paper, we introduce cognitive contour, a novel image attribute that encapsulates the global shape perceived from sparsely distributed, identical or similar objects—such as drone swarms or flocks of geese—collectively termed sparse-structured objects. Unlike traditional contour analysis that delineates the boundaries of individual objects, cognitive contours reflect a gestalt-inspired perception of the overall structure formed by the ensemble, capturing higher-level visual organization. Detecting cognitive contours is challenging due to the sparsity and multiplicity of constituent elements. To tackle this, we propose a scale-space method that integrates alpha shapes into a scale-space framework. An alpha-shape scale space is constructed for the sparse-structured object, and the optimal scale is adaptively selected to extract cognitively meaningful contours with appropriate structural detail. Extensive experiments validate the effectiveness and robustness of the proposed method, enhancing visual inference and offering flexibility across diverse image-based applications. Code and data are available at: https://github.com/CookiC/Sparse

Abstract:
Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions were collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and that CBAND is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.

Affiliations: School of Computer Science, School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), and the Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, China; Institute of Data and Intelligence, Beijing, China; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an, China; Institute of Artificial Intelligence (TeleAI) of China Telecom, Shanghai, China

Abstract:
As is well known, sparse subspace learning can provide good input for spectral clustering, thereby producing high-quality cluster partitioning. However, it employs the complete set of samples as the dictionary for representation learning, resulting in non-negligible computational costs. Therefore, replacing the complete samples with representative ones (anchors) as the dictionary has become a more popular choice, giving rise to a series of related works. Unfortunately, although these works are linear with respect to the number of samples, they are often quadratic or even cubic with respect to the number of anchors. In this paper, we exploit the properties of the original scalable subspace clustering problem to derive a simpler problem that replaces it. This new problem is linear with respect to both the number of samples and the number of anchors, further enhancing scalability and providing more efficient operations. Furthermore, thanks to the new problem formulation, we can adopt a separate fusion strategy for multi-view extensions. This strategy can better measure the inter-view difference and avoid alternating optimization, so as to achieve more robust and efficient multi-view clustering. Finally, comprehensive experiments demonstrate that our methods not only significantly reduce time overhead but also exhibit superior performance.

Abstract:
Depth completion is a pivotal challenge in computer vision, aiming at reconstructing a dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.

Abstract:
The design of effective multimodal feature fusion strategies is the key task for multimodal learning, which often requires huge computational costs with extensive expertise. In this paper, we seek to enhance multimodal learning via hierarchical fusion architecture search with inconsistency mitigation. Different from previous works, our Hierarchical Fusion Multimodal Neural Architecture Search (HF-MNAS) considers the inconsistency in modalities and labels, and fine-grained exploitation in multi-level fusion architectures. Specifically, it disentangles the hierarchical fusion problem into two-level (macro- and micro-level) search spaces. In the macro-level search space, the high-level and low-level features are extracted and then connected in a fine-grained way, where the inconsistency mitigation module is designed to minimize discrepancies between modalities and labels in cell outputs. In the micro-level search space, we find that different intermediate nodes in the cells exhibit different importance degrees. Then, we propose an importance-based node selection mechanism to form the optimal cells for feature fusion. We evaluate HF-MNAS on a series of multimodal classification tasks. Empirical evidence shows that HF-MNAS achieves competitive trade-off performance across accuracy, search time, and inference speed. In particular, HF-MNAS consumes minimal computational cost compared with state-of-the-art MNASs. Furthermore, we theoretically and experimentally verify that the modality-label inconsistency deteriorates the overall fusion performance of models such as accuracy and F1 score, and demonstrate that the proposed inconsistency mitigation module could effectively mitigate this phenomenon.

Abstract:
With the continuous expansion of intelligent surveillance networks, lifelong person re-identification (LReID) has received widespread attention, driven by the need for self-evolution across different domains. However, existing LReID studies accumulate knowledge under the assumption that people do not change their clothes. In this paper, we propose a more practical task, namely lifelong person re-identification with hybrid clothing states (LReID-Hybrid), which takes a series of cloth-changing and same-cloth domains into account during lifelong learning. To tackle the challenges of knowledge granularity mismatch and knowledge presentation mismatch in LReID-Hybrid, we take advantage of the consistency and generalization capabilities of the text space, and propose a novel framework, dubbed Teata, to effectively align, transfer, and accumulate knowledge in an “image-text-image” closed loop. Concretely, to achieve effective knowledge transfer, we design a Structured Semantic Prompt (SSP) learning scheme that decomposes the text prompt into several structured pairs to distill knowledge from the image space with a unified granularity of text description. Then, we introduce a Knowledge Adaptation and Projection (KAP) strategy, which tunes text knowledge via a slow-paced learner to adapt to different tasks without catastrophic forgetting. Extensive experiments demonstrate the superiority of Teata over advanced methods on LReID-Hybrid as well as on conventional LReID benchmarks.

Abstract:
Due to the limited output categories, semi-supervised salient object detection faces challenges in adapting conventional semi-supervised strategies. To address this limitation, we propose a multi-branch architecture that extracts complementary features from labeled data. Specifically, we introduce TripleNet, a three-branch network architecture designed for contour, content, and holistic saliency prediction. The supervision signals for the contour and content branches are derived by decomposing the limited ground truths. After training on the labeled data, the model produces pseudo-labels for unlabeled images, including contour, content, and salient objects. By leveraging the complementarity between the contour and content branches, we construct coupled pseudo-saliency labels by integrating the pseudo-contour and pseudo-content labels, which differ from the model-inferred pseudo-saliency labels. We further develop an enhanced pseudo-labeling mechanism that generates enhanced pseudo-saliency labels by combining reliable regions from both pseudo-saliency labels. Moreover, we incorporate a partial binary cross-entropy loss function to guide the learning of the saliency branch to focus on effective regions within the enhanced pseudo-saliency labels, which are identified through our adaptive thresholding approach. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance using only 329 labeled training images.
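The partial binary cross-entropy loss mentioned above restricts supervision to a subset of pixels. A minimal sketch of that idea follows, assuming the loss is an ordinary BCE averaged over the pixels selected by a binary mask. How the paper's adaptive thresholding builds that mask is not described in the abstract, so the mask is simply taken as an input here, and the function name is hypothetical.

```python
import math


def partial_bce(pred, target, mask, eps=1e-7):
    """Binary cross-entropy averaged over masked-in pixels only (sketch).

    pred   : predicted saliency probabilities in [0, 1]
    target : pseudo-label values in {0, 1}
    mask   : 1 where the pseudo-label is treated as reliable, else 0
             (assumed to be supplied by some thresholding step)
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # unreliable pixels contribute no gradient signal
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


# Only the first pixel is masked in, so the loss equals -log(0.5).
print(partial_bce([0.5, 0.9, 0.1], [1, 1, 1], [1, 0, 0]))
```

Inputs are flat lists for simplicity; a real implementation would operate on image tensors.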

Abstract:
3D imaging based on phase-shifting structured light is widely used in industrial measurement due to its non-contact nature. However, it typically requires a large number of additional images (multi-frequency heterodyne (M-FH) method) or introduces intensity features that compromise accuracy (space domain modulation phase-shifting (SDM-PS) method) for phase unwrapping, and it remains sensitive to motion. To overcome these issues, this article proposes a nonlinear phase coding-based stereo phase unwrapping (NPC-SPU) method that requires no additional patterns while maintaining measurement accuracy. In the encoding stage, a novel nonlinear distortion feature is introduced, while the signal-to-noise ratio of the phase codeword is preserved. In the decoding stage, a local phase unwrapping method that does not require additional auxiliary information is first proposed, closely associating the distortion information in the local wrapped phase. Then, a pre-calibrated stereo constraint system is used to filter potential matching phases, significantly reducing phase ambiguity and computational costs. Finally, to avoid the time-consuming and complex intensity kernel matching used in traditional methods, we propose a local phase correlation matching (LPCM) technique that enables lightweight and robust phase unwrapping. Experimental results demonstrate that this algorithm significantly enhances 3D reconstruction performance in scenarios with large depth, large disparity, complex colored structures, and dynamic scenes. Specifically, in dynamic environments (20 mm/s), the proposed method achieves a lower measurement error rate (0.7829% vs. 6.4962%) with only 3 patterns, compared to the traditional three-frequency heterodyne (T-FH) method (using 9 patterns). Additionally, its measurement accuracy outperforms the advanced SDM-PS method, which also uses 3 patterns (0.1102 mm vs. 0.3232 mm).
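To illustrate why phase unwrapping is needed at all, the sketch below computes the wrapped phase of a single pixel with the textbook N-step phase-shifting formula, assuming fringe intensities of the form I_k = A + B cos(phi + 2*pi*k/N). This is the conventional baseline, not the NPC-SPU coding proposed in the paper. The function names and the values of A and B are illustrative assumptions.

```python
import math


def fringe_intensities(phi, n=3, a=0.5, b=0.4):
    """Simulate n phase-shifted intensities at one pixel (assumed model)."""
    return [a + b * math.cos(phi + 2.0 * math.pi * k / n) for k in range(n)]


def wrapped_phase(intensities):
    """Recover the phase in (-pi, pi] from n equally shifted intensities."""
    n = len(intensities)
    s = sum(i * math.sin(2.0 * math.pi * k / n) for k, i in enumerate(intensities))
    c = sum(i * math.cos(2.0 * math.pi * k / n) for k, i in enumerate(intensities))
    # The sine sum equals -(n/2) * B * sin(phi) and the cosine sum equals
    # (n/2) * B * cos(phi), so the sign of s is flipped before atan2.
    return math.atan2(-s, c)


# Two absolute phases that differ by one full fringe period (2*pi)
# yield the same wrapped value, which is the ambiguity unwrapping resolves.
print(wrapped_phase(fringe_intensities(1.0)))
print(wrapped_phase(fringe_intensities(1.0 + 2.0 * math.pi)))
```

With three patterns the wrapped phase is fully determined, but the fringe order is not, which is why conventional methods project extra patterns and why a method that needs none is attractive for moving scenes.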

Abstract:
Infrared and visible image alignment is essential to fusion and multi-modal perception applications. It addresses discrepancies in position and scale caused by spectral properties and environmental variations, ensuring precise pixel correspondence and spatial consistency. Existing manual calibration requires regular maintenance and exhibits poor portability, limiting the adaptability of multi-modal applications in dynamic environments. In this paper, we propose a harmonized-representation-based method for infrared and visible image alignment, achieving both high accuracy and scene adaptability. Specifically, with regard to the disparity between multi-modal images, we develop an invertible translation process to establish a harmonized representation domain that effectively encapsulates the feature intensity and distribution of both infrared and visible modalities. Building on this, we design a hierarchical framework to correct deformations inferred from the harmonized domain in a coarse-to-fine manner. Our framework leverages advanced perception capabilities alongside residual estimation to enable accurate regression of sparse offsets, while an alternate correlation search mechanism ensures precise correspondence matching. Furthermore, we propose the first misaligned infrared and visible image benchmark with ground truth available for evaluation. Extensive experiments validate the effectiveness of the proposed method against the state of the art, further advancing subsequent applications. Code and dataset are available at https://github.com/Jzy2017/HR4IR

Abstract:
Dynamic functional brain networks (DFBNs) can flexibly describe the time-varying topological connectivity patterns of the brain, and show great potential in brain disease diagnosis. However, most of the existing DFBN analysis methods focus on capturing the dynamic interaction at the brain region level, ignoring the spatio-temporal topological evolution across time windows. Moreover, they struggle to suppress interfering connections in DFBNs, which leads to a diminished capacity for discerning the intrinsic structures that are intimately linked to brain disorders. To address these issues, we propose a topological evolution graph learning model to capture disease-related spatio-temporal topological features in DFBNs. Specifically, we first take the hubness of adjacent DFBNs as the source domain and the target domain in turn, and then use Wasserstein distance (WD) and Gromov-Wasserstein distance (GWD) to capture the brain’s evolution law at the node and edge levels, respectively. Furthermore, we introduce the principle of relevant information to guide the topological evolution graph to learn the structures that are most relevant to brain diseases while retaining the least redundant information between adjacent DFBNs. On this basis, we develop a high-order spatio-temporal model with multi-hop graph convolution to collaboratively extract long-range spatial and temporal dependencies from the topological evolution graph. Extensive experiments show that the proposed method outperforms the current state-of-the-art methods, and can effectively reveal the information evolution mechanism between brain regions across windows.

Abstract:
Modern end-to-end image signal processors (ISPs) can learn complex mappings from RAW/XYZ data to sRGB (and vice versa), opening new possibilities in image processing. However, the growing diversity of camera models, particularly in mobile devices, renders the development of individual ISPs unsustainable due to their limited versatility and adaptability across varied camera systems. In this paper, we introduce Uni-ISP, a novel pipeline that unifies ISP learning for diverse mobile cameras, delivering a highly accurate and adaptable processor. The core of Uni-ISP is leveraging device-aware embeddings through learning forward/inverse ISPs and its special training scheme. By doing so, Uni-ISP not only improves the performance of forward and inverse ISPs but also unlocks new applications previously inaccessible to conventional learned ISPs. To support this work, we construct a real-world 4K dataset, FiveCam, comprising more than 2,400 pairs of sRGB-RAW images captured synchronously by five smartphone cameras. Extensive experiments validate Uni-ISP’s accuracy in learning forward and inverse ISPs (with improvements of +2.4dB/1.5dB PSNR), versatility in enabling new applications, and adaptability to new camera models.

Abstract:
The present work sought to instil metrology in existing hyperspectral texture feature extraction methods. Specifically, we propose distance-based expressions of the gray-level co-occurrence matrix (GLCM), local binary pattern (LBP), and Gabor filtering that are directly computable for hyperspectral images without any pre- or post-processing. At the core of our proposition is the Radical of Extended Mean Information for Discrimination (REID), a novel spectral distance with information-theoretic roots. Respecting the physics of the spectrum as a continuous function of wavelength, REID is mathematically decomposable into spectral direction and spectral magnitude distances. The resulting feature calculations are fullband (utilizing all wavelengths), yet lightweight and fully interpretable. A similarity measure based on information theory is also justified. Their efficiency is demonstrated in the context of texture classification, content-based image retrieval, and cancer detection, in which they consistently outperform existing computations based on dimensionally reduced spaces using PCA, ICA, and NMF. The propositions could potentially be integrated into machine/deep learning systems towards explainable AI (XAI).

Abstract:
Point cloud registration, which estimates a rigid transformation matrix between two point clouds, is a fundamental process in numerous applications. While existing detector-free techniques present exceptional performance, they overlook the extraction of hybrid local features that capture correlations between points and their neighbours, thereby limiting the quality of point cloud recognition. Moreover, these approaches typically treat point clouds as sequential data and employ the transformer to integrate global context from all points, which inevitably introduces interference from irrelevant regions, hence affecting the registration accuracy. In this work, we propose a novel detector-free approach AGHL to address these challenges. For the first issue, AGHL introduces a hybrid local feature perception module that designs two parallel branches to concurrently extract low-level and high-level local features, which effectively encode the correlations between each point and its neighborhood points in both Euclidean space and high-dimensional feature space. For the second issue, AGHL develops an anchor-guided cross attention that adheres to the local geometric consistency to constrain the network’s attention on reliable anchors, thereby effectively suppressing interference from irrelevant regions. Benefiting from these techniques, AGHL achieves impressive point cloud registration accuracy across all synthetic, indoor, and outdoor datasets. Furthermore, we build an experimental platform and conduct a real-world robot localization experiment, with results showing the strong generalization ability of AGHL.

Abstract:
Recently, masked image modeling (MIM), which learns visual representations by reconstructing the masked patches of an image, has become a popular self-supervised paradigm. However, MIM pre-training is time-consuming due to large-scale data and large backbones. We mainly attribute this to the random patch masking in previous MIM works, which fails to leverage the crucial semantic information for effective visual representation learning. To tackle this issue, we propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can detect semantic patches and reduce the number of training patches to boost model performance and training efficiency simultaneously. Specifically, FAMT utilizes the self-attention mechanism to extract semantic information from the image for masking during training in an unsupervised manner. However, attention alone can sometimes focus on areas that are inappropriate with respect to the semantic information. Thus, we are motivated to incorporate information from the frequency domain into the self-attention mechanism to derive the sampling weights for masking, which captures semantic patches for visual representation learning. Furthermore, we introduce a patch throwing strategy based on the derived sampling weights to reduce the training cost. FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works, e.g., reducing the training time by nearly 50% and improving the linear probing accuracy of MAE by 1.8% to 6.3% across various datasets, including CIFAR-10/100, Tiny ImageNet, and ImageNet-1K. FAMT also demonstrates superior performance in downstream detection and segmentation tasks.

Abstract:
The widespread application of 3D human pose estimation (HPE) is limited by resource-constrained edge devices like Jetson Nano, requiring more efficient models. A key approach to enhancing efficiency involves designing networks based on the structural characteristics of input data. However, effectively utilizing the structural priors in human skeletal inputs remains challenging. To address this, we leverage both explicit and implicit spatio-temporal priors of the human body through innovative model design and a pre-training proxy task. First, we propose a Nano Human Topology Network (NanoHTNet), a tiny 3D HPE network with stacked Hierarchical Mixers to capture explicit features. Specifically, the spatial Hierarchical Mixer efficiently learns the human physical topology across multiple semantic levels, while the temporal Hierarchical Mixer with discrete cosine transform and low-pass filtering captures local instantaneous movements and global action coherence. Moreover, Efficient Temporal-Spatial Tokenization (ETST) is introduced to enhance spatio-temporal interaction and reduce computational complexity significantly. Second, PoseCLR is proposed as a general pre-training method based on contrastive learning for 3D HPE, aimed at extracting implicit representations of human topology. By aligning 2D poses from diverse viewpoints in the proxy task, PoseCLR aids 3D HPE encoders like NanoHTNet in more effectively capturing the high-dimensional features of the human body, leading to further performance improvements. Extensive experiments verify that NanoHTNet with PoseCLR outperforms other state-of-the-art methods in efficiency, making it ideal for deployment on edge devices like the Jetson Nano. Code and models are available at https://github.com/vefalun/NanoHTNet

Affiliations: Department of BioMedical Research (DBMR), University of Bern, Bern, Switzerland; Department of Neurosurgery, Inselspital, Bern, Switzerland; Institute of Tissue Medicine and Pathology and the Graduate School for Cellular and Biomedical Sciences, University of Bern, Bern, Switzerland; LPICM, CNRS, École Polytechnique, Palaiseau, Paris, France; Institute of Pathology, Lausanne University Hospital, University of Lausanne, Lausanne, Switzerland; Institute of Tissue Medicine and Pathology, University of Bern, Bern, Switzerland; Support Center for Advanced Neuroimaging (SCAN), University Institute of Diagnostic and Interventional Neuroradiology, University of Bern, Bern, Switzerland

Abstract:
Mueller matrix polarimetry captures essential information about polarized light interactions with a sample, presenting unique challenges for data augmentation in deep learning due to its distinct structure. While augmentations are an effective and affordable way to enhance dataset diversity and reduce overfitting, standard transformations like rotations and flips do not preserve the polarization properties in Mueller matrix images. To this end, we introduce a versatile simulation framework that applies physically consistent rotations and flips to Mueller matrices, tailored to maintain polarization fidelity. Our experimental results across multiple datasets reveal that conventional augmentations can lead to falsified results when applied to polarimetric data, underscoring the necessity of our physics-based approach. In our experiments, we first compare our polarization-specific augmentations against real-world captures to validate their physical consistency. We then apply these augmentations in a semantic segmentation task, achieving substantial improvements in model generalization and performance. This study underscores the necessity of physics-informed data augmentation for polarimetric imaging in deep learning (DL), paving the way for broader adoption and more robust applications across diverse research in the field. In particular, our framework unlocks the potential of DL models for polarimetric datasets with limited sample sizes. Our code implementation is available at github.com/hahnec/polar_augment
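
To make concrete why a plain image rotation is not enough for polarimetric data, the sketch below applies the standard rotation law for a single Mueller matrix, M' = R(-θ) M R(θ), where R is the 4x4 Stokes rotator. It is a minimal sketch and not the code from the repository above: the sign convention of the rotator is an assumption (conventions differ between texts), the helper names are hypothetical, and only the per-pixel matrix transform is shown, not the accompanying spatial rotation of the image grid.

```python
import math


def rotator(theta):
    """4x4 Stokes rotation matrix for an angle theta in radians (assumed sign convention)."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return [
        [1, 0, 0, 0],
        [0, c, s, 0],
        [0, -s, c, 0],
        [0, 0, 0, 1],
    ]


def matmul4(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rotate_mueller(m, theta):
    """Mueller matrix of the same optical element rotated by theta."""
    return matmul4(matmul4(rotator(-theta), m), rotator(theta))


# Ideal horizontal linear polarizer.
horizontal = [
    [0.5, 0.5, 0, 0],
    [0.5, 0.5, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Rotating the element by 90 degrees should give a vertical polarizer.
vertical = rotate_mueller(horizontal, math.pi / 2)
```

In this example the off-diagonal entries change sign under a 90-degree rotation, which a purely spatial rotation of the image would not reproduce.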

Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; School of Computer Science, the National Engineering Research Center for Multimedia Software, the Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China; Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan; School of Computer Science and Technology, Anhui University, Hefei, China; Department of Computer Science, Stanford University, Stanford, CA, USA

Abstract:
Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Serving as an unbiased anchor, the global prototype guides the rectification of adversarial pixel samples. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements, achieving up to 7.37%, 7.46%, and 6.56% IoU improvements on the WHU-CD, LEVIR-CD, and DSIFN-CD datasets. 
Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
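
The online global prototype described in phase b) can be pictured as an exponentially weighted moving average over per-batch feature means. The sketch below is a minimal illustration under stated assumptions, not the released AdvCP code: the momentum value, the use of a simple batch mean, and the function name `update_prototype` are all hypothetical.

```python
def update_prototype(prototype, batch_features, momentum=0.9):
    """Blend the running prototype with the mean feature of the current batch.

    prototype: list of floats, or None before the first update.
    batch_features: non-empty list of equal-length feature vectors.
    """
    dim = len(batch_features[0])
    batch_mean = [
        sum(feature[d] for feature in batch_features) / len(batch_features)
        for d in range(dim)
    ]
    if prototype is None:
        return batch_mean  # first batch initialises the prototype
    return [momentum * p + (1.0 - momentum) * b
            for p, b in zip(prototype, batch_mean)]


# Two toy updates with hypothetical 2-D features.
proto = update_prototype(None, [[1.0, 3.0], [3.0, 5.0]])
proto = update_prototype(proto, [[0.0, 0.0]], momentum=0.5)
```

Because each update keeps a fraction of the previous prototype, older batches still contribute, with geometrically decaying weight.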

Affiliations: Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; Yantai Yuhuangding Hospital, Qingdao University, Yantai, Shandong, China; School of Science and Engineering, University of Dundee, Dundee, U.K.; Linyi People’s Hospital Affiliated to Shandong Second Medical University, Linyi, China; BenQ Medical Center, Affiliated BenQ Hospital of Nanjing Medical University, Nanjing, Jiangsu, China; Department of Otorhinolaryngology, Head and Neck Surgery, Yantai Yuhuangding Hospital, Qingdao University, Yantai, China

Abstract:
While accurate and automatic Laryngeal Neoplasm Segmentation (LNS) can benefit the diagnosis and prevention of laryngeal cancers, existing LNS-related works are very limited due to the lack of public datasets. This paper conducts systematic research to take the field a step further. Firstly, we create a multicenter LNS dataset named MLN-Seg. Collected from four hospitals, it contains 2,273 laryngeal images with diverse resolutions and modalities, where each image is pixel-wise annotated by experienced physicians. Secondly, considering the scarcity of LNS methods and the similarity between LNS and Colorectal Polyp Segmentation (CPS) tasks, we collect 15 CPS methods and validate their performance on MLN-Seg. The results show that, despite the similarity between the two tasks, existing CPS methods underperform on LNS, especially on neoplasms with blurry boundaries and camouflaged characteristics. Lastly, considering the LNS challenges, we propose an effective segmentation method, termed Scale-Sensitive Network (S2Net). S2Net scales the feature at each layer of the network up and down and integrates all the scaled features to coarsely localize neoplasm regions. In addition, a Localization Calibration (LC) module is used to refine uncertain areas. By connecting the LC modules in a top-down manner, S2Net can accurately segment the laryngeal neoplasms. Extensive tests on MLN-Seg show that S2Net has better learning ability and generalizability than competing methods. In addition, evaluation on five public datasets shows that S2Net achieves comparable performance in the CPS task.

Abstract:
Although significant progress has recently been made in remote sensing image change captioning, existing methods fail to filter out areas unrelated to actual changes, making models susceptible to irrelevant features. In this article, we propose a novel Key Change Features-guided and Instruction-tuned (KCFI) multimodal model for remote sensing image change captioning. This model aims to fully leverage the intrinsic knowledge of large language models through visual instructions and to enhance the effectiveness and accuracy of change features using pixel-level change detection tasks. Specifically, KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, a pixel-level change detection decoder to constrain key change features, and an instruction-tuned decoder based on a large language model. Moreover, to ensure that the change captioning and change detection tasks are jointly optimized, we employ a dynamic weight-averaging strategy to balance the losses between the two tasks. We also explore various feature combinations for visual fine-tuning instructions and demonstrate that using only key change features to guide the large language model is the optimal choice. To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset, achieving the best performance. Our code will be available at https://github.com/yangcong356/KCFI.git
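
The abstract does not give the formula for its dynamic weight-averaging strategy. As an illustration only, the sketch below implements the Dynamic Weight Average rule that is widely used in multi-task learning, where each task weight follows the recent rate of change of its loss. Whether KCFI uses exactly this rule is an assumption; the temperature value and the name `dwa_weights` are hypothetical.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Task weights from the ratio of the last two epoch losses of each task.

    A task whose loss is decreasing more slowly (ratio closer to or above 1)
    receives a larger weight. The weights sum to the number of tasks.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]


# Hypothetical losses for (captioning, change detection) over two epochs.
weights = dwa_weights(prev_losses=[0.9, 0.5], prev_prev_losses=[1.0, 1.0])
# Combined loss for the next epoch would then be:
#   weights[0] * caption_loss + weights[1] * detection_loss
```

Here the captioning loss dropped by only 10% while the detection loss halved, so captioning receives the larger weight.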

Abstract:
Multi-view data often suffer from incompleteness owing to environmental factors and equipment failures. Incomplete Multi-View Clustering (IMVC), which aims to alleviate the adverse impacts of missing views and leverage inter-view complementary information to enhance clustering performance, has therefore become an important research focus. However, existing IMVC methodologies suffer from three critical limitations: 1) inadequate integration of cross-view learning and cross-instance learning; 2) lack of explicit modeling of the dynamic interactions between view-specific information and cross-view shared semantics; 3) inability to dynamically capture high-order topological correlations under view-missing conditions, leading to semantic misalignment among samples. To address these challenges, we propose CrossNet-VGA, an IMVC framework based on variational collaboration and graph attention fusion. Specifically, we formulate a novel multi-view evidence lower bound to explicitly separate view-specific latent variables and cross-view shared latent variables, and achieve inter-view semantic fusion by integrating variational distributions shared across views. Contrastive learning is employed to maximize mutual information and promote feature distribution uniformity, thereby achieving consistent representation learning. We employ dynamic k-nearest neighbor graph construction and multi-head graph attention mechanisms to capture deep inter-sample topological correlations, achieving robust structural alignment. Comprehensive experiments conducted on 6 public datasets demonstrate that CrossNet-VGA significantly outperforms the competing methods in both accuracy and robustness. The anonymous code of this work is available on GitHub at https://github.com/ggg2111/2025-TIP-CrossNet-VGA

Abstract:
Weather image translation aims to convert sunny images into diverse weather scenes, addressing the costly collection of multi-weather samples. Existing weather translation methods based on generative adversarial networks (GANs) suffer from limited generalization, often producing images lacking authenticity and diversity. In contrast, emerging diffusion-based models have surpassed GANs across various visual tasks. This work pioneers diffusion models for weather translation with a novel Instruction-driven Multi-Weather Translation method (InstructWT), built on the large image editing model InstructPix2Pix and its zero-shot generalization capacities. We develop a user-friendly instruction set via prompt engineering and introduce a weather intensity factor for precise control of weather effects, enhancing translation authenticity and diversity. A weather correlation-based blended editing scheme preserves the original scene layout, while physically based rendering of rain and snow further improves realism. Experiments on the public Cityscapes dataset demonstrate that InstructWT outperforms existing methods in authenticity and fidelity, achieving a Contrastive Language-Image Pre-Training (CLIP) image embedding cosine similarity of 0.8302 and a directional CLIP similarity of 0.1598. Furthermore, several semantic segmentation algorithms fine-tuned using InstructWT-augmented multi-weather datasets show significant performance gains under all complex weather conditions.
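
The two CLIP-based scores quoted in this abstract can be read as cosine similarities over embedding vectors. The sketch below shows the arithmetic on plain Python lists, assuming the embeddings have already been produced by some CLIP encoder (no model is loaded here). The directional score is written as it is commonly defined for instruction-based editing, i.e. the cosine between the change in image embeddings and the change in caption embeddings; the abstract itself does not spell out the definition, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_similarity(img_src, img_out):
    """How much of the source image content the edited image preserves."""
    return cosine(img_src, img_out)


def directional_similarity(img_src, img_out, txt_src, txt_out):
    """Agreement between the image change and the caption change."""
    img_delta = [o - s for s, o in zip(img_src, img_out)]
    txt_delta = [o - s for s, o in zip(txt_src, txt_out)]
    return cosine(img_delta, txt_delta)


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
img_src, img_out = [1.0, 0.0], [1.0, 1.0]
txt_src, txt_out = [2.0, 0.0], [2.0, 3.0]

preserved = image_similarity(img_src, img_out)
direction = directional_similarity(img_src, img_out, txt_src, txt_out)
```

A high image similarity indicates that the scene layout is preserved, while a high directional similarity indicates that the edit moved the image in the direction the caption change describes.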

Abstract:
Video-text retrieval aims to precisely search for videos most relevant to text queries within a video corpus. However, existing methods are largely limited to single-text (single-event) queries and are not effective at handling multi-text (multi-event) queries. Furthermore, these methods typically focus solely on retrieval and do not attempt to locate multiple events within the retrieved videos. To address these limitations, our paper proposes a novel method named Disentangling Inter- and Intra-Video Relations, which jointly addresses multi-event video-text retrieval and grounding. This method leverages both inter-video and intra-video event relationships to enhance retrieval and grounding performance. At the retrieval level, we devise a Relational Event-Centric Video-Text Retrieval module based on the principle that comprehensive textual information leads to precise correspondence between text and video. It incorporates event relationship features at different hierarchical levels and exploits the hierarchical structure of video relationships to achieve multi-level contrastive learning between events and videos. This approach enhances the richness, accuracy, and comprehensiveness of event descriptions, improving alignment precision between text and video and enabling effective differentiation among videos. For event grounding, we propose Event Contrast-Driven Video Grounding, which accounts for positional differences among events on the 2D temporal score map and achieves precise grounding of multiple events through divergence learning for their locations. Our solution not only provides efficient text-to-video retrieval but also accurately grounds events within the retrieved videos, addressing the shortcomings of existing methods. Extensive experimental results on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate the superior performance of our method, validating its effectiveness. 
The innovation of this research lies in introducing a new joint framework for video-text retrieval and multi-event grounding while offering new ideas for further research and applications in related fields. The code is available at https://github.com/X7J92/MVT-RG

Abstract:
Recent research in continual learning has primarily focused on unimodal tasks, with limited attention to multimodal tasks such as Composed Image Retrieval (CIR). In this paper, we establish a novel Continual CIR setting named C2IR to simulate the ever-changing retrieval demands in the real world. Using the C2IR setting, we identify two significant challenges: intra-task correspondence uncertainty, which hinders the model’s ability to manage noisy query-target pair correspondences; and inter-task drift uncertainty, which impedes the model’s consistent understanding of relationships, exacerbating catastrophic forgetting across continual tasks. To address these challenges, we propose a Dual Uncertainty-aware Correspondence Adapting and Retaining (U2CAR) framework for C2IR, which leverages uncertainty learning to acquire and consolidate composed correspondence. To ensure reliable composed correspondence inference in each task, we introduce an Uncertainty-based Correspondence Reasoning (UCR) module that estimates and refines the uncertainty in query-target correspondence. In addition, to mitigate catastrophic forgetting of previous tasks, we design an Uncertainty-guided Re-parameterization (URep) paradigm that consolidates valuable composed correspondence knowledge based on the uncertainty variance across various tasks. Extensive experimental results illustrate that our U2CAR significantly outperforms existing methods, demonstrating the robust adaptability and anti-forgetting capabilities of the proposed approach.

Abstract:
Open-set recognition (OSR) in hyperspectral imagery (HSI) focuses on accurately classifying known classes while effectively rejecting unknown negative samples. Most existing reconstruction-based approaches are susceptible to noise interference in the input images, and known classes can easily lead to inter-class confusion during the reconstruction process. Moreover, effectively utilizing the abundant spectral-spatial information in HSI within an open-set context presents significant challenges. To address these issues, we propose HyperCASR, an innovative framework for HSI OSR that integrates a grouped spectral-spatial retentive transformer (GSSRT) and a class-aware semantic reconstruction (CASR) module. This method begins by designing the GSSRT to extract features from HSI, enhancing the extraction capability of spatial-spectral information by introducing a grouped pixel embedding (GPE) module and a novel spatial retentive attention (SRA) mechanism. Subsequently, an independent autoencoder (AE) is assigned to each known class to reconstruct semantic features, which helps to mitigate noise interference and inter-class confusion. Additionally, by minimizing reconstruction errors to estimate class affiliation, the framework effectively identifies unknown classes. Experimental results across three benchmark datasets indicate that the HyperCASR framework significantly enhances classification performance for both known and unknown classes when compared to existing state-of-the-art methods. The code is available at https://github.com/B-Xi/TIP_2025_HyperCASR
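
The decision rule sketched in this abstract, one reconstructor per known class with the reconstruction error used to estimate class affiliation, can be illustrated as follows. This is a minimal sketch under stated assumptions and not the HyperCASR code: the trained per-class autoencoders are replaced by toy callables that return a fixed class centre, and the squared-error measure, the rejection threshold, and the names `classify_open_set` and `"unknown"` are hypothetical.

```python
def squared_error(x, y):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def classify_open_set(feature, reconstructors, threshold):
    """Assign the class whose reconstructor has the lowest error, or reject.

    reconstructors: dict mapping class label -> callable(feature) -> reconstruction.
    threshold: maximum reconstruction error accepted for a known class.
    """
    errors = {
        label: squared_error(feature, reconstruct(feature))
        for label, reconstruct in reconstructors.items()
    }
    best_label = min(errors, key=errors.get)
    return best_label if errors[best_label] <= threshold else "unknown"


# Toy stand-ins for trained per-class autoencoders (hypothetical).
reconstructors = {
    "water": lambda f: [0.0, 0.0],
    "soil": lambda f: [1.0, 1.0],
}

known = classify_open_set([0.1, 0.0], reconstructors, threshold=0.05)
rejected = classify_open_set([5.0, 5.0], reconstructors, threshold=0.05)
```

A sample that no class-specific reconstructor can reproduce well exceeds the threshold for every class and is rejected as unknown.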

Abstract:
Spike cameras have shown great potential in capturing ultra-high-speed motion scenes by mimicking the retinal fovea’s function, especially addressing the challenges of full-time imaging and high dynamic range in an energy-efficient fashion. Leveraging spike emission mechanisms, these cameras achieve extraordinary temporal resolutions on the order of thousands of frames per second, far surpassing traditional imaging devices. However, the resulting data, characterized by its large scale and sufficient temporal imaging nature, poses significant challenges for storage and transmission. In this paper, we propose an advanced lossless compression model for spike data by constructing a novel spike data representation scheme. We first introduce an efficient short-term aggregation method for spike sequences, paired with an intensity remapping technique to mitigate the effects of noise inherent in the spike sampling approach. In addition, we propose the Categorical Logit-based Entropy Model (CLEM), which quantitatively and precisely measures the required code length of the underlying representation to generate an implicit representation that models the unique statistical distribution of spike data. We leverage these findings to introduce a novel learned lossless spike compression model that significantly reduces the data rate while preserving full data fidelity. Extensive experimental results on PKU-Spike-Recon and additional real-world spike datasets demonstrate that our approach achieves state-of-the-art (SOTA) performance with competitive computational complexity. The proposed method illuminates a new path towards lossless compression without encoding the prediction residual for spike data coding.

Abstract:
In recent years, virtual immunohistochemical (IHC) staining, which converts hematoxylin and eosin (H&E) images into IHC images, has emerged as a promising technology in digital histopathology. Most existing methods rely on paired H&E and IHC patches extracted from adjacent tissue sections for supervised training. However, tissue misalignment and tissue loss between adjacent sections lead to inconsistent training pairs, limiting the models’ ability to produce accurate staining results. To address this issue, we propose ConCLR, a two-stage virtual IHC staining framework based on context-aware contrastive learning, designed to handle inconsistently paired patches. Our method is built on the assumption that for a given mini-patch in the H&E patch, there may exist a corresponding mini-patch in the reference IHC patch exhibiting a similar Pos/Neg pathological pattern. If such a mini-patch exists, it is typically located spatially close to the H&E mini-patch due to the local consistency of tissue structure. In the first stage, we leverage this assumption to design a similarity-guided mini-patch sampling (SGMS) module. For each mini-patch anchor in the staining results, SGMS searches within the real IHC patch to find the most similar mini-patch to serve as the positive sample for contrastive learning, enabling effective supervision despite mild tissue misalignment. In the second stage, we design a context-aware adaptive refinement module, which addresses significant inconsistencies between training pairs caused by potential tissue loss, by expanding the search range of positive samples to include neighboring patches. Extensive experiments on two network backbones across four virtual IHC staining tasks demonstrate the effectiveness of our ConCLR. Evaluations include qualitative and quantitative assessments of staining results, as well as downstream diagnostic performance. 
In addition to experiments on existing public datasets, we collected a PanCK-NSCLC dataset by acquiring H&E and pan-cytokeratin staining images from the same lung tissue sections via destaining and restaining. This dataset offers significantly improved tissue alignment compared to those derived from adjacent sections, with the aim of facilitating further progress in virtual IHC staining.

Abstract:
In this paper, we present a novel non-convex tensor completion model specifically tailored for multidimensional data. Our approach introduces a three-directional non-convex tensor rank surrogate regularized by the Minimax Concave Penalty (MCP) function. Crucially, the method processes data by simultaneously exploiting low-rank structures across its three modal directions, with the MCP function effectively mitigating the over-penalization of large singular values—a common drawback in convex nuclear norm minimization. To address the inherent challenges of this non-convex optimization, we develop an innovative approximate convex model that accurately captures the original formulation’s essence. We then develop a robust convex Alternating Direction Method of Multipliers (ADMM)-based algorithm, supported by a rigorous convergence guarantee, ensuring both theoretical soundness and practical reliability. Extensive experiments on a variety of real-world datasets demonstrate the superior performance and robustness of the proposed method compared to state-of-the-art approaches.
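
To illustrate why the Minimax Concave Penalty avoids over-penalizing large singular values, the sketch below evaluates the standard scalar MCP next to the L1 penalty that underlies the nuclear norm. It is a minimal sketch using the textbook definition of MCP with parameters `lam` and `gamma`; how the paper applies the penalty across the three modal directions of a tensor is not stated in the abstract and is not reproduced here, and the parameter values are hypothetical.

```python
def mcp(x, lam=1.0, gamma=2.0):
    """Standard scalar Minimax Concave Penalty.

    Grows like lam * |x| near zero, then flattens to the constant
    gamma * lam**2 / 2 once |x| exceeds gamma * lam.
    """
    a = abs(x)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0


def l1(x, lam=1.0):
    """L1 penalty, the per-singular-value cost in nuclear norm minimization."""
    return lam * abs(x)


# Hypothetical singular values of an unfolded tensor.
singular_values = [0.5, 1.0, 2.0, 10.0]
mcp_costs = [mcp(s) for s in singular_values]
l1_costs = [l1(s) for s in singular_values]
```

For the largest value the L1 cost keeps growing linearly while the MCP cost is capped, so dominant singular values are shrunk far less.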

Abstract:
In recent years, multi-view unsupervised feature selection has gained significant interest for its ability to efficiently handle multi-view datasets while offering better interpretability. Existing multi-view unsupervised feature selection methods construct graphs based on the relationships between samples. In feature selection, however, it is more important to focus on the relationships between features. Constructing a complete graph to capture the relationships between features would incur a space and time complexity of O(d^2) or even higher. Therefore, we introduce an anchor-based strategy and build a feature bipartite graph to reduce complexity. In addition, since existing methods cannot directly extract feature importance from a feature bipartite graph, we design an effective and low-complexity method to directly obtain feature scores from a feature bipartite graph. Compared with the feature importance extraction method based on the complete graph, our proposed method reduces the time complexity from O(d^3) to O(d). To the best of our knowledge, our proposed method is the first multi-view unsupervised feature selection algorithm that achieves O(nd) space and time complexity without data segmentation. Specifically, this method adaptively learns feature-level anchor graph structures through self-expressive multi-view subspace learning, which can effectively capture the structural information between features and anchors. Meanwhile, the proposed method projects low-dimensional anchors to common dimensions and aligns them with consensus anchors to capture the consistency and complementary information between different views. The superiority of the proposed algorithm is demonstrated by comparing it with seven state-of-the-art algorithms on five public image multi-view datasets and two biological information multi-view datasets. The code of the proposed method is publicly available at https://github.com/getupLiu/AFRC

Abstract:
Hyperspectral image (HSI) clustering groups pixels into clusters without labeled data, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL

Affiliations: Department of Mathematics, The Chinese University of Hong Kong, Hong Kong, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, China; China Telecom, Institute of Artificial Intelligence (TeleAI), Shanghai, China; School of Cyber Science and Engineering, Sichuan University, Chengdu, China; Department of Data Science, City University of Hong Kong, Hong Kong, China

Abstract:
Generative artificial intelligence has shown great success in visual content synthesis such that humans struggle to distinguish between real and synthesized images. Forensic research seeks to reveal artifacts in such generated images, ensuring information security or improving generation capability. In this regard, robustness and interpretability are important for trustworthy forensic tasks. However, typical forensic models and their underlying data representations rely on empirical learning algorithms, which cannot effectively handle the high robustness and interpretability requirements beyond experience. As an effective solution, we extend the classical geometric invariants to the forensic research of large-scale generated images. Invariants are handcrafted representations with robust and interpretable geometric principles. However, their discriminability falls far short of the large scale of today’s forensic tasks. We boost the discriminability by extending the classical invariants to the hierarchical architecture of convolutional neural networks. The resulting overcompleteness allows for an automatic selection of task-discriminative features, while retaining the previous advantages of robustness and interpretability. From generative adversarial networks to diffusion models, forensics with our boosted invariants demonstrates state-of-the-art discriminability against large-scale content diversity. It also exhibits high efficiency on training examples, intrinsic invariance to geometric variations, and better interpretability of the forensic process.

Abstract:
Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as a challenging task within this domain. Existing approaches often rely on training a fusion matrix of multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we find that this straightforward method faces two major challenges: 1) concept confusion, where the model struggles to preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through concept injection constraints, enhancing visibility via an expanded cross-attention mechanism. To combat concept confusion, concept isolation constraints are introduced, refining the self-attention computation. Furthermore, we propose two inference techniques, one to accelerate inference without performance degradation and one to enhance the accuracy of the generated region. Extensive experiments demonstrate that LoRA-Composer significantly outperforms standard baselines, especially in scenarios without image-based conditions such as canny edge or pose estimation.

Abstract:
Recently, AI-generated images (AIGIs), synthesized based on initial textual prompts, have attracted widespread attention. However, due to limitations in current generation techniques, these images often exhibit degraded perceptual quality and semantic misalignment with the guiding prompts. Therefore, evaluating both perceptual quality and text-to-image alignment is essential for optimizing the performance of generative models. Existing methods design textual prompts solely based on the initial prompt for both perceptual and alignment quality tasks, and compute only coarse-grained similarity between the designed prompt and the generated image. However, such task-agnostic prompts overlook the distinctions between the perceptual and alignment quality tasks, and coarse-level similarity fails to capture semantic details, leading to suboptimal evaluation performance. To address these challenges, we propose a novel AIGI quality assessment framework, termed TPMS, which incorporates task-specific prompt and multi-granularity similarity computation. The task-specific prompt constructs dedicated prompts for perceptual and alignment quality respectively, allowing the model to capture distinct quality cues tailored to each evaluation task. Multi-granularity similarity measures the coarse-level similarity between the generated image and task-specific prompts to capture global quality characteristics, and the fine-level similarity between the generated image and the initial prompt to enhance semantic detail awareness. By integrating these two complementary similarities, TPMS enables precise and robust quality prediction. Extensive experiments on four widely-used AIGI quality benchmarks validate the effectiveness and superiority of the proposed framework.

Abstract:
Infrared small target detection (IRSTD) is of great practical significance in many real-world applications, such as maritime rescue and early warning systems, benefiting from the unique and excellent infrared imaging ability in adverse weather and low-light conditions. Nevertheless, segmenting small targets from the background remains a challenge. When the subsampling frequency during image processing does not satisfy the Nyquist criterion, the aliasing effect occurs, which makes it extremely difficult to identify small targets. To address this challenge, we propose a novel Wavelet Mamba with Reversible Structure Network (WMRNet) for infrared small target detection in this paper. Specifically, WMRNet consists of a Discrete Wavelet Mamba (DW-Mamba) module and a Third-order Difference Equation guided Reversible (TDE-Rev) structure. DW-Mamba employs the Discrete Wavelet Transform to decompose images into multiple subbands, integrating this information into the state equations of a state space model. This method minimizes frequency interference while preserving a global perspective, thereby effectively reducing background aliasing. The TDE-Rev aims to suppress edge aliasing effects by refining the target edges, which first processes features with an explicit neural structure derived from the second-order difference equations and then promotes feature interactions through a reversible structure. Extensive experiments on the public IRSTD-1k and SIRST datasets demonstrate that the proposed WMRNet outperforms the state-of-the-art methods.
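
The subband decomposition that DW-Mamba relies on can be illustrated with a single-level 2D Haar transform. This is only a minimal sketch: the abstract does not say which wavelet WMRNet uses, so the Haar filter, the function name `haar_dwt2`, and the subband names are assumptions made for illustration, and the state space model itself is not shown.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of a grayscale image given as a list of rows.

    Returns four half-resolution subbands. Height and width must be even.
    The Haar wavelet is an illustrative choice, not taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands = {"approx": [], "col_diff": [], "row_diff": [], "diag_diff": []}
    for r in range(0, rows, 2):
        out = {name: [] for name in bands}
        for c in range(0, cols, 2):
            # The 2x2 block [[a, b], [d, e]] is mapped to one coefficient per subband.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            out["approx"].append((a + b + d + e) / 2.0)     # low-pass in both directions
            out["col_diff"].append((a - b + d - e) / 2.0)   # difference between adjacent columns
            out["row_diff"].append((a + b - d - e) / 2.0)   # difference between adjacent rows
            out["diag_diff"].append((a - b - d + e) / 2.0)  # diagonal difference
        for name in bands:
            bands[name].append(out[name])
    return bands


bands = haar_dwt2([[1, 2], [3, 4]])
print(bands["approx"], bands["col_diff"], bands["row_diff"], bands["diag_diff"])
```

With this orthonormal scaling the transform preserves energy, so the squared coefficients of the four subbands sum to the squared pixel values of the input.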

Abstract:
In dynamic and evolving application scenarios, the ability of visual language models to continuously learn from new data while preserving historical knowledge is critically important. Existing continual learning methods for large visual language models (LVLMs) often restrict the number of tasks they can handle, causing performance to decline as tasks continue to increase. In this paper, we propose a novel continual learning framework that adapts to the growing number of tasks, enabling visual language models to handle a dynamic range of open-set tasks while overcoming the catastrophic forgetting problem of learning new tasks at the expense of forgetting old ones. Our method builds on a pre-trained CLIP model and incorporates a dynamic mixture-of-experts (MoE) layer, enabling flexible adaptation to a wide range of open-set tasks. We design an elastic expert weight management strategy to effectively mitigate the catastrophic forgetting problem. Furthermore, we optimize the LoRA experts with adaptive ranks to achieve a balanced trade-off between model complexity and representational capacity. Extensive experiments across diverse settings demonstrate that our proposed method significantly reduces the number of tunable parameters while consistently surpassing state-of-the-art methods in new task learning capability and maintaining performance on historical tasks.
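
The trade-off between rank and capacity mentioned above follows from how a LoRA update is parameterized: a frozen weight matrix W is augmented by a low-rank product BA. The sketch below shows that generic formulation only. The function names are hypothetical, and the paper's dynamic MoE routing, elastic weight management, and adaptive rank selection are not reproduced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x) for one LoRA expert.

    W is d_out x d_in (frozen), A is rank x d_in, B is d_out x rank.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable parameters in the low-rank factors A and B."""
    return rank * (d_in + d_out)


y = lora_forward(W=[[1, 0], [0, 1]], A=[[1, 1]], B=[[2], [3]], x=[1, 2])
print(y, lora_param_count(512, 512, 8), 512 * 512)
```

Because the trainable parameter count grows linearly with the rank, choosing a smaller rank for some experts directly reduces the number of tunable parameters relative to the full d_out x d_in matrix.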

Abstract:
The substantial successes achieved by diffusion probabilistic models have prompted the study of their employment in resource-limited scenarios. Pruning methods have been proven effective in compressing discriminative models relying on the correlation between training losses and model performances. However, diffusion models employ an iterative process for generating high-quality images, leading to a breakdown of such connections. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for the compression of diffusion models. NiCI-Pruning capitalizes the noise predicted by the model based on clean image inputs, favoring it as a feature for establishing reconstruction losses. Accordingly, Taylor expansion is employed for the proposed reconstruction loss to evaluate the parameter importance effectively. Moreover, we propose an interval sampling strategy that incorporates a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results to affirm the superiority of our proposed approach. Notably, our method achieves a remarkable average reduction of 30.4% in FID score increase across five different datasets compared to the state-of-the-art diffusion pruning method at equivalent pruning rates. Our code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning
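
As a rough illustration of Taylor-based importance scoring, the sketch below uses the common first-order criterion, where the loss change caused by zeroing a weight w with gradient g is approximated by |g * w|. This is a generic pruning criterion, not the NiCI-Pruning reconstruction loss or its timestep weighting; the gradient values and function names are assumptions for illustration.

```python
def taylor_importance(weights, grads):
    """First-order Taylor estimate of the loss change when each weight is set to zero."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def prune_mask(scores, prune_ratio):
    """Return a keep-mask that drops the lowest-scoring fraction of parameters."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_prune])
    return [i not in dropped for i in range(len(scores))]


weights = [0.5, -2.0, 0.25, 1.0]
grads = [0.5, 0.25, -4.0, 0.0]  # hypothetical gradients of some loss w.r.t. each weight
scores = taylor_importance(weights, grads)
print(scores, prune_mask(scores, prune_ratio=0.5))
```

Note that a large weight with zero gradient receives a zero score, which is why the choice of loss used to obtain the gradients matters so much for the ranking.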

Abstract:
Vision-based 3D object detection, a cost-effective alternative to LiDAR-based solutions, plays a crucial role in modern autonomous driving systems. Meanwhile, deep models have been proven susceptible to adversarial examples, and attacking detection models can lead to serious driving consequences. Most previous adversarial attacks targeted 2D detectors by placing the patch in a specific region within the object’s bounding box in the image, allowing it to evade detection. However, attacking a 3D detector is more difficult because the adversary may be observed from different viewpoints and distances, and there is a lack of effective methods to differentiably render the 3D space poster onto the image. In this paper, we propose a novel attack setting where a carefully crafted adversarial poster (which looks like meaningless graffiti) is learned and pasted on the road surface, inducing vision-based 3D detectors to perceive a non-existent object. We show that even a single 2D poster is sufficient to deceive the 3D detector with the desired attack effect, and that the poster is universal, remaining effective across various scenes, viewpoints, and distances. To generate the poster, an image-3D applying algorithm is devised to establish the pixel-wise mapping relationship between the image area and the 3D space poster so that the poster can be optimized through standard backpropagation. Moreover, a ground-truth masked optimization strategy is presented to effectively learn the poster without interference from scene objects. Extensive results, including real-world experiments, validate the effectiveness of our adversarial attack. The transferability and defense strategy are also investigated to comprehensively understand the proposed attack.

Abstract:
Automated social behaviour analysis of mice has become an increasingly popular research area in behavioural neuroscience. Recently, pose information (i.e., locations of keypoints or skeleton) has been used to interpret social behaviours of mice. Nevertheless, effective encoding and decoding of social interaction information underlying the keypoints of mice has been rarely investigated in the existing methods. In particular, it is challenging to model complex social interactions between mice due to highly deformable body shapes and ambiguous movement patterns. To deal with the interaction modelling problem, we here propose a Cross-Skeleton Interaction Graph Aggregation Network (CS-IGANet) to learn abundant dynamics of freely interacting mice, where a Cross-Skeleton Node-level Interaction module (CS-NLI) is used to model multi-level interactions (i.e., intra-, inter- and cross-skeleton interactions). Furthermore, we design a novel Interaction-Aware Transformer (IAT) to dynamically learn the graph-level representation of social behaviours and update the node-level representation, guided by our proposed interaction-aware self-attention mechanism. Finally, to enhance the representation ability of our model, an auxiliary self-supervised learning task is proposed for measuring the similarity between cross-skeleton nodes. Experimental results on the standard CRMI13-Skeleton and our PDMB-Skeleton datasets show that our proposed model outperforms several other state-of-the-art approaches.

Abstract:
Great efforts have been made to investigate AI’s ability in abstract reasoning, along with the proposal of various versions of Raven’s Progressive Matrices (RPM) as benchmarks. Previous studies suggest that, even after extensive training, neural networks may still struggle to make decisive choices on RPM problems without sophisticated designs or additional semantic information in the form of metadata. Through comprehensive experiments, we demonstrate that neural networks endowed with appropriate inductive biases, either intentionally designed or fortuitously matched, can efficiently solve RPM problems without the need for extra metadata augmentation. Our work also reveals the importance of employing a multi-viewpoint with multi-evaluation approach as a key learning strategy for successful reasoning. Nevertheless, we acknowledge the unique role of metadata by demonstrating that a pre-training model supervised by metadata leads to an RPM solver with improved performance. Code is available at https://github.com/QinglaiWeiCASIA/RavenSolver.

Abstract:
Vision Transformer (ViT), known for capturing non-local features, is an effective tool for hyperspectral image classification (HSIC). However, ViT’s multi-head self-attention (MHSA) mechanism often struggles to balance local details and long-range relationships for complex high-dimensional data, leading to a loss in spectral-spatial information representation. To address this issue, we propose a deformable convolution-enhanced hierarchical Transformer with spectral-spatial cluster attention (SClusterFormer) for HSIC. The model incorporates a unique cluster attention mechanism that utilizes spectral angle similarity and Euclidean distance metrics to enhance the representation of fine-grained homogenous local details and improve discrimination of non-local structures in 3D HSI and 2D morphological data, respectively. Additionally, a dual-branch multiscale deformable convolution framework augmented with frequency-based spectral attention is designed to capture both the discrepancy patterns in high-frequency and overall trend of the spectral profile in low-frequency. Finally, we utilize a cross-feature pixel-level fusion module for collaborative cross-learning and fusion of the results from the dual-branch framework. Comprehensive experiments conducted on multiple HSIC datasets validate the superiority of our proposed SClusterFormer model, which outperforms existing methods. The source code of SClusterFormer is available at https://github.com/Fang666666/HSIC_SClusterFormer.
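
The two similarity measures named in the cluster attention description behave differently on spectra, which a short sketch makes concrete. The spectral angle depends only on the shape of a spectrum, while the Euclidean distance also reacts to its magnitude. The functions below are standard definitions; how SClusterFormer feeds them into attention is not shown, and the sample spectra are made up.

```python
import math


def spectral_angle(x, y):
    """Angle in radians between two spectra treated as vectors (0 means same shape)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


def euclidean_distance(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


pixel = [1.0, 2.0, 3.0]
brighter_pixel = [2.0, 4.0, 6.0]  # same material under stronger illumination (hypothetical)
print(spectral_angle(pixel, brighter_pixel), euclidean_distance(pixel, brighter_pixel))
```

Scaling a spectrum leaves the spectral angle near zero while the Euclidean distance grows, which is one reason the angle is a natural choice for comparing spectral signatures.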

Abstract:
Few-Shot Object Detection (FSOD) aims to detect the objects of novel classes using only a few manually annotated samples. With so few novel class samples, learning the inter-class relationships among foreground classes and constructing the corresponding class hierarchy in FSOD is a challenging task. Poor construction of the class hierarchy results in the inter-class confusion problem, which has been identified as a primary cause of inferior performance in novel classes by recent FSOD methods. In this work, we further find that intra-super-class confusion, where samples are misclassified as classes within their associated super-classes, is the main challenge in solving the confusion problem. To solve this issue, this work generates class-confusion-aware samples with a pre-defined tree-structure graph to help models construct a precise class hierarchy. Specifically, to generate class-confusion-aware samples, we add noise to the available samples and update the noise to maximize the confidence scores on the samples’ associated confusion categories. Then, a confusion-aware curriculum learning strategy is proposed to make the generated samples gradually participate in training, which benefits model convergence while learning the generated samples. Experimental results show that our method can be used as a plug-in in recent FSOD methods and consistently improves model performance.

Abstract:
In recent years, there has been a notable surge in the adoption of weakly-supervised learning for medical image segmentation, utilizing scribble annotation as a means to potentially reduce annotation costs. However, the inherent characteristics of scribble labeling, marked by incompleteness, subjectivity, and a lack of standardization, introduce inconsistencies into the annotations. These inconsistencies become significant challenges for the network’s learning process, ultimately affecting the performance of segmentation. To address this challenge, we propose creating a reference set to guide pixel-level feature matching, constructed from class-specific tokens and pixel-level features extracted from various images. Serving as a repository showcasing diverse pixel styles and classes, the reference set becomes the cornerstone for a pixel-level feature matching strategy. This strategy enables the effective comparison of unlabeled pixels, offering guidance, particularly in learning scenarios characterized by inconsistent and incomplete scribbles. The proposed strategy incorporates smoothing and regression techniques to align pixel-level features across different images. By leveraging the diversity of pixel sources, our matching approach enhances the network’s ability to learn consistent patterns from the reference set. This, in turn, mitigates the impact of inconsistent and incomplete labeling, resulting in improved segmentation outcomes. Extensive experiments conducted on three publicly available datasets demonstrate the superiority of our approach over state-of-the-art methods in terms of segmentation accuracy and stability. The code will be made publicly available at https://github.com/jingkunchen/scribble-medical-segmentation.

Abstract:
Existing unfolding-based compressive imaging approaches always suffer from certain issues, including inefficient feature extraction and information loss during iterative reconstruction phases, which become particularly evident at low sampling ratios, i.e., significant detail degradation and distortion in reconstructed images. To mitigate these challenges, we propose USB-Net, a deep unfolding method inspired by the renowned Split Bregman algorithm and a multi-phase feature integration strategy, for compressive imaging reconstruction. Specifically, we use a customized Depthwise Attention Block not only as a fundamental block for feature extraction, but also to address the sparse induction-related splitting operator within the Split Bregman method. Based on this, we introduce three Auxiliary Iteration Modules, $\mathrm{X}^{(k)}$, $\mathrm{D}^{(k)}$, and $\mathrm{B}^{(k)}$, to reinforce the effectiveness of Split Bregman’s decomposition strategy for problem breakdown and Bregman iterations. Moreover, we introduce two categories of Iterative Fusion Modules to seamlessly harmonize and integrate insights across iterative reconstruction phases, enhancing the utilization of crucial features, such as edge information and textures. In general, USB-Net can fully harness the advantages of the traditional Split Bregman approach, manipulating multi-phase iterative insights to enhance feature extraction, optimize data fidelity, and achieve high-quality image reconstruction. Extensive experiments show that USB-Net significantly outperforms current state-of-the-art methods on image compressive sensing, CS-magnetic resonance imaging, and snapshot compressive imaging tasks, demonstrating superior generalizability. Our code is available at USB-Net.

Abstract:
With the success of the DEtection TRansformer (DETR), numerous researchers have explored its effectiveness in addressing unsupervised domain adaptation tasks. Existing methods leverage carefully designed feature alignment techniques to align the backbone or encoder, yielding promising results. However, effectively aligning instance-level features within the unique decoder structure of the detector has largely been neglected. Related techniques primarily align instance-level features in a class-agnostic manner, overlooking distinctions between features from different categories, which results in only limited improvements. Furthermore, the scope of current alignment modules in the decoder is often restricted to a limited batch of images, failing to capture the dataset-level cues, thereby severely constraining the detector’s generalization ability to the target domain. To this end, we introduce a strong DETR-based detector named Domain Adaptive detection TRansformer (DATR) for unsupervised domain adaptation of object detection. First, we propose the Class-wise Prototypes Alignment (CPA) module, which effectively aligns cross-domain features in a class-aware manner by bridging the gap between the object detection task and the domain adaptation task. Then, the designed Dataset-level Alignment Scheme (DAS) explicitly guides the detector to achieve global representation and enhance inter-class distinguishability of instance-level features across the entire dataset, which spans both domains, by leveraging contrastive learning. Moreover, DATR incorporates a mean-teacher-based self-training framework, utilizing pseudo-labels generated by the teacher model to further mitigate domain bias. Extensive experimental results demonstrate superior performance and generalization capabilities of our proposed DATR in multiple domain adaptation scenarios. Code is released at https://github.com/h751410234/DATR.
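
The mean-teacher self-training step mentioned above usually combines two pieces: the teacher's weights track an exponential moving average (EMA) of the student's weights, and only confident teacher predictions are kept as pseudo-labels. The sketch below shows that generic pattern. The momentum, the confidence threshold, and the function names are illustrative assumptions, not values or code from DATR.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the matching student parameter."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def select_pseudo_labels(predictions, threshold=0.5):
    """Keep (label, score) teacher predictions whose score reaches the threshold."""
    return [(label, score) for label, score in predictions if score >= threshold]


teacher = {"w": 1.0}
student = {"w": 3.0}
teacher = ema_update(teacher, student, momentum=0.5)
print(teacher)

# Hypothetical teacher detections on a target-domain image.
detections = [("car", 0.9), ("person", 0.3), ("bus", 0.5)]
print(select_pseudo_labels(detections, threshold=0.5))
```

A momentum close to 1 makes the teacher change slowly, which tends to stabilize the pseudo-labels it produces.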

Abstract:
Ridge detection is a classical tool to extract curvilinear features in image processing. As such, it has great promise in applications to material science problems; specifically, for trend filtering relatively stable atom-shaped objects in image sequences, such as bright-field Transmission Electron Microscopy (TEM) videos. Standard analysis of TEM videos is limited to frame-by-frame object recognition. We instead harness temporal correlation across frames through simultaneous analysis of long image sequences, specified as a spatio-temporal image tensor. We define new ridge detection algorithms to non-parametrically estimate explicit trajectories of atomic-level object locations as a continuous function of time. Our approach is specially tailored to handle temporal analysis of objects that seemingly stochastically disappear and subsequently reappear throughout a sequence. We demonstrate that the proposed method is highly effective in simulation scenarios, and delivers notable performance improvements in TEM experiments compared to other material science benchmarks.

Abstract:
As a continual learning paradigm where non-stationary data arrive in the form of streams and training occurs whenever a small batch of samples is accumulated, general continual learning (GCL) suffers from both inter-task bias and intra-task bias. Existing GCL methods can hardly handle both issues simultaneously, since doing so requires models to avoid falling into the spurious correlation trap of GCL. From a causal perspective, we formalize a structural causal model of GCL and conclude that spurious correlation exists not only between confounders and input, but also within multiple causal variables. Inspired by frequency transformation techniques which harbor intricate patterns of image comprehension, we propose a plug-and-play module: the Dual-Domain Division Multiplex (D3M) unit, which intervenes on confounders and multiple causal factors over frequency and spatial domains with a two-stage pseudo causal intervention strategy. Specifically, D3M consists of a frequency division multiplexer (FDM) module and a spatial division multiplexer (SDM) module, each of which prioritizes target-relevant causal features by dividing and multiplexing features over the frequency domain and the spatial domain, respectively. As a lightweight and model-agnostic unit, D3M can be seamlessly integrated into most current GCL methods. Extensive experiments on four popular datasets demonstrate that D3M significantly enhances accuracy and diminishes catastrophic forgetting compared to current methods. The code is available at https://github.com/wangsfan/D3M.

Abstract:
Unsupervised non-rigid point cloud shape correspondence underpins a multitude of 3D vision tasks, yet itself is non-trivial given the exponential complexity stemming from inter-point degree-of-freedom, i.e., pose transformations. Based on the assumption of local rigidity, one solution for reducing complexity is to decompose the overall shape into independent local regions using Local Reference Frames (LRFs) that are equivariant to SE(3) transformations. However, the focus solely on local structure neglects global geometric contexts, resulting in less distinctive LRFs that lack crucial semantic information necessary for effective matching. Furthermore, such complexity introduces out-of-distribution geometric contexts during inference, thus complicating generalization. To this end, we introduce 1) EquiShape, a novel structure tailored to learn pair-wise LRFs with global structural cues for both spatial and semantic consistency, and 2) LRF-Refine, an optimization strategy generally applicable to LRF-based methods, aimed at addressing the generalization challenges. Specifically, for EquiShape, we employ cross-talk within separate equivariant graph neural networks (Cross-GVP) to build long-range dependencies to compensate for the lack of semantic information in local structure modeling, deducing pair-wise independent SE(3)-equivariant LRF vectors for each point. For LRF-Refine, the optimization adjusts LRFs within specific contexts and knowledge, enhancing the geometric and semantic generalizability of point features. Our overall framework surpasses the state-of-the-art methods by a large margin on three benchmarks. Codes are available at https://github.com/2019EPWL/EquiShape.

Abstract:
In adverse environments, the detector often fails to detect degraded objects because they are almost invisible and their features are weakened by the environment. Common approaches involve image enhancement to support detection, but they inevitably introduce human-invisible noise that negatively impacts the detector. In this work, we propose a physics-guided approach for object detection in adverse environments, which gives a straightforward solution that injects the physical priors into the detector, enabling it to detect poorly visible objects. The physical priors, derived from the imaging mechanism and image property, include environment prior and frequency prior. The environment prior is generated from the physical model, e.g., the atmospheric model, which reflects the density of environmental noise. The frequency prior is explored based on an observation that the amplitude spectrum could highlight object regions from the background. The proposed two priors are complementary in principle. Furthermore, we present a physics-guided loss that incorporates a novel weight item, which is estimated by applying the membership function on physical priors and could capture the extent of degradation. By backpropagating the physics-guided loss, physics knowledge is injected into the detector to aid in locating degraded objects. We conduct experiments in synthetic foggy environment, real foggy environment, and real underwater scenario. The results demonstrate that our method is effective and achieves state-of-the-art performance. The code is available at https://github.com/PangJian123/See-Degraded-Objects.
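
The atmospheric model referred to above is commonly written as I = J * t + A * (1 - t), where J is the clear scene radiance, A is the airlight, and t = exp(-beta * d) is the transmission at depth d. The sketch below evaluates that standard model per pixel. Treating 1 - t as a proxy for the density of environmental noise is an assumption made here for illustration; the paper's actual environment prior and membership function are not reproduced.

```python
import math


def transmission(depth, beta=1.0):
    """Fraction of scene light that reaches the camera through the medium."""
    return math.exp(-beta * depth)


def add_fog(clear, depth, beta=1.0, airlight=1.0):
    """Apply the atmospheric scattering model to one pixel intensity in [0, 1]."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def fog_density(depth, beta=1.0):
    """Illustrative environment-prior value: how much of the pixel is airlight."""
    return 1.0 - transmission(depth, beta)


for depth in (0.0, 1.0, 5.0):
    print(depth, add_fog(0.2, depth), fog_density(depth))
```

As depth grows the observed intensity converges to the airlight, so distant objects lose contrast, which matches the abstract's description of degraded objects being almost invisible.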

Abstract:
Hyperspectral imaging is endowed with outstanding discriminability between different land types by its comprehensive sensing of the spectrum, and is thus well suited to anomaly detection. However, the blurring effect, a critical cause of quality deterioration in hyperspectral imaging, has been overlooked by previous hyperspectral anomaly detection models. On one hand, given that anomalies are sparsely distributed in nature, such a blurring effect, which entangles neighboring pixels, severely weighs those detection models down. On the other hand, abnormal objects jeopardize the low-dimensional structure of the image, so deblurring images with anomalies is more challenging than deblurring normal ones. Hence, it is of much significance to investigate anomaly detection using blurred hyperspectral images. To this end, this paper proposes a generalized non-convex surrogated tensor framework that performs anomaly detection robustly under blurring effects on hyperspectral images. The proposed framework is a unified paradigm that guarantees convergence for a broad class of non-convex surrogates. By treating the spatial and spectral low-rankness adaptively via Block Term Decomposition, the unevenness in the multi-linear low-rankness of hyperspectral images is comprehensively considered, which together with the non-convex surrogates results in a tighter modeling of the low-dimensional prior of hyperspectral images. Extensive experiments demonstrate the superiority of the proposed method compared with the state-of-the-art methods on both hyperspectral image deblurring and anomaly detection.

Abstract:
A single-modal infrared or visible image offers limited representation in scenes with lighting degradation or extreme weather. We propose a multi-modal fusion framework, named SDSFusion, for all-day and all-weather infrared and visible image fusion. SDSFusion exploits the commonality in image processing to achieve enhancement, fusion, and semantic task interaction in a unified framework guided by semantic awareness and multi-scale features and losses. To address the disparity between infrared and visible images in degraded scenes, we differentiate modal features in a unified fusion model. Unlike existing joint fusion methods, we propose an adversarial generative network that refines the reconstruction of low-light images by embedding fused features. It provides feature-level brightness supplementation and image reconstruction to refine brightness and contrast. Extensive experiments in degraded scenes confirm that our approach is superior to state-of-the-art approaches in visual quality and performance, demonstrating the effectiveness of interaction improvement. The code will be posted at: https://github.com/Liling-yang/SDSFusion.

Abstract:
Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes including mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance the research on audio-visual saliency prediction for ODVs, we further establish a new benchmark based on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only models and audio-visual models. In addition, given the limitations of current models, we propose an innovative omnidirectional audio-visual saliency prediction network (OmniAVS), which is built based on the U-Net architecture, and hierarchically fuses audio and visual features from the multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and the OmniAVS model are available at: https://github.com/IntMeGroup/AVS-ODV.

Abstract:
Hyperspectral video (HSV) offers valuable spatial, spectral, and temporal information simultaneously, making it highly suitable for handling challenges such as background clutter and visual similarity in object tracking. However, existing methods primarily focus on band regrouping and rely on RGB trackers for feature extraction, resulting in limited exploration of spectral information and difficulties in achieving complementary representations of object features. In this paper, a spatial-spectral fusion network with spectral angle awareness (SSF-Net) is proposed for hyperspectral (HS) object tracking. Firstly, to address the issue of insufficient spectral feature extraction in existing networks, a spatial-spectral feature backbone (S^2FB) is designed. With its spatial and spectral extraction branches, a joint representation of texture and spectrum is obtained. Secondly, a spectral attention fusion module (SAFM) is presented to capture the intra- and inter-modality correlation and obtain the fused features from the HS and RGB modalities. It incorporates the visual information into the HS context to form a robust representation. Thirdly, to ensure a more accurate response to the object position, a spectral angle awareness module (SAAM) is designed to investigate the region-level spectral similarity between the template and search images during the prediction stage. Furthermore, a novel spectral angle awareness loss (SAAL) is developed to guide the SAAM based on similar regions. Finally, to obtain robust tracking results, a weighted prediction method combines the HS and RGB predicted motions of objects to leverage the strengths of each modality. Extensive experiments on the HOTC-2020, HOTC-2024, and BihoT datasets demonstrate the effectiveness of the proposed SSF-Net compared with state-of-the-art trackers. The source code will be available at https://github.com/hzwyhc/hsvt

Abstract:
Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at https://github.com/caoluyang0830/CVBM.git
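The DSC figures quoted above refer to the Dice similarity coefficient, a standard overlap metric for segmentation masks. A minimal sketch of that metric for binary masks follows; the function name and the convention for two empty masks are assumptions of this example, not details taken from CVBM.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention chosen here).
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)

# One overlapping pixel, mask sizes 2 and 1: DSC = 2 * 1 / (2 + 1).
print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```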

Abstract:
Open-set recognition (OSR) aims to accurately classify known categories while effectively rejecting unknown negative samples. Existing methods for OSR in hyperspectral images (HSI) can be generally divided into two categories: reconstruction-based and distance-based methods. Reconstruction-based approaches focus on analyzing reconstruction errors during inference, whereas distance-based methods determine the rejection of unknown samples by measuring their distance to each prototype. However, these techniques often require a substantial amount of training data, which can be both time-consuming and expensive to gather, and they require manual threshold setting, which can be difficult for different tasks. Furthermore, effectively utilizing spectral-spatial information in HSI remains a significant challenge, particularly in open-set scenarios. To tackle these challenges, we introduce a few-shot OSR framework for HSI named HyperTaFOR, which incorporates a novel spatial-spectral selective transformer (S3Former). This framework employs a meta-learning strategy to implement a negative prototype generation module (NPGM) that generates task-adaptive rejection scores, allowing flexible categorization of samples into various known classes and anomalies for each task. Additionally, the S3Former is designed to extract spectral-spatial features, optimizing the use of central pixel information while reducing the impact of irrelevant spatial data. Comprehensive experiments conducted on three benchmark hyperspectral datasets show that our proposed method delivers competitive classification and detection performance in open-set environments when compared to state-of-the-art methods. The code is available online at https://github.com/B-Xi/TIP_2025_HyperTaFOR.

Abstract:
The preservation and the enhancement of complementary features between modalities are crucial for multi-modal image fusion and downstream vision tasks. However, existing methods are limited to local receptive fields (CNNs) or lack comprehensive utilization of spatial information from both modalities during interaction (transformers), which results in the inability to effectively retain useful information from both modalities in a comparative manner. Consequently, the fused images may exhibit a bias towards one modality, failing to adaptively preserve salient targets from all sources. Thus, a novel fusion framework (S4Fusion) based on the Saliency-aware Selective State Space is proposed. S4Fusion introduces the Cross-Modal Spatial Awareness Module (CMSA), which is designed to simultaneously capture global spatial information from all input modalities and promote effective cross-modal interaction. This enables a more comprehensive representation of complementary features. Furthermore, to guide the model in adaptively preserving salient objects, we propose a novel perception-enhanced loss function. This loss aims to enhance the retention of salient features by minimizing ambiguity or uncertainty, as measured at a pre-trained model’s decision layer, within the fused images. The code is available at https://github.com/zipper112/S4Fusion

Abstract:
Compressive spectral imaging has garnered significant attention for its ability to effectively enhance the captured spatial and spectral information. Predominant methods, based on compressive sensing, typically formulate the imaging task as a constrained optimization problem and rely on hand-crafted priors to model the sparsity of spectral images. However, these approaches often suffer from suboptimal performance due to the inherent difficulty of identifying an appropriate transform space where spectral images exhibit sparsity. To overcome this limitation, we propose a novel convolutional sparse coding-inspired untrained network prior for fast and adaptive identification of the sparse transform domain and compressible signal. Specifically, a Lightweight Convolutional Thresholding sparse Coding (LCTC) network is designed as the sparse transform domain, with its inputs interpreted as sparse coefficients. Crucially, both the transform domain and its coefficients are solved in a self-supervised learning manner. Furthermore, we demonstrate that LCTC prior can be seamlessly incorporated into the iterative optimization algorithm as a Plug-and-Play (PnP) regularization. Both the LCTC and PnP-LCTC exhibit superior performance compared to previous methods. Experiments under various scenarios validate the effectiveness and efficiency of our approach.
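As background for the thresholding-based sparse coding mentioned above, the sketch below shows the classical soft-thresholding operator and a plain ISTA loop for the problem min_z 0.5 * ||y - D z||^2 + lam * ||z||_1. It uses a dense dictionary `D` for brevity and is an assumption-laden illustration of the general technique, not the LCTC network or its PnP variant.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry of x toward zero by lam (proximal operator of lam * L1 norm)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(D, y, lam, n_iter=100):
    """Iterative shrinkage-thresholding for 0.5 * ||y - D z||^2 + lam * ||z||_1."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the data-term gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - y)                    # gradient of the data term
        z = soft_threshold(z - grad / L, lam / L)   # gradient step, then shrinkage
    return z
```

With an identity dictionary the solution reduces to `soft_threshold(y, lam)`, which is a convenient sanity check.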

Abstract:
Infrared small target detection has been extensively studied due to its wide range of applications. Most studies treat infrared small target detection as an independent task, either detection-based or segmentation-based, and thus fail to fully leverage the supervisory information from different annotation forms. To address this issue, we propose a multi-task mutual learning network (MTMLNet) specifically designed for infrared small targets, aiming to enhance both detection and segmentation performance by effectively utilizing various forms of supervisory information. Specifically, we design a multi-stage feature aggregation (MFA) module capable of capturing features with varying gradients and receptive fields simultaneously. Additionally, a hybrid pooling down-sampling (HPDown) module is proposed to mitigate information loss during the down-sampling of infrared small targets. Finally, a hierarchical feature fusion (HFF) module is designed to adaptively select and fuse features from different semantic layers, learning the optimal way to fuse features across them. Results on the IRSTD-1k and SIRST-V2 datasets show that the proposed MTMLNet achieves state-of-the-art (SOTA) performance among both detection-based and segmentation-based methods. The code is available at https://github.com/YangBo0411/MTMLNet

Abstract:
Images captured in challenging environments, such as nighttime, smoke, rainy weather, and underwater scenes, often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed “ReviveDiff”, which can address various degradations and restore images to their original quality by enhancing and restoring their details. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserve the original structures of objects. To restore the quality of such images, we leverage the latest advancements in diffusion models and develop ReviveDiff to restore image quality at both macro and micro levels across key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluate ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms state-of-the-art methods both quantitatively and visually.

Abstract:
Image-based age estimation aims to predict a person’s age from facial images. It is used in a variety of real-world applications. Although end-to-end deep models have achieved impressive results for age estimation on benchmark datasets, their performance in the wild still leaves much room for improvement due to the challenges caused by large variations in head pose, facial expressions, and occlusions. To address this issue, we propose a simple yet effective method to explicitly incorporate facial semantics into age estimation, so that the model learns to focus on the most informative facial components from unaligned facial images regardless of head pose and non-rigid deformation. To this end, we design a face parsing-based network to learn semantic information at different scales and a novel face parsing attention module to leverage these semantic features for age estimation. To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean. This dataset is created by semi-automatically cleaning the noisy IMDB-WIKI dataset using a constrained clustering method. Through comprehensive experiments on IMDB-Clean and other benchmark datasets, under both intra-dataset and cross-dataset evaluation protocols, we show that our method consistently outperforms all existing age estimation methods and achieves new state-of-the-art performance. To the best of our knowledge, our work presents the first attempt at leveraging face parsing attention to achieve semantic-aware age estimation, which may inspire other high-level facial analysis tasks.

Abstract:
For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented. However, existing methods have primarily focused on priors of specific properties or shapes, without considering general global shape similarity from a contour flow perspective. Furthermore, naturally integrating an image segmentation model with this contour flow prior into the activation functions of deep convolutional networks through mathematical methods remains unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. We then mathematically derive a contour flow constraint that ensures the preservation of global shape similarity. We propose two implementations to integrate the constraint with deep neural networks. Firstly, the constraint is converted to a shape loss, which can be seamlessly incorporated into the training phase of any learning-based segmentation framework. Secondly, we add the constraint to a variational segmentation model and derive its iterative solution scheme. The scheme is then unrolled to obtain the architecture of the proposed CFSSnet. Validation experiments are conducted on diverse datasets with classic benchmark deep segmentation networks. The results indicate a substantial improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing its general adaptability regardless of the specific network architecture. CFSSnet shows robustness in segmenting noise-contaminated images and an inherent capability to preserve global shape similarity.

Affiliations: Department of Artificial Intelligence, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, Nanjing University of Aeronautics and Astronautics, Nanjing, China; Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, MD, USA; Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA

Abstract:
Multimodal fusion provides multiple benefits over single modality analysis by leveraging both shared and complementary information from different modalities. Notably, supervised fusion enjoys extensive interest for capturing multimodal co-varying patterns associated with clinical measures. A key challenge of brain data analysis is how to handle confounds, which, if unaddressed, can lead to an unrealistic description of the relationship between the brain and clinical measures. Current approaches often rely on linear regression to remove covariate effects prior to fusion, which may lead to information loss, rather than pursuing the more global strategy of optimizing fusion and covariate removal simultaneously. Thus, we propose “CR-mCCAR” to jointly optimize for confounds within a guided fusion model, capturing co-varying multimodal patterns associated with a specific clinical domain while also discounting covariate effects. Simulations show that CR-mCCAR separates the reference and covariate factors accurately. Functional and structural neuroimaging data fusion reveals co-varying patterns in attention deficit/hyperactivity disorder (ADHD, striato-thalamo-cortical and salience areas) and in autism spectrum disorder (ASD, salience and fronto-temporal areas) that link with core symptoms but are uncorrelated with age and motion. These results replicate in an independent cohort. Downstream classification accuracy between ADHD/ASD and controls is markedly higher for CR-mCCAR than for fusion and regression performed separately. CR-mCCAR can be extended to include multiple targets and multiple covariates. Overall, the results demonstrate that CR-mCCAR can jointly optimize for target components that correlate with the reference(s) while removing nuisance covariates. This approach can improve the meaningful detection of reliable phenotype-linked multimodal biomarkers for brain disorders.
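The conventional baseline the abstract contrasts against, removing covariate effects by linear regression before fusion, can be sketched as ordinary least-squares residualization. This illustrates only that pre-processing step under assumed array shapes (subjects in rows); it is not CR-mCCAR, and the name `regress_out` is hypothetical.

```python
import numpy as np

def regress_out(X, C):
    """Return residuals of X after least-squares regression on covariates C.

    X: (n_subjects, n_features) data matrix.
    C: (n_subjects,) or (n_subjects, n_covariates) nuisance covariates, e.g. age or motion.
    """
    X = np.asarray(X, dtype=float)
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(X.shape[0]), np.asarray(C, dtype=float)])
    beta, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ beta

rng = np.random.default_rng(0)
age = rng.normal(size=50)
data = np.outer(age, [2.0, -1.0]) + rng.normal(size=(50, 2))
residuals = regress_out(data, age)  # linearly uncorrelated with age by construction
```

Any signal of interest that happens to be collinear with the covariate is removed along with it, which is the information-loss concern raised in the abstract.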

Abstract:
With the prevalence of emerging computer vision applications, the demand for capturing dynamic scenes with high-speed motion has increased. A kind of neuromorphic sensor called spike camera shows great potential in this aspect since it generates a stream of binary spikes to describe the dynamic light intensity with a very high temporal resolution. Color spike camera (CSC) was recently invented to capture the color information of dynamic scenes via a color filter array (CFA) on the sensor. This paper proposes a long short-term temporal aggregation strategy of spike signals. First, we utilize short-term temporal correlation to adaptively extract temporal features of each time point. Then we align the features and aggregate them to exploit long-term temporal correlation, suppressing undesired motion blur. To implement the strategy, we design a CSC reconstruction network. Based on adaptive short-term temporal aggregation, we propose a spike representation module to extract temporal features of each color channel, leveraging multiple temporal scales. Considering the long-term temporal correlation, we develop an alignment module to align the temporal features. In particular, we perform motion alignment of red and blue channels with the guidance of the higher-sampling-rate green channel, leveraging motion consistency among color channels. Besides, we propose a module to aggregate the aligned temporal features for the restored color image, which exploits color channel correlation. We have also developed a CSC simulator for data generation. Experimental results demonstrate that our method can restore color images with fine texture details, achieving state-of-the-art CSC reconstruction performance.

Abstract:
In fringe projection profilometry systems, accurately reconstructing 3D objects with varying surface reflectivity requires high dynamic range (HDR) imaging. However, the limited dynamic range of single-exposure cameras poses challenges for capturing HDR fringe patterns efficiently. This paper introduces a deep learning-based HDR structured light 3D reconstruction pipeline, comprising an HDR Fringe Generation Module and a Phase Calculation Module. The HDR Fringe Generation Module employs an end-to-end network with attention guidance and feature distillation to reconstruct HDR fringe images from short- and long-exposure low dynamic range (LDR) inputs. The Phase Calculation Module processes the phase information from HDR fringes to enable 3D reconstruction. On a metallic HDR dataset, the method achieved a phase error of 0.105, comparable to the 4-exposure 6-step Phase Shifting Profilometry (PSP) method (0.069), with only 8.3% of the projection time. Experimental results demonstrate the robustness of our approach under diverse object geometries, exposure levels, and challenging global illumination environments. In quantitative measurements, our method achieved sub-50 μm accuracy on ceramic spheres, flat plates, and a metal step object. Ablation experiments confirmed that feature distillation and the attention module effectively enhance the HDR Fringe Generation Module, producing high-quality HDR fringe patterns critical for reconstructing objects with HDR surface reflectivity. Furthermore, we constructed an HDR imaging metal dataset comprising 1,700 samples of machined metal parts with diverse shapes, sizes, and materials, making it a benchmark in the field of HDR structured light measurement. Our method offers a general HDR imaging-based structured light 3D reconstruction approach, integrating the two modules into an efficient, end-to-end solution for objects with HDR reflective surfaces.
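For context on the PSP baseline mentioned above, the textbook N-step phase-shifting formula recovers the wrapped phase from N fringe images with equally spaced shifts. The sketch assumes the intensity model I_n = A + B cos(phi + 2*pi*n/N) with N >= 3; it is the classical computation, not the paper's learned Phase Calculation Module, and the function name is an assumption.

```python
import numpy as np

def psp_wrapped_phase(images):
    """Wrapped phase in (-pi, pi] from N equally shifted fringe images.

    images: array of shape (N, ...) with I_n = A + B * cos(phi + 2*pi*n/N), N >= 3.
    """
    I = np.asarray(images, dtype=float)
    n_steps = I.shape[0]
    delta = 2.0 * np.pi * np.arange(n_steps) / n_steps
    # Sum over the shift index; the constant term A cancels for equally spaced shifts.
    s = np.tensordot(np.sin(delta), I, axes=(0, 0))  # equals -(N/2) * B * sin(phi)
    c = np.tensordot(np.cos(delta), I, axes=(0, 0))  # equals  (N/2) * B * cos(phi)
    return np.arctan2(-s, c)

# Synthetic 6-step example with a known phase ramp.
true_phase = np.linspace(-3.0, 3.0, 7)
frames = [0.5 + 0.4 * np.cos(true_phase + 2.0 * np.pi * n / 6) for n in range(6)]
recovered = psp_wrapped_phase(frames)
```

Saturated or underexposed pixels violate the cosine model, which is why conventional PSP on reflective surfaces resorts to multiple exposures.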

Abstract:
Animation super-resolution (SR) aims to generate high-resolution (HR) animation frames from degraded low-resolution (LR) inputs, constituting an important task in real-world SR. Existing animation SR methods typically follow a photorealistic real-world SR computational paradigm. However, digital animation frames commonly suffer from compression and transmission-related degradation, distinct from degradations in camera-captured real-world images. In this paper, we introduce a novel real-world animation super-resolution benchmark designed explicitly for animation frames, named ADASR, featuring both 2D and modern 3D animation content to facilitate industry applications. Additionally, we propose a Color-Aware Animation Super-Resolution (CAASR) method. CAASR, for the first time, incorporates a color degradation simulation mechanism tailored for animations, addressing color banding, blocking, and color shift. Furthermore, we develop a multi-scale multi-frequency alignment mechanism to robustly extract degradation-invariant features. Extensive experiments conducted on both the existing AVC dataset and our newly constructed ADASR dataset demonstrate that our proposed CAASR achieves state-of-the-art performance in restoring HR frames for both 2D and 3D animations. Code and data are available at https://github.com/huangyang-666/CAASR.

Abstract:
Data synthesis methods have shown promising results in general deepfake detection tasks. This is attributed to the inherent blending process in deepfake creation, which leaves behind distinct synthetic artifacts. However, the existence of content-irrelevant artifacts has not been explicitly explored in the deepfake synthesis. Unveiling content-irrelevant synthetic artifacts helps uncover general deepfake features and enhances the generalization capability of detection models. To capture the content-irrelevant synthetic artifacts, we propose a learning framework incorporating a synthesis process for diverse contents and specially designed learning strategies that encourage using content-irrelevant forgery information across deepfake images. From the data perspective, we disentangle the blending operation from face data and propose a universal synthetic module that generates images from various classes with common synthetic artifacts. From the learning perspective, a domain-adaptive learning head is introduced to filter out forgery-irrelevant features and optimize the decision on deepfake face detection. To efficiently learn the content-irrelevant artifacts for detection with a large sampling space, we propose a batch-wise sample selection strategy that actively mines the hard samples based on their effect on the adaptive decision boundary. Extensive cross-dataset experiments show that our method achieves state-of-the-art performance in general deepfake detection.

Affiliations: Guangdong-Hong Kong-Macao Joint Laboratory for Intelligent Micro-Nano Optoelectronic Technology, School of Physics and Optoelectronic Engineering, Foshan University, Foshan, China; School of Engineering, Hong Kong University of Science and Technology, Hong Kong, China; School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; School of Mechanical Electronic and Information Engineering, China University of Mining and Technology, Beijing, China

Abstract:
Infrared and visible image fusion has emerged as a prominent research area in computer vision. However, little attention has been paid to the fusion task in complex scenes, leading to sub-optimal results under interference. To fill this gap, we propose a unified framework for infrared and visible image fusion in complex scenes, termed UMCFuse. Specifically, we classify the pixels of visible images according to the degree of scattering of light transmission, allowing us to separate fine details from overall intensity. Maintaining a balance between interference removal and detail preservation is essential for the generalization capacity of the proposed method. Therefore, we propose an adaptive denoising strategy for the fusion of detail layers. Meanwhile, we fuse the energy features from different modalities by analyzing them from multiple directions. Extensive fusion experiments on real and synthetic complex-scene datasets, covering adverse weather conditions, noise, blur, overexposure, and fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection, and depth estimation, consistently indicate the superiority of the proposed method over recent representative methods. Our code is available at https://github.com/ixilai/UMCFuse

Abstract:
Creating a comprehensively representative image while maintaining the merits of various modalities is a key focus of current Multi-Modality Image Fusion research. Existing unified methods often struggle to handle varying types of degradation while extracting modality-shared and modality-specific information from source images, leading to limitations in their generative or representation capabilities under different conditions. To address the challenge, we propose MVFusion, a novel self-supervised masked variational autoencoder framework that simultaneously enhances generative training and representation learning. It is designed to cope with varying image quality and dataset composition with a unified framework while ensuring effective fusion of modality information. Specifically, MVFusion employs a self-supervised masked autoencoder to reduce the impact of redundancy and degradation in the source images, and thus learns the latent distribution of degraded input images in the generative training stage. In addition, we incorporate variational feature learning to further preserve the distinctive modality features in the representation learning stage. Extensive experiments demonstrate that our model achieves promising results in several classical fusion tasks, including infrared-visible, multi-focus, multi-exposure, and medical image fusion. The code is available at https://github.com/shiboneng/MVFusion

Abstract:
For the interpretability of deep neural networks (DNNs) in visual-related tasks, existing explanation methods commonly generate a saliency map based on the linear relation between output results and input features. However, when the explanation conflicts with a human visual examination, these methods do not provide further evidence to analyze the saliency explanation. Most may fail to provide feature attribution with identifiable semantics or produce misleading explanations due to their insufficient robustness. In this paper, we first propose four key characteristics (richness, adaptivity, exclusiveness, and fairness) to evaluate the existing linear relation-based explanation method, and then construct an interpretable linear model to satisfy them. We formalize the characteristics and develop a novel explanation method based on this. We extract and reconstruct key exclusive semantic features from the feature map using the Nonnegative Matrix Factorization (NMF) algorithm, utilize the information entropy model to determine the number of features adaptively and their richness, and then linearly combine each feature with fairly assigned weights using an approximate Shapley algorithm to generate the saliency map. Compared with the state-of-the-art methods, our explanations of different datasets and DNNs are more convincing and robust in terms of Average drop (AD), Average increase (AI), Deletions (Del), and Insertions (Ins). Our supplementary experiments provide sufficient evidence that the four characteristics guarantee the feasibility of feature attribution analysis and enhance the quality of the resulting explanations.
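The abstract refers to an approximate Shapley algorithm for assigning feature weights. One common approximation is permutation sampling, sketched below on a toy value function; this is an assumed, generic estimator and may differ from the approximation the authors actually use. The names `shapley_monte_carlo` and `value_fn` are hypothetical.

```python
import random

def shapley_monte_carlo(value_fn, n_players, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings.

    value_fn: maps a frozenset of player indices to a real-valued payoff.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_permutations):
        rng.shuffle(players)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in players:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return [v / n_permutations for v in phi]

# Toy additive game: each feature contributes a fixed amount, so the
# Shapley value of each feature equals its own weight.
weights = [0.5, 1.5, 3.0]
estimate = shapley_monte_carlo(lambda s: sum(weights[i] for i in s), 3)
```

In a saliency setting, `value_fn` would be the model score obtained when only the features in the coalition are kept, which is where most of the computational cost lies.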

Abstract:
Recently, some works have integrated semi-supervised learning (SSL) techniques into deep clustering frameworks to enhance image clustering performance. However, they all require pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible, out-of-the-box application of SSL learners to the image clustering task. This work introduces ASD, an adaptor that enables the cold start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be used to assign cluster-level labels to the pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. ASD shows superior performance across various benchmarks against the latest deep image clustering approaches and only slight accuracy gaps compared with SSL methods using ground-truth labels, e.g., only 1.33% on CIFAR-10. Moreover, ASD can further boost the performance of existing SSL-embedded deep image clustering methods.

Abstract:
Text-to-image diffusion models are capable of producing high-quality images from textual descriptions; however, they present notable security concerns. These include the potential for generating Not-Safe-For-Work (NSFW) content, replicating artists’ styles without authorization, or creating deepfakes. Recent work has proposed concept erasure techniques to eliminate sensitive concepts from these models, aiming to mitigate the generation of undesirable content. Nevertheless, the robustness of these techniques against a wide range of adversarial inputs has not been comprehensively investigated. To address this challenge, we propose a novel two-stage optimization attack framework based on adversarial perturbations, referred to as Concept Embedding Adversary (CEA). By leveraging the cross-modal alignment priors of the CLIP model, CEA iteratively adjusts adversarial embedding vectors to approximate the semantic expression of specific target concepts. This process enables the construction of deceptive adversarial prompts that exploit diffusion models, compelling them to regenerate previously erased concepts. We evaluate concept erasure methods against diversified adversarial prompts targeting erased concepts, such as NSFW content, artistic styles, and objects. Extensive experimental results demonstrate that existing concept erasure methods are unable to completely eliminate target concepts. In contrast, the proposed CEA framework exploits residual vulnerabilities within the generative latent space through a two-stage optimization process. By achieving precise cross-modal alignment, CEA attains a significantly higher attack success rate (ASR) in regenerating erased concepts.

Abstract:
Videos account for a significant portion of internet traffic, and noise, whether from compression algorithms, low light, or sensor imperfections, deteriorates video quality. Ambient noise can also significantly diminish visual quality. Traditional CNN-based video denoising methods rely on convolutional filters with fixed sizes and receptive fields, excelling at capturing local patterns and short-range dependencies. However, CNNs often struggle to handle long-term dependencies or relationships that extend over larger spatial and temporal scales. These are vital for accurately removing noise while preserving essential video details, textures, and structures. To address this limitation, we propose a novel approach based on the UNet architecture that combines the strengths of convolutional neural networks (CNNs) and graph neural networks (GNNs) to preserve both local and global information and dependencies. In this approach, the CNN is followed by transformer attention, which is used to form a sparse graph. The spatiotemporal patches act as nodes, and the similarities between them represent edges. Integrating CNNs for local feature extraction with transformer attention and a GNN for long-term spatio-temporal relationships, for the first time in video denoising, improves the ability to model noise accurately, preserve fine details, and consequently denoise videos more accurately. Ablation studies demonstrate the effectiveness of the different modules and patch sizes on four noise types. The proposed method outperforms most SOTA video denoising algorithms in terms of both PSNR and SSIM at moderate computational cost, the exception being the Video Restoration Transformer (VRT).

Abstract:
Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder–decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate Semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.
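
The projection from Euclidean to hyperbolic space mentioned above can be sketched as follows. The abstract does not name the hyperbolic model, so the Poincaré ball with curvature parameter `c` and the exponential map at the origin are assumptions made for illustration; `expmap0` is a hypothetical helper name.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector onto the Poincare ball of curvature -c.

    Uses the exponential map at the origin:
        exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


# A Euclidean feature of norm 5 lands strictly inside the unit ball.
point = expmap0([3.0, 4.0])
```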

Abstract:
Adverse weather conditions such as snow, fog, and rain pose significant challenges to LiDAR-based perception models by introducing noise and corrupting point cloud measurements. To address this issue, we propose TripleMixer, a robust and efficient point cloud denoising network that integrates spatial, frequency, and channel-wise processing through three specialized mixer modules. TripleMixer effectively suppresses high-frequency noise while preserving essential geometric structures and can be seamlessly deployed as a plug-and-play module within existing LiDAR perception pipelines. To support the development and evaluation of denoising methods, we construct two large-scale simulated datasets, Weather-KITTI and Weather-NuScenes, covering diverse weather scenarios with dense point-wise semantic and noise annotations. Based on these datasets, we establish four benchmarks: Denoising, Semantic Segmentation (SS), Place Recognition (PR), and Object Detection (OD). These benchmarks enable systematic evaluation of denoising generalization, transferability, and downstream impact under both simulated and real-world adverse weather conditions. Extensive experiments demonstrate that TripleMixer achieves state-of-the-art denoising performance and yields substantial improvements across all downstream tasks without requiring retraining. Our results highlight the potential of denoising as a task-agnostic preprocessing strategy to enhance LiDAR robustness in real-world autonomous driving applications.

Abstract:
Hyperspectral image (HSI) change detection (CD) has become a powerful tool for analyzing subtle surface changes. However, the application of HSI CD is constrained by the limited availability of homogeneous HSIs. HSI-RGB multimodal CD addresses these limitations by collaboratively utilizing multi-source data. Although multimodal CD methods have achieved encouraging results, their performance often relies on the assumption that the training and test samples have similar distributions. Recently, some domain adaptive CD methods have been introduced. However, the additional modality differences in cross-domain multimodal CD pose challenges to existing domain adaptation techniques. To address these challenges, we propose a cycle-based frequency disentanglement diffusion model with self-training for cross-domain HSI-RGB multimodal CD, which explores a frequency-domain diffusion-driven self-training mechanism to enhance consistency in change representations across different modalities and domains. Specifically, a cyclic frequency domain disentanglement-based modality-domain alignment diffusion network is proposed to achieve modality and domain alignment within a unified diffusion framework. Subsequently, a curriculum-learning-based self-training dual-domain CD network is designed to process the aligned images; it leverages pseudo-label reliability to ensure stable transfer of prior knowledge while exploiting complementary features across modalities for collaborative CD. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in cross-domain multimodal CD tasks.

Abstract:
Existing multimodal point cloud quality assessment (PCQA) methods usually integrate 3D and 2D information to simulate human visual perception of distortions. However, because they do not consider spatial correspondence, they have difficulty learning consistent distortion representations from different modalities in the same region of the point cloud (PC). In addition, they ignore the heterogeneity of modalities and rely on complex fusion mechanisms (e.g., attention) to integrate multimodal features. Both issues lead to limited performance and increased computational complexity. To address these limitations, we propose a novel double alignment multimodal learning network (DA-Net), which introduces two key alignment strategies. The first is a spatial pre-alignment strategy, which generates an informative 2D patch for each 3D patch via an adaptive patch projection module (APPM), ensuring accurate spatial correspondence between modalities prior to feature extraction. The second is a uniform feature alignment strategy, which includes a feature disentanglement module (FDM) and a feature mapping module (FMM) to relieve the heterogeneity of modalities and guide the optimization of the 2D and 3D encoders. Finally, multimodal features are simply integrated and regressed to obtain the quality score. Experimental results demonstrate that DA-Net exhibits outstanding performance and generalization ability. It also achieves lower computational complexity than other multimodal PCQA methods. The source code of DA-Net will be available at https://github.com/Rphone/DA-Net

Abstract:
Threat Image Projection (TIP) is a convenient and effective means to expand X-ray baggage images, which is essential for training both security personnel and computer-aided screening systems. Existing methods are primarily divided into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former cast prohibited items acquisition and projection as two individual steps and rarely consider the style consistency between the source prohibited items and target X-ray images from different datasets, making them less flexible and reliable for practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and lack of interpretability, which significantly impact the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible and unsupervised end-to-end TIP framework, termed as Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP mainly applies three innovations: 1) reconstruct a pure prohibited item from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware style-adaptive projection module learns two modulation parameters pertinently based on the style of similar material objects in the target image to control the appearance of prohibited items; 3) a novel logarithmic form loss is well-designed based on the principle of TIP to optimize synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray dataset, and the results confirm that our framework can flexibly generate very realistic synthetic images without any limitations.

Abstract:
The generative priors of pre-trained latent diffusion models (DMs) have demonstrated great potential to enhance the visual quality of image super-resolution (SR) results. However, the noise sampling process in DMs introduces randomness in the SR outputs, and the generated contents can differ a lot with different noise samples. The multi-step diffusion process can be accelerated by distilling methods, but the generative capacity is difficult to control. To address these issues, we analyze the respective advantages of DMs and generative adversarial networks (GANs) and propose to partition the generative SR process into two stages, where the DM is employed for reconstructing image structures and the GAN is employed for improving fine-grained details. Specifically, we propose a non-uniform timestep sampling strategy in the first stage. A single timestep sampling is first applied to extract the coarse information from the input image, then a few reverse steps are used to reconstruct the main structures. In the second stage, we finetune the decoder of the pre-trained variational auto-encoder by adversarial GAN training for deterministic detail enhancement. Once trained, our proposed method, namely content consistent super-resolution (CCSR), allows flexible use of different diffusion steps in the inference stage without re-training. Extensive experiments show that with 2 or even 1 diffusion step, CCSR can significantly improve the content consistency of SR outputs while keeping high perceptual quality. Codes and models can be found at https://github.com/csslc/CCSR.

Abstract:
Vessel re-identification (ReID) serves as a foundational task for intelligent maritime transportation systems. To enhance maritime surveillance capabilities, this study investigates video-based vessel ReID, a critical yet underexplored task in intelligent transportation systems. The lack of relevant datasets has limited the progress of Video-based vessel ReID research work. We established ViV-ReID, the first publicly available large-scale video-based vessel ReID dataset, comprising 480 vessel identities captured from 20 cross-port camera views (7,165 tracklets and 1.14 million frames), establishing a benchmark for advancing vessel ReID from image to video processing. Videos offer significantly richer information than single-frame images. The dynamic nature of video often leads to fragmented spatio-temporal features causing disrupted contextual understanding, and to address this problem, we further propose a Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) that explicitly aligns spatio-temporal features using vessel structural priors. Extensive experiments on the ViV-ReID dataset demonstrate that image-based ReID methods often show suboptimal performance when applied to video data. Meanwhile, it is crucial to validate the effectiveness of spatio-temporal information and establish performance benchmarks for different methods. The Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) significantly outperforms state-of-the-art methods on ViV-ReID, confirming its efficacy in modeling vessel-specific spatio-temporal patterns. Project web page: https://vsislab.github.io/ViV_ReID/

Abstract:
This paper addresses the problem of human attention prediction in natural daily life from the third-person view. Due to the significance of this topic in various applications, researchers in the computer vision community have proposed many excellent models in the past few decades, and many models have begun to focus on natural daily life scenarios in recent years. However, existing mainstream models usually ignore the basic fact that human attention is a typical interdisciplinary concept. Specifically, the mainstream definition is direction-level or pixel-level, while many interdisciplinary studies argue for an object-level definition. Additionally, the mainstream model structure converges to the dual-pathway architecture or its variants, while the majority of interdisciplinary studies claim that attention is involved in the human-environment interaction procedure. Grounded on solid theories and studies in interdisciplinary fields including computer vision, cognition, neuroscience, psychology, and philosophy, this paper proposes a fine-grained Human-Environment-Object Interaction (HEOI) model, which for the first time integrates multi-granularity human cues to predict human attention. Our model is explainable and lightweight, and its effectiveness is validated by a wide range of comparison, ablation, and visualization experiments on two public datasets.

Abstract:
Most state-of-the-art object detection methods suffer from poor generalization due to the domain shift between the training and testing datasets. To resolve this challenge, unsupervised cross domain object detection is proposed to learn an object detector for an unlabeled target domain by transferring knowledge from an annotated source domain. Promising results have been achieved via Mean Teacher, however, pseudo labeling which is the bottleneck of mutual learning remains to be further explored. In this study, we find that confidence misalignment of the predictions, including category-level overconfidence, instance-level task confidence inconsistency, and image-level confidence misfocusing, leading to the injection of noisy pseudo labels in the training process, will bring suboptimal performance. Considering the above issue, we present a novel general framework termed Multi-Granularity Confidence Alignment Mean Teacher (MGCAMT) for unsupervised cross domain object detection, which alleviates confidence misalignment across category-, instance-, and image-levels simultaneously to refine pseudo labeling for better teacher-student learning. Specifically, to align confidence with accuracy at category level, we propose Classification Confidence Alignment (CCA) to model category uncertainty based on Evidential Deep Learning (EDL) and filter out the category incorrect labels via an uncertainty-aware selection strategy. Furthermore, we design Task Confidence Alignment (TCA) to mitigate the instance-level misalignment between classification and localization by enabling each classification feature to adaptively identify the optimal feature for regression. Finally, we develop imagery Focusing Confidence Alignment (FCA) adopting another way of pseudo label learning, i.e., we use the original outputs from the Mean Teacher network for supervised learning without label assignment to achieve a balanced perception of the image’s spatial layout. 
When these three procedures are integrated into a single framework, they mutually benefit to improve the final performance from a cooperative learning perspective. Extensive experiments across multiple scenarios demonstrate that our method outperforms large foundational models, and surpasses other state-of-the-art approaches by a large margin.
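
The uncertainty-aware selection in CCA can be illustrated with the standard Evidential Deep Learning formulation, in which non-negative evidence parameterizes a Dirichlet distribution. This is a minimal sketch of that general formulation, not the authors' implementation; the threshold rule, the function names, and the toy evidence values are assumptions.

```python
def edl_opinion(evidence):
    """Turn per-class evidence into belief masses and an uncertainty mass.

    Standard EDL: alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so that sum(b) + u == 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def select_confident(pseudo_labels, max_uncertainty):
    """Keep pseudo labels whose uncertainty is at most max_uncertainty (assumed rule)."""
    kept = []
    for label, evidence in pseudo_labels:
        _, u = edl_opinion(evidence)
        if u <= max_uncertainty:
            kept.append(label)
    return kept


# Toy pseudo labels with made-up evidence vectors for three categories.
candidates = [("car", [9.0, 0.0, 0.0]), ("person", [0.5, 0.5, 0.5])]
kept = select_confident(candidates, max_uncertainty=0.5)
```

A prediction with strong evidence for one class gets low uncertainty and is kept, while a prediction with little evidence overall is filtered out regardless of its softmax-style ranking.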

Abstract:
Efficient and highly accurate lightweight gaze estimation methods have been receiving increasing research attention due to the emergence of mobile interactive platforms such as mobile devices and AR/VR. State-of-the-art deep learning-based gaze estimation models suffer from either heavy computational architectures, which are infeasible for mobile deployment, or limited generalization capability, which prevents them from handling the large diversity in eye texture or distinguishing subtle/frequent pupil movement. To mitigate the above challenges, we propose a novel lightweight network structure featuring a deformable approximate large kernel, which can effectively extend the receptive field to handle complicated eye movement and highly varying eye/gaze region appearance within a very tight computational budget. In the meantime, we embed the training of the gaze estimator into a control information extraction module, which serves as a gaze-parameter input that modularizes a large generative model (Stable Diffusion V1.5) to output gaze-specific eye images. In this way, the strong generalization capability of the large generative model can be implicitly distilled into our lightweight gaze model. Extensive comparisons with various state-of-the-art gaze estimation methods demonstrate the superiority of our proposed model and training scheme in terms of both accuracy and model complexity.

Abstract:
The scope of point cloud (PC) applications is expanding. We propose a no-reference bitstream-layer quality assessment model that eliminates the need for full decoding of the PC, providing quality evaluation scores during the V-PCC decoding process. Specifically, we illustrate the relationship between content diversity (CD) and perceptual coding distortion in lossless geometric coding. Subsequently, we model attribute distortion by predicting CD using transform energy (TE) and texture quantization parameter (TQP). By combining the geometric distortion model with geometry quantization parameters (GQP) and the attribute distortion model, we derive comprehensive quality prediction results. Our experimental results on four PC databases (WPC2.0, M-PCCD, VSENSE VVDB and VSENSE VVDB2) show that the proposed energy-adaptive bitstream-layer model (EABL) delivers competitive quality prediction performance in comparison with existing full-reference, reduced-reference and no-reference PC quality assessment models that require full decoding, and meanwhile exhibits large speed advantage. The source code will be made publicly available for repeatability research at https://github.com/arthas-sws/EABL_model.

Abstract:
The fusion of a hyperspectral image (HSI) and a multispectral image (MSI) is an effective means of overcoming the inherently low spatial resolution of the HSI. However, existing fusion methods usually rigidly upgrade the spatial resolution of the HSI to that of the matching MSI under the ideal assumption that multi-source images are accurately registered. In real scenes, where multi-source images are difficult to register perfectly and the spatial resolution requirements vary dynamically, these fusion algorithms are difficult to deploy effectively. To this end, we construct the spatial-spectral consistent arbitrary scale observation model (S2cAsOM) to model the dependence between the unregistered HSI and MSI and the ideal arbitrary resolution HSI. Furthermore, an optimization algorithm is designed to solve S2cAsOM, and a deep interpretable arbitrary resolution fusion network (IR&ArF) is proposed to simulate the optimization process, which achieves the model-data dual-driven arbitrary resolution fusion of unregistered HSI and MSI. IR&ArF robustly breaks the dependence of traditional fusion methods on the accuracy of image registration and can flexibly cope with the dynamic requirements of diverse applications for the spatial resolution of HSI, which improves the applicability of HSI fusion in real scenes. Extensive systematic experiments demonstrate the superiority and generalization of the proposed method. Source code of the proposed method is available at https://github.com/Jiahuiqu/IR-ArF.

Abstract:
In 3D microscopic imaging, the extremely shallow depth of field presents a challenge for accurate 3D reconstruction in cases of significant defocus. Traditional calibration methods rely on the spatial extraction of feature points to establish spatial 3D information as the optimization objective. However, these methods suffer from reduced extraction accuracy under defocus conditions, which causes degradation of calibration performance. To extend calibration volume without compromising accuracy in defocused scenarios, we propose a per-pixel calibration based on multi-view 3D reconstruction errors. It utilizes 3D reconstruction errors among different binocular setups as an optimization objective. We first analyze multi-view 3D reconstruction error distributions under the poor-accuracy optical model by employing a multi-view microscopic 3D measurement system using telecentric lenses. Subsequently, the 3D proportion model is proposed for implementing our error-based per-pixel calibration, derived as a spatial linear expression directly correlated with the 3D reconstruction error distribution. The experimental results confirm the robust convergence of our method with multiple binocular setups. Near the focus volume, the multi-view 3D reconstruction error remains approximately 8 μm (less than 0.5 camera pixel pitch), with absolute accuracy maintained within 0.5% of the measurement range. Beyond tenfold depth of field, the multi-view 3D reconstruction error increases to around 30 μm (still less than 2 camera pixel pitches), while absolute accuracy remains within 1% of the measurement range. These high-precision measurement results validate the feasibility and accuracy of our proposed calibration.

Abstract:
Ultrasound Localization Microscopy (ULM) is a non-invasive technique that allows for the imaging of micro-vessels in vivo, at depth and with a resolution on the order of ten microns. ULM is based on the sub-resolution localization of individual microbubbles injected in the bloodstream. Mapping the whole angioarchitecture requires the accumulation of microbubble trajectories from thousands of frames, typically acquired over a few minutes. ULM acquisition times can be reduced by increasing the microbubble concentration, but this requires more advanced algorithms to detect the microbubbles individually. Several deep learning approaches have been proposed for this task, but they remain limited to 2D imaging, in part due to the associated large memory requirements. Herein, we propose the use of sparse tensor neural networks to enable deep learning-based 3D ULM by improving memory scalability with increased dimensionality. We study several approaches to efficiently convert ultrasound data into a sparse format and study the impact of the associated loss of information. When applied in 2D, the sparse formulation reduces the memory requirements by a factor of 2 at the cost of a small reduction in performance when compared against dense networks. In 3D, the proposed approach reduces memory requirements by two orders of magnitude while largely outperforming conventional ULM in high-concentration settings. We show that sparse tensor neural networks in 3D ULM allow for the same benefits as dense deep learning-based methods in 2D ULM, i.e., the use of higher concentration in silico and reduced acquisition time.
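
One simple way to convert dense data into a sparse format is to keep only voxels whose magnitude exceeds a threshold and store them as coordinate/value pairs. The abstract says several conversion approaches are studied without detailing them, so the thresholding rule below is an assumption and `dense_to_sparse` is a hypothetical helper; it is meant only to show where the loss of information comes from.

```python
def dense_to_sparse(volume, threshold):
    """Convert a dense 3D volume (nested lists) to coordinate/value lists.

    Voxels with |value| <= threshold are discarded, which is the source of
    the information loss: weak signals below the threshold are not kept.
    """
    coords, values = [], []
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                if abs(v) > threshold:
                    coords.append((z, y, x))
                    values.append(v)
    return coords, values


# Toy 2x2x2 volume (illustrative values only).
volume = [
    [[0.0, 0.9], [0.05, 0.0]],
    [[0.0, 0.0], [0.7, 0.01]],
]
coords, values = dense_to_sparse(volume, threshold=0.1)
```

Here only 2 of the 8 voxels are stored, and the two small non-zero values (0.05 and 0.01) are lost.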

Abstract:
NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by innovatively utilizing NeRF to enhance representation learning. Despite its notable performance, we uncover three decisive shortcomings in its current design, including semantic ambiguity, inappropriate sampling, and insufficient utilization of depth supervision. To combat the aforementioned problems, we present three corresponding solutions: 1) Semantic Enhancement. We project the freely available 3D segmentation annotations onto the 2D plane and leverage the corresponding 2D semantic maps as the supervision signal, significantly enhancing the semantic awareness of multi-view detectors. 2) Perspective-Aware Sampling. Instead of employing the uniform sampling strategy, we put forward the perspective-aware sampling policy that samples densely near the camera while sparsely in the distance, more effectively collecting the valuable geometric clues. 3) Ordinal Residual Depth Supervision. As opposed to directly regressing the depth values that are difficult to optimize, we divide the depth range of each scene into a fixed number of ordinal bins and reformulate the depth prediction as the combination of the classification of depth bins as well as the regression of the residual depth values, thereby benefiting the depth learning process. The resulting algorithm, NeRF-Det++, has exhibited appealing performance in the ScanNetV2 and ARKITScenes datasets. Notably, in ScanNetV2, NeRF-Det++ outperforms the competitive NeRF-Det by +1.9% in mAP@0.25 and +3.5% in mAP@0.50. The code will be publicly available at https://github.com/mrsempress/NeRF-Detplusplus
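
The bin-plus-residual reformulation of depth in item 3 can be sketched as an encode/decode pair. The abstract only states that the depth range is split into a fixed number of ordinal bins; uniform bin widths, clipping to the range, and the function names below are assumptions for illustration.

```python
def encode_depth(depth, d_min, d_max, num_bins):
    """Split a depth value into (bin index, residual inside that bin).

    Assumption: bins are uniform over [d_min, d_max]; out-of-range depths are clipped.
    """
    bin_width = (d_max - d_min) / num_bins
    clipped = min(max(depth, d_min), d_max)
    index = min(int((clipped - d_min) / bin_width), num_bins - 1)
    residual = clipped - (d_min + index * bin_width)
    return index, residual


def decode_depth(index, residual, d_min, d_max, num_bins):
    """Recover the depth from a classified bin and a regressed residual."""
    bin_width = (d_max - d_min) / num_bins
    return d_min + index * bin_width + residual


# A depth of 2.3 m in a 0-4 m scene with 8 bins (0.5 m each) falls in bin 4.
index, residual = encode_depth(2.3, d_min=0.0, d_max=4.0, num_bins=8)
restored = decode_depth(index, residual, d_min=0.0, d_max=4.0, num_bins=8)
```

The classification target is the bin index and the regression target is the small residual, which is bounded by the bin width instead of spanning the whole depth range.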

Abstract:
Automatic nuclei segmentation and classification (NSC) is a fundamental prerequisite in digital pathology analysis as it enables the quantification of biomarkers and histopathological features for precision medicine. Nuclei appear small; however, global spatial distribution and brightness contrast, or color correlation between the nucleus and background, have been recognized as key rationales for accurate nuclei segmentation in actual clinical practice. Although great breakthroughs in medical image segmentation have recently been achieved by Transformer-based methods, their adaptability to segmenting and classifying nuclei from histopathological images is rarely investigated. Also, severe overlap of nuclei and large intra-class variability are common in clinical wild data. Prevailing methods based on polygonal representations or distance maps are limited by empirically designed post-processing strategies, resulting in ineffective segmentation of large irregular nuclei instances. To address these challenges, we propose a keypoint-guided tri-decoder Transformer (PointFormer) for simultaneous NSC. Specifically, the overall NSC task is decoupled into a multi-task learning problem, where a tri-decoder structure is employed for decoding nuclei instances, edges, and types, respectively. The nuclei detection and classification (NDC) subtask is reformulated as a semantic keypoint estimation problem. Meanwhile, we introduce a novel attention-guiding strategy to capture strong inter-branch correlations and mitigate inconsistencies between multi-decoder predictions. Finally, a multi-local perception module is designed as the base building block of PointFormer to achieve local and global trade-offs and reduce model complexity.
Comprehensive quantitative and qualitative experimental results on three datasets of different volumes have demonstrated the superiority of the proposed method over prevalent methods, especially for the PanNuke dataset with an achievement of 70.6% on bPQ.

Abstract:
Recent advancements in diffusion models (DMs) have showcased superior capabilities in generating images and text. This paper first introduces DMs for image change captioning (ICC) and proposes a novel Context-aware Contrastive Diffusion model with Mediator-bridging Cross-modal Transformer (MCT-CCDiff) to accurately predict visual difference descriptions conditioned on two similar images. Technically, MCT-CCDiff develops a Text Embedding Contrastive Loss (TECL) that leverages both positive and negative samples to more effectively distinguish text embeddings, thus generating more discriminative text representations for ICC. To accurately predict visual difference descriptions, MCT-CCDiff introduces a Mediator-bridging Cross-modal Transformer (MCTrans) designed to efficiently explore the cross-modal correlations between visual differences and corresponding text by using a lightweight mediator, mitigating interference from visual redundancy and reducing interaction overhead. Additionally, it incorporates context-augmented denoising to further understand the contextual relationships within caption words implemented by a revised diffusion loss, which provides a tighter optimization bound, leading to enhanced optimization effects for high-quality text generation. Extensive experiments conducted on four benchmark datasets clearly demonstrate that our MCT-CCDiff significantly outperforms state-of-the-art methods in the field of ICC, marking a notable advancement in the generation of precise visual difference descriptions.

Affiliations: National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen, Guangdong, China; Department of Computer and Information Science, University of Macau, Macau, SAR, China; Industrial Artificial Intelligence Centre, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China

Abstract:
Class activation mapping (CAM) methods have garnered considerable research attention because they can be used to interpret the decision-making of deep convolutional neural network (CNN) models and provide initial masks for weakly supervised semantic segmentation (WSSS) tasks. However, the class activation maps generated by most CAM methods usually have two limitations: 1) a lack of the ability to cover the whole object when using low-level features; and 2) introducing background noise. To mitigate these issues, an innovative Context-level weights-based CAM (Context-CAM) method is proposed, which guarantees: 1) the non-discriminative regions that have similar appearances and are located close to the discriminative regions can also be highlighted by the newly designed Region-Enhanced Mapping (REM) module using context-level weights; and 2) the background noises are gradually eliminated via a newly proposed Semantic-guided Reverse Sequence Fusion (SRSF) strategy that can sequentially denoise and fuse the region-enhanced maps from the last layer to the first layer. Extensive experimental results show that our Context-CAM can generate higher-quality class activation maps than classic and state-of-the-art (SOTA) CAM methods in terms of the Energy-Based Pointing Game (EBPG) score, and the improvements are up to 35.49% when compared to the second-best method. Moreover, for WSSS tasks, our Context-CAM can directly replace the CAM method used in existing WSSS methods without any architectural modification to further improve the segmentation performance. Our code is available at https://github.com/cwb0611/Context-CAM.
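
The Energy-Based Pointing Game (EBPG) score used for evaluation is commonly defined as the fraction of saliency energy that falls inside the ground-truth bounding box. The sketch below follows that common definition; the box convention (top, left, bottom, right with exclusive bottom/right) and the function name are assumptions, and the toy map is made up.

```python
def energy_based_pointing_game(saliency, box):
    """Fraction of total saliency energy located inside a bounding box.

    saliency: 2D list of non-negative values.
    box: (top, left, bottom, right), with bottom and right exclusive (assumed).
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0
    inside = sum(sum(row[left:right]) for row in saliency[top:bottom])
    return inside / total


# Toy 3x3 activation map: 3 units of energy inside the box, 1 unit outside.
saliency = [
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 0.0, 0.0],
]
score = energy_based_pointing_game(saliency, box=(1, 1, 2, 2))
```

A map that leaks activation onto the background scores lower, which is why background noise suppression directly affects this metric.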

Abstract:
Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, thereby hindering their performance. This arises since fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, proving its superiority in addressing the domain shift for long-term AQA.

Abstract:
The surface quality of steel materials is significantly influenced by processing conditions, which may result in roughness, flatness deviations, and various surface defects. However, the diversity of defect types and the limited size of labeled datasets pose challenges for accurate and efficient defect identification. To address these challenges, this paper proposes a multiobjective evolutionary multiscale Transformer incorporating fractal features for surface quality analytics of steel materials. Specifically, a multiscale Transformer is constructed, consisting of the convolutional tokenization architecture embedded with the multiscale attention module (MAM) and stacked Transformer encoders, enabling the model to effectively capture both morphological patterns and local defect details. In addition, a novel fractal dimension feature fusion module (FDFFM) is introduced to describe the irregularity of defect textures, enhancing feature representation. To achieve a balance between recognition accuracy and model complexity, a multiobjective evolutionary algorithm (MOEA) is employed, with the final model selected based on a knee point selection strategy to support decision-making. Experimental results validate the superior performance and efficiency of MOEA-FM-Trans compared to state-of-the-art methods.
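
Knee point selection picks, from a Pareto front, the solution where improving one objective starts to cost disproportionately in the other. The abstract does not specify the exact rule, so the sketch below assumes two minimized objectives (for example error and model complexity), normalizes both to [0, 1], and chooses the point farthest below the line joining the two extreme solutions. The function name and the toy front are hypothetical.

```python
def knee_point(front):
    """Pick a knee point from a two-objective Pareto front (both minimized).

    After min-max normalization the extreme solutions sit at (0, 1) and (1, 0),
    so the distance to the line joining them is proportional to 1 - x - y.
    """
    xs = [p[0] for p in front]
    ys = [p[1] for p in front]

    def normalize(value, low, high):
        return 0.0 if high == low else (value - low) / (high - low)

    best, best_gap = None, float("-inf")
    for p in front:
        gap = 1.0 - normalize(p[0], min(xs), max(xs)) - normalize(p[1], min(ys), max(ys))
        if gap > best_gap:
            best, best_gap = p, gap
    return best


# Made-up (error, complexity) pairs, for illustration only.
front = [(1.0, 10.0), (2.0, 4.0), (5.0, 3.0), (9.0, 1.0)]
chosen = knee_point(front)
```

In this toy front, moving from (1.0, 10.0) to (2.0, 4.0) cuts complexity sharply for a small error increase, so that point is selected.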

Abstract:
It is challenging for active contour models (ACMs) to segment weak-edge and noisy images efficiently and accurately. To solve this problem, a novel ACM is proposed in this work. The proposed ACM achieves high-precision segmentation for weak-edge and noisy images using a non-local feature fitting energy function and a scalable normalization method. The non-local feature fitting energy function is constructed based on the distances calculated by Jeffreys divergence between non-local weighted fitting images and the image processed by the non-local means (NLM) algorithm. The non-local weighted fitting images include the fitting foreground and background with image edge features. The image processed by the NLM algorithm is used to reduce the influence of noise. The data-driven term, obtained by minimizing the non-local feature fitting energy function, is computed before the level set iteration, which improves the computation speed. In addition, a scalable normalization method is proposed to normalize the data-driven term. The ability to distinguish the targets from the background for different types of images is enhanced by adjusting a scaling factor, improving the robustness and accuracy of the proposed model. Experimental results demonstrate the advantages of the proposed model.

Abstract:
Camouflaged Object Detection (COD) aims to segment objects resembling their environment. To address the challenges of extensive annotations and complex optimizations in supervised learning, recent prompt-based segmentation methods excavate insightful prompts from Large Vision-Language Models (LVLMs) and refine them using various foundation models. These are subsequently fed into the Segment Anything Model (SAM) for segmentation. However, due to the hallucinations of LVLMs and insufficient image-prompt interactions during the refinement stage, these prompts often struggle to capture well-established class differentiation and localization of camouflaged objects, resulting in performance degradation. To provide SAM with more informative prompts, we present UpGen, a pipeline that prompts SAM with generative prompts without requiring training, marking a novel integration of generative models with LVLMs. Specifically, we propose the Multi-Student-Single-Teacher (MSST) knowledge integration framework to alleviate hallucinations of LVLMs. This framework integrates insights from multiple sources to enhance the classification of camouflaged objects. To enhance interactions during the prompt refinement stage, we are the first to leverage generative models on real camouflage images to produce SAM-style prompts without fine-tuning. By capitalizing on the unique learning mechanism and structure of generative models, we effectively enable image-prompt interactions and generate highly informative prompts for SAM. Our extensive experiments demonstrate that UpGen outperforms weakly-supervised models and its SAM-based counterparts. We also integrate our framework into existing weakly-supervised methods to generate pseudo-labels, resulting in consistent performance gains. Moreover, with minor adjustments, UpGen shows promising results in open-vocabulary COD, referring COD, salient object detection, marine animal segmentation, and transparent object segmentation.

Abstract:
Scene text reading plays a crucial role in scene understanding. As its precondition task, scene text detection has garnered increasing interest from researchers. Segmentation-based text detection methods have gained prominence due to their adaptable pixel-level predictions. Many existing methods predict the shrink mask and utilize the Vatti clipping algorithm to reconstruct text contours. However, the shrink mask only focuses on the global geometry feature and shrinks the same distance everywhere, which neglects local contour information and disrupts the instance shape feature. In addition, the post-processing based on the Vatti clipping algorithm heavily relies on the predictions and is relatively complex, causing suboptimal performance in both detection accuracy and efficiency. To address the above problems, we propose an efficient and effective method named Magnetic Text Detector (MTD), inspired by magnetism. It consists of a text representation method, the flexible mask (FM), and a magnetic pull module (MPM). Unlike the shrink mask and concentric mask, the former attends to the local contours and shrinks by different distances at different positions, which avoids the truncation issue while preserving distinctiveness from the text regions. The latter generates magnetic fields and pulls pole points of FM to the text contour by magnetism. This allows accurate reconstruction of text contours, even when predictions deviate severely from the actual text, while saving approximately 50% of the post-processing time. Several ablation studies verify the effectiveness of the proposed FM and MPM. Extensive experiments show that our MTD achieves state-of-the-art (SOTA) performance on multiple datasets from different scenes. The code is available at https://github.com/fengmulin/MTD

Abstract:
The spike camera is a neuromorphic sensor that can capture high-speed dynamic scenes by firing a continuous stream of binary spikes with extremely high temporal resolution, essentially forming a dense sampling in the temporal dimension. Due to the relative motion between camera and scene, each pixel actually samples a large number of different spatial positions on the object in a short period. By converting this dense sampling from the temporal dimension to the spatial domain, high-resolution images can be reconstructed from the spike stream. However, spike fluctuations and large motion in high-speed scenes pose great challenges for this task, especially for intensity information extraction and temporal alignment. In this paper, we propose a spike camera super-resolution network to address these issues. Considering the local temporal correlation of the spike stream and correlation consistency within a local region, we introduce a representation module that performs region-adaptive temporal filtering on spikes to mitigate fluctuations and extract stable intensity information from binary data. Additionally, we develop a module for multi-frame feature alignment, leveraging the long-term temporal information of the spike stream. To handle large motions, we propagate the motion information from neighboring moments to the current feature alignment module, which provides a prior that helps to narrow the search range for the current motion offset, improving the accuracy of temporal alignment. Experimental results demonstrate that the proposed network achieves state-of-the-art performance on synthetic and real-captured spike data.
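
To make the idea of extracting intensity from binary spikes concrete, the following is a minimal sketch of a fixed-window temporal average for a single pixel. It is a simple baseline, not the region-adaptive filter described in the abstract. The function name `firing_rate` and the window size are assumptions.

```python
def firing_rate(spikes, t, half_window):
    """Estimate intensity at time index t as the mean spike count in a window.

    `spikes` is the binary (0/1) spike sequence of one pixel.
    Assumption: a fixed symmetric window; the paper's filter adapts per region.
    """
    lo = max(0, t - half_window)
    hi = min(len(spikes), t + half_window + 1)
    window = spikes[lo:hi]
    return sum(window) / len(window)

# A pixel that fires on every other sample.
spikes = [0, 1, 0, 1, 0, 1, 0, 1]
estimate = firing_rate(spikes, t=3, half_window=2)
```

A wider window smooths spike fluctuations but blurs fast motion, which is the trade-off that motivates adaptive filtering.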

Abstract:
Robust analysis of 3D point cloud data is essential for high-precision applications such as autonomous driving and industrial automation, where models need to consistently perform under complex and unpredictable real-world conditions. Current strategies, including data augmentation techniques and robust network designs, often struggle to effectively capture dynamic disturbances or accommodate the spatial variations of point clouds, thus lacking the required flexibility across diverse environments. To overcome these limitations, we propose a novel methodology to improve the robustness of 3D point cloud processing systems. Our approach simulates generalized corrupted input samples during training, using Radial Basis Functions (RBF) to model smooth deformations based on the control points. These deformations are applied selectively to different regions of the point cloud, adapting to spatial heterogeneity based on local density and geometric complexity. While generating these samples, we employ a combined adversarial loss that simultaneously induces model errors and maximizes the difference in internal feature distributions between the original and perturbed data. Additionally, we introduce a sub-network for distortion-guided feature augmentation to enhance important features while suppressing unreliable ones. This sub-network estimates distortion levels by compressing features and identifying discrepancies, then adjusts the feature extraction process accordingly. Experimental results demonstrate that our method outperforms existing approaches on both Computer-Aided Design (CAD) models and real-world LiDAR datasets, enhancing model resilience and accuracy in handling diverse 3D scenarios.
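
As a hedged illustration of an RBF-driven smooth deformation, the sketch below displaces each point by a Gaussian-kernel-weighted sum of control-point offsets. The kernel choice, the `sigma` parameter, and the function name `rbf_deform` are assumptions; the abstract does not specify them, and the adversarial selection of offsets is omitted.

```python
import math

def rbf_deform(points, controls, offsets, sigma=0.5):
    """Displace 3D points by a Gaussian-RBF-weighted sum of control offsets.

    points, controls: lists of (x, y, z); offsets: one (dx, dy, dz) per control.
    Assumption: Gaussian kernel with a shared bandwidth `sigma`.
    """
    deformed = []
    for p in points:
        shift = [0.0, 0.0, 0.0]
        for c, d in zip(controls, offsets):
            r2 = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            w = math.exp(-r2 / (2.0 * sigma ** 2))  # weight decays with distance
            for k in range(3):
                shift[k] += w * d[k]
        deformed.append(tuple(pi + si for pi, si in zip(p, shift)))
    return deformed

points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
controls = [(0.0, 0.0, 0.0)]
offsets = [(0.0, 0.0, 0.1)]
warped = rbf_deform(points, controls, offsets)
```

The point sitting on the control point moves by the full offset, while the distant point is effectively untouched, which is what keeps the deformation local and smooth.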

Abstract:
Accurate segmentation of blood vessels is essential for various clinical assessments and postoperative analyses. However, the inherent challenges of vascular imaging—such as sparsity, fine granularity, low contrast, data distribution variability, and the critical need for preserving topological integrity—make generalized vessel segmentation particularly complex. While specialized segmentation methods have been developed for specific anatomical regions, their over-reliance on tailored models hinders broader applicability and generalization. General-purpose segmentation models introduced in medical imaging often fail to address critical vascular characteristics, including the connectivity of segmentation results. In this study, we propose OVS-Net, an optimized vessel segmentation framework designed to generalize across diverse vessel structures and imaging modalities. It introduces a dual-branch architecture design for improving small vessel segmentation and a morphology-aware correction module to preserve vascular topology and connectivity. We compiled a comprehensive multi-modality dataset from 17 datasets to train and benchmark the proposed OVS-Net against 6 SAM-based methods and 17 expert models under various conditions. The results demonstrate that our approach achieves superior segmentation accuracy, generalization, and a 34.6% improvement in connectivity, underscoring its potential for clinical applications. The code and dataset information are available at https://github.com/Hk416mod2/OVS-Net.

Abstract:
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations; and 2) an erase-and-paste mechanism that, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Together, these components form a cyclic “Prompt-Restore-Prompt” pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt

Abstract:
Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged continuously. However, their capacity to query information depicted in visual charts and engage in reasoning based on the queried contents remains under-explored. In this paper, to comprehensively and rigorously benchmark the ability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. Besides, we develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns, such as reasoning tasks in the field of charts or geometric images. We evaluate the chart-related ability of mainstream MLLMs and our ChartVLM on the proposed ChartX evaluation set. Extensive experiments demonstrate that ChartVLM surpasses both versatile and chart-related large models, including GPT-4V. We believe that our study can pave the way for further exploration in creating a more comprehensive chart evaluation set and developing more interpretable multi-modal models. Both ChartX and ChartVLM are available at: https://github.com/Alpha-Innovator/ChartVLM

Abstract:
At present, deep face recognition models working on millions of images are confronted with the challenge that such large-scale datasets are often corrupted with noises and mislabeled identities, yet most deep models are primarily designed for clean datasets. In this paper, we propose a robust deep face recognition model that integrates the strength of margin-based learning models with the strength of mining-based approaches to effectively mitigate the impact of noises during training. By monitoring the recognition performances at a batch level to provide optimization-oriented feedback, we introduce a noise-adaptive mining strategy to dynamically adjust the emphasis balance between hard and noise samples, enabling direct training on noisy datasets without the requirement of pre-training. With a novel anti-noise loss function, learning is empowered for direct and robust training on noisy datasets, yet its effectiveness over clean datasets is still preserved, sustaining effective mining of both clean and noisy samples whilst weakening its learning intensiveness over noisy samples. Extensive experiments reveal that: (i) our proposed model achieves competitive performances in comparison with representative existing SoTA models when trained with clean datasets; (ii) when trained with both real-world and synthesized noisy datasets, our proposed model significantly outperforms the existing models, especially when the synthesized datasets are corrupted with both close-set and open-set noises; (iii) while the existing deep models suffer from an average performance drop of around 20% over noise-corrupted large scale datasets, our proposed model still delivers accuracy rates of more than 95%. Our source codes are publicly available on GitHub.

Abstract:
Current object detectors often suffer performance degradation when applied to cross-domain scenarios, particularly under challenging visual conditions such as nighttime scenes. This is primarily due to the I3 problems: Inadequate sampling of instance-level features, Indistinguishable feature representation across domains and Inaccurate generation for identical category participation. To address these challenges, we propose a domain-adaptive detection framework that enables robust generalization across different visual domains without introducing any additional inference overhead. The framework comprises three key components. First, the centerness–category consistency sampler alleviates inadequate sampling by selecting representative instance-level features, while the paired centerness consistency loss enforces alignment between classification and localization. Second, VLM-based orthogonality enhancement leverages frozen vision–language encoders with an orthogonal projection loss to improve cross-domain feature distinguishability. Third, a hallucination feature generator synthesizes robust instance-level features for missing categories, ensuring balanced category participation across domains. Extensive experiments on multiple datasets covering various domain adaptation and generalization settings demonstrate that our method consistently outperforms state-of-the-art detectors, achieving up to 5.5 mAP improvement, with particularly strong gains in nighttime adaptation.

Abstract:
Static meshes with texture maps have attracted considerable attention in both industrial manufacturing and academic research, leading to an urgent requirement for effective and robust objective quality evaluation. However, current model-based static mesh quality metrics (i.e., metrics that directly use the raw data of the static mesh to extract features and predict the quality) have obvious limitations: most of them only consider geometry information, while color information is ignored, and they have strict constraints for the meshes’ geometrical topology. Other metrics, such as image-based and point-based metrics, are easily influenced by the preprocessing algorithms, e.g., projection and sampling, hampering their ability to perform at their best. In this paper, we propose Geodesic Patch Similarity (GeodesicPSIM), a novel model-based metric to accurately predict human perception quality for static meshes. After selecting a group of keypoints, 1-hop geodesic patches are constructed based on both the reference and distorted meshes cleaned by an effective mesh cleaning algorithm. A two-step patch cropping algorithm and a patch texture mapping module refine the size of 1-hop geodesic patches and build the relationship between the mesh geometry and color information, resulting in the generation of 1-hop textured geodesic patches. Three types of features are extracted to quantify the distortion: patch color smoothness, patch discrete mean curvature, and patch pixel color average and variance. To the best of our knowledge, GeodesicPSIM is the first model-based metric especially designed for static meshes with texture maps. GeodesicPSIM provides state-of-the-art performance in comparison with image-based, point-based, and video-based metrics on a newly created and challenging database. We also demonstrate the robustness of GeodesicPSIM under different hyperparameter settings. Ablation studies also show the effectiveness of the three proposed features and the patch cropping algorithm. The code is available at https://multimedia.tencent.com/resources/GeodesicPSIM.

Affiliations: State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi’an, Shaanxi, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; Key Laboratory of Collaborative Intelligent Systems, Ministry of Education, Xidian University, Xi’an, China; School of Artificial Intelligence, Xidian University, Xi’an, China; Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China; School of Electronic Engineering, Xidian University, Xi’an, China

Abstract:
Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods continue to face challenges in effectively restoring complexly distorted images. The features guiding the main network for quality assessment lack interpretability, and efficiently leveraging high-level feature information remains a significant challenge. As a novel class of state-of-the-art (SOTA) generative model, the diffusion model exhibits the capability to model intricate relationships, enhancing image restoration effectiveness. Moreover, the intermediate variables in the denoising iteration process exhibit clearer and more interpretable meanings for high-level visual information guidance. In view of these, we pioneer the exploration of the diffusion model into the domain of NR-IQA. We design a novel diffusion model for enhancing images with various types of distortions, resulting in higher quality and more interpretable high-level visual information. Our experiments demonstrate that the diffusion model establishes a clear mapping relationship between image reconstruction and image quality scores, which the network learns to guide quality assessment. Finally, to fully leverage high-level visual information, we design two complementary visual branches to collaboratively perform quality evaluation. Extensive experiments are conducted on seven public NR-IQA datasets, and the results demonstrate that the proposed model outperforms SOTA methods for NR-IQA. The codes will be available at https://github.com/handsomewzy/DiffV2IQA.

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University, Jurong West, Singapore; College of Metrology and Measurement Engineering, China Jiliang University, Hangzhou, China; Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore; Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA; Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan; Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA

Abstract:
Acoustic resolution photoacoustic microscopy (AR-PAM) is a novel medical imaging modality, which can be used for both structural and functional imaging in deep bio-tissue. However, because it depends on acoustic focusing, the imaging resolution is degraded and structural details are lost, which significantly constrains its scope of applications in medical and clinical scenarios. To address this issue, model-based approaches incorporating traditional analytical prior terms have been employed, but these priors make it challenging to capture finer details of anatomical bio-structures. In this paper, we propose an innovative prior named group sparsity prior for simultaneous reconstruction, which utilizes the non-local structural similarity between patches extracted from internal AR-PAM images. The local image details and resolution are improved, but artifacts are also introduced. To mitigate the artifacts introduced by patch-based reconstruction methods, we further integrate an external image dataset as an extra information provider and consolidate the group sparsity prior with a deep denoiser prior. In this way, complementary information can be exploited to improve reconstruction results. Extensive experiments are conducted to enhance the simulated and in vivo AR-PAM imaging results. Specifically, in the simulated images, the mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) values have increased from 16.36 dB and 0.46 to 27.62 dB and 0.92, respectively. The in vivo reconstructed results also demonstrate that the proposed method achieves superior local and global perceptual qualities: the signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) have significantly increased from 10.59 and 8.61 to 30.83 and 27.54, respectively. Additionally, reconstruction fidelity is validated with optical resolution photoacoustic microscopy (OR-PAM) data as the reference image.

Abstract:
The feature fusion of optical and Synthetic Aperture Radar (SAR) images is widely used for semantic segmentation of multimodal remote sensing images. It leverages information from two different sensors to enhance the analytical capabilities of land cover. However, the imaging characteristics of optical and SAR data are vastly different, and noise interference makes the fusion of multimodal data information challenging. Furthermore, in practical remote sensing applications, there are typically only a limited number of labeled samples available, with most pixels remaining unlabeled. Semi-supervised learning has the potential to improve model performance in scenarios with limited labeled data. However, in remote sensing applications, the quality of pseudo-labels is frequently compromised, particularly in challenging regions such as blurred edges and areas with class confusion. This degradation in label quality can have a detrimental effect on the model’s overall performance. In this paper, we introduce the Difference-complementary Learning and Label Reassignment (DLLR) network for multimodal semi-supervised semantic segmentation of remote sensing images. Our proposed DLLR framework leverages asymmetric masking to create information discrepancies between the optical and SAR modalities, and employs a difference-guided complementary learning strategy to enable mutual learning. Subsequently, we introduce a multi-level label reassignment strategy, treating the label assignment problem as an optimal transport optimization task to allocate unlabeled pixels to classes with higher precision, thereby enhancing the quality of pseudo-label annotations. Finally, we introduce a multimodal consistency cross pseudo-supervision strategy to improve pseudo-label utilization. We evaluate our method on two multimodal remote sensing datasets, namely, the WHU-OPT-SAR and EErDS-OPT-SAR datasets. Experimental results demonstrate that our proposed DLLR model outperforms other relevant deep networks in terms of accuracy in multimodal semantic segmentation.
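
Treating pseudo-label assignment as optimal transport can be sketched with entropic regularization and Sinkhorn iterations, as below. This is an illustrative stand-in, not the paper's multi-level strategy. The balanced class prior, the `eps` value, and the function name `sinkhorn_assign` are all assumptions.

```python
def sinkhorn_assign(scores, class_prior, eps=0.1, iters=200):
    """Entropic optimal transport between pixels (rows) and classes (columns).

    scores[i][j]: predicted probability of class j for unlabeled pixel i.
    class_prior[j]: assumed fraction of pixels that should go to class j.
    Returns a transport plan whose column sums match `class_prior`.
    """
    n, k = len(scores), len(class_prior)
    # Cost is -log(p), so the Gibbs kernel exp(-cost / eps) equals p ** (1 / eps).
    kernel = [[max(p, 1e-12) ** (1.0 / eps) for p in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(iters):
        # Scale rows so every pixel carries mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Scale columns so every class receives its prior mass.
        v = [class_prior[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Four pixels whose raw argmax would all be class 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
plan = sinkhorn_assign(scores, class_prior=[0.5, 0.5])
labels = [max(range(2), key=lambda j: row[j]) for row in plan]
```

With a balanced prior, the two least confident pixels are reassigned to class 1, showing how a global constraint can override per-pixel argmax in ambiguous regions.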

Abstract:
Video data and algorithms have been driving advances in multi-object tracking (MOT). While existing MOT datasets focus on occlusion and appearance similarity, complex motion patterns are widespread yet overlooked. To address this issue, we introduce a new dataset called BEE24 to highlight complex motions. Identity association algorithms have long been the focus of MOT research. Existing trackers can be categorized into two association paradigms: single-feature paradigm (based on either motion or appearance feature) and serial paradigm (one feature serves as secondary while the other is primary). However, these paradigms are incapable of fully utilizing different features. In this paper, we propose a parallel paradigm and present the Two rOund Parallel matchIng meChanism (TOPIC) to implement it. The TOPIC leverages both motion and appearance features and can adaptively select the preferable one as the assignment metric based on motion level. Moreover, we provide an Attention-based Appearance Reconstruction Module (AARM) to reconstruct appearance feature embeddings, thus enhancing the representation of appearance features. Comprehensive experiments show that our approach achieves state-of-the-art performance on four public datasets and BEE24. Moreover, BEE24 challenges existing trackers to track multiple similar-appearing small objects with complex motions over long periods, which is critical in real-world applications such as beekeeping and drone swarm surveillance. Notably, our proposed parallel paradigm surpasses the performance of existing association paradigms by a large margin, e.g., reducing false negatives by 6% to 81% compared to the single-feature association paradigm. The introduced dataset and association paradigm in this work offer a fresh perspective for advancing the MOT field. The source code and dataset are available at https://github.com/holmescao/TOPICTrack.

Affiliations: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, China; Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, China; Criminal Investigation School, Southwest University of Political Science and Law, Chongqing, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China; College of Information Science and Technology, Jinan University, Guangzhou, China

Abstract:
In this paper, we explore a new road for format-compatible 3D object encryption by proposing a novel mechanism of leveraging 2D image encryption methods. It alleviates the difficulty of designing 3D object encryption schemes coming from the intrinsic intricacy of the data structure, and implements the flexible and diverse 3D object encryption designs. First, turning complexity into simplicity, the vertex values, real numbers with continuous values, are converted into integers ranging from 0 to 255. The simplification result for a 3D object is a 2D numerical matrix. Second, six prototypes for three encryption patterns (permutation, diffusion, and permutation-diffusion) are designed as exemplifications to encrypt the 2D matrix. Third, the integer-valued elements in the encrypted numeric matrix are converted into real numbers complying with the syntax of the 3D object. In addition, some experiments are conducted to verify the effectiveness of the proposed mechanism.
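
The three steps can be sketched as follows, under loud assumptions: the abstract does not say how vertex values become integers in 0 to 255, so this sketch simply serializes each coordinate as four float32 bytes, and it uses a key-seeded shuffle as a stand-in permutation. All function names are hypothetical, and `random.Random` is not a cryptographically secure generator.

```python
import random
import struct

def vertices_to_matrix(vertices):
    """Step 1: serialize float32 coordinates into rows of 12 byte values (0..255).

    Assumption: byte serialization stands in for the paper's conversion.
    """
    return [list(struct.pack("<3f", *v)) for v in vertices]

def matrix_to_vertices(matrix):
    """Step 3: turn each 12-byte row back into three real-valued coordinates."""
    return [struct.unpack("<3f", bytes(row)) for row in matrix]

def permute(matrix, key, inverse=False):
    """Step 2 (permutation pattern): shuffle all elements with a keyed order."""
    flat = [b for row in matrix for b in row]
    order = list(range(len(flat)))
    random.Random(key).shuffle(order)  # illustrative only, not secure
    out = [0] * len(flat)
    for dst, src in enumerate(order):
        if inverse:
            out[src] = flat[dst]
        else:
            out[dst] = flat[src]
    width = len(matrix[0])
    return [out[i:i + width] for i in range(0, len(out), width)]

vertices = [(0.5, -1.25, 2.0), (3.0, 0.0, -0.75)]
plain = vertices_to_matrix(vertices)
cipher = permute(plain, key=42)
recovered = matrix_to_vertices(permute(cipher, key=42, inverse=True))
```

Decoding an encrypted matrix directly into floats under this byte-level scheme can yield non-finite values, so a practical design would need a conversion that keeps the output within the 3D file syntax, as the paper's third step requires.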

Abstract:
Unsupervised domain adaptation (UDA) aims to adapt models learned from a well-annotated source domain to a target domain, where only unlabeled samples are available. To this end, adversarial training is widely used in conventional UDA methods to reduce the discrepancy between source and target domains. Recently, prompt tuning has emerged as an efficient way to adapt large pre-trained vision-language models like CLIP to a variety of downstream tasks. In this paper, we present a novel method named Adversarial DuAl Prompt Tuning (ADAPT) for UDA, which employs text prompts and visual prompts to guide CLIP simultaneously. Rather than simply performing a joint optimization of text prompts and visual prompts, we integrate text prompt tuning and visual prompt tuning into a collaborative framework where they engage in an adversarial game: text prompt tuning focuses on distinguishing between source and target images, whereas visual prompt tuning seeks to align source and target domains. Unlike most existing adversarial training-based UDA approaches, ADAPT does not require explicit domain discriminators for domain alignment. Instead, the objective is effectively achieved at both global and category levels through modeling the joint probability distribution of images on domains and categories. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our ADAPT method for UDA. We have released our code at https://github.com/Liuziyi1999/ADAPT.

Abstract:
Dense pixel-wise labeling of large-scale remote sensing images (RSI) is very time-consuming, while sparse labels (i.e., points, scribbles, or blocks) can be an efficient way to reduce labeling costs. Most existing sparse label-based methods adopt only one type of label for image segmentation, which cannot reflect the complex land covers in the RSI for training the model, thus leading to inferior segmentation performance. We observe that land covers with different shapes and complexity can be optimally represented by different sparse labels. Inspired by this observation, we propose a novel sparse labeling framework, termed Hybrid Sparse Labeling (HSLabeling), for large-scale RSI segmentation. Our HSLabeling can adaptively select the optimal hybrid sparse labels for different land covers, according to labeling cost and segmentation contribution of different sparse labels. Specifically, we first propose a label segmentation contribution information estimation module that estimates the information of different sparse labels according to the diversity and shape of land covers. After that, we propose an Optimal Hybrid Labeling Strategy (OHLS) to assign optimal types of labels for different land covers. In the OHLS, label assignment is formulated as an optimization problem that trades off label segmentation contribution information and labeling cost. We employ the greedy algorithm to efficiently solve the optimization problem and adaptively assign labels for varied land covers. Extensive experiments on three large-scale RSI datasets have demonstrated that our HSLabeling achieves almost fully supervised performance with extremely low labeling costs. In addition, compared with the single type sparse label, HSLabeling can also utilize much lower labeling costs to obtain the same performance. The source code is available at https://github.com/linjiaxing99/HSLabeling.
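
A greedy trade-off between label contribution and labeling cost could look like the sketch below, which ranks candidate labels by information per unit cost under a total budget. The budget formulation, the numbers, and the function name `greedy_label_assignment` are assumptions; the paper's actual objective may differ.

```python
def greedy_label_assignment(regions, budget):
    """Pick at most one sparse label type per region, best ratio first.

    regions: {region_name: {label_type: (information, cost)}}
    Assumption: a single global cost budget and an information/cost ranking.
    """
    chosen, spent = {}, 0.0
    candidates = [
        (info / cost, name, label, cost)
        for name, options in regions.items()
        for label, (info, cost) in options.items()
    ]
    for _ratio, name, label, cost in sorted(candidates, reverse=True):
        if name in chosen or spent + cost > budget:
            continue  # region already labeled, or label too expensive
        chosen[name] = label
        spent += cost
    return chosen, spent

# Hypothetical (information, cost) estimates per land cover and label type.
regions = {
    "road": {"scribble": (6.0, 2.0), "point": (2.0, 1.0)},
    "building": {"block": (8.0, 4.0), "point": (1.0, 1.0)},
    "water": {"point": (3.0, 1.0), "block": (4.0, 4.0)},
}
assignment, total_cost = greedy_label_assignment(regions, budget=6.0)
```

In this toy case the elongated road gets a scribble while the building falls back to a point label, because a block label would exceed the budget.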

Abstract:
The Denoising Diffusion Probabilistic Model (DDPM) has demonstrated exceptional performance in low-light enhancement tasks. However, the dependency on paired training data has left the generality of DDPM in low-light enhancement largely untapped. Therefore, this paper proposes a mutually reinforcing learning framework of decoupled degradation and diffusion enhancement, named MRLIE, which leverages style guidance from unpaired low-light images to generate pseudo-image pairs that are consistent with the target domain, thereby optimizing the subsequent diffusion enhancement network in a supervised manner. During the degradation process, the diffusion loss of the fixed enhancement network serves as an evaluation metric for structure consistency and is combined with an adversarial style loss to form the optimization objective for the degradation network. Such a loss design ensures that scene structure information is retained during the degradation process. During the enhancement process, the degradation network with frozen parameters continuously generates pseudo-paired low-/normal-light image pairs as training data, so that the diffusion enhancement network can be progressively optimized. On the whole, the two processes are interdependent and can achieve cooperative improvement in terms of degradation realism and enhancement quality through iterative optimization. Additionally, we propose a Retinex-based decoupled degradation strategy for simulating the complex degradation in real low-light imaging, which ensures the color correction and noise suppression capabilities of the subsequent diffusion enhancement network. Extensive experiments show that MRLIE can achieve promising results and better generality across various datasets.

Affiliations: Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Department of Computer Science and Engineering, Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen, China; Guangdong-Hong Kong-Macao Greater Bay Area Artificial Intelligence Application Technology Research Institute, Shenzhen Polytechnic University, Shenzhen, China; Department of Computer Science, EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London, London, U.K.; Cooperative Innovation Center of Internet Healthcare, Zhengzhou University, Zhengzhou, China

Abstract:
Salient and small lesions (e.g., microaneurysms on fundus) both play significant roles in real-world disease diagnosis under medical image examinations. Although deep neural networks (DNNs) have achieved promising medical image classification performance, they often have limitations in capturing both salient and small lesion information, restricting performance improvement in imbalanced medical image classification. Recently, with the advent of DNN-based style transfer in medical image generation, the roles of clinical styles have attracted great interest, as they are crucial indicators of lesions. Motivated by this observation, we propose a novel Adaptive Dual-Axis Style-based Recalibration (ADSR) module, leveraging the potential of clinical styles to guide DNNs in effectively learning salient and small lesion information from a dual-axis perspective. ADSR first emphasizes salient lesion information via global style-based adaptation, then captures small lesion information with pixel-wise style-based fusion. We construct an ADSR-Net for imbalanced medical image classification by stacking multiple ADSR modules. Additionally, DNNs typically adopt cross-entropy loss for parameter optimization, which ignores the impacts of class-wise predicted probability distributions. To address this, we introduce a new Class-wise Statistics Loss (CWS) combined with CE to further boost imbalanced medical image classification results. Extensive experiments on five imbalanced medical image datasets demonstrate not only the superiority of ADSR-Net and CWS over state-of-the-art (SOTA) methods but also their improved confidence calibration results. For example, ADSR-Net with the proposed loss significantly outperforms CABNet50 by 21.39% and 27.82% in F1 and B-ACC while reducing 3.31% and 4.57% in ECE and BS on ISIC2018.

Abstract:
In the fast-growing field of Remote Sensing (RS) image analysis, the gap between massive unlabeled datasets and the ability to fully utilize these datasets for advanced RS analytics presents a significant challenge. To fill the gap, our work introduces an innovative auto-labeling framework named ALPS (Automatic Labeling for Pre-training in Segmentation), which leverages the Segment Anything Model (SAM) to predict precise pseudo-labels for RS images without necessitating prior annotations or additional prompts. The proposed pipeline significantly reduces the labor and resource demands traditionally associated with annotating RS datasets. By constructing two comprehensive pseudo-labeled RS datasets via ALPS for pre-training purposes, our approach enhances the performance of downstream tasks across various benchmarks, including iSAID and ISPRS Potsdam. Experiments demonstrate the effectiveness of our framework, showing its ability to generalize well across multiple tasks even under the scarcity of extensively annotated datasets, offering a scalable solution to automatic segmentation and annotation challenges in the field. In addition, the proposed pipeline is flexible and can be applied to medical image segmentation, remarkably increasing the performance. Note that ALPS utilizes pre-trained SAM to semi-automatically annotate RS images without additional manual annotations. Although every component in the pipeline has been well explored, integrating clustering algorithms with SAM and novel pseudo-label alignment significantly enhances RS segmentation, as an off-the-shelf tool for pre-training data preparation. Our source code is available at: https://github.com/StriveZs/ALPS.

Abstract:
Remote photoplethysmography (rPPG) is a promising technology for capturing physiological signals from facial videos, with potential applications in medical health, affective computing, and biometric recognition. The demand for rPPG tasks has evolved from achieving high performance in intra-dataset testing to excelling in cross-dataset testing (i.e., domain generalization). However, most existing methods have overlooked the incorporation of prior knowledge specific to rPPG, leading to limited generalization capabilities. In this paper, we propose a novel framework that effectively integrates both explicit and implicit prior knowledge into the rPPG task. Specifically, we conduct a systematic analysis of noise sources (e.g., variations in cameras, lighting conditions, skin types, and motion) across different domains and embed this prior knowledge into the network design. Furthermore, we employ a two-branch network to disentangle physiological feature distributions from noise through implicit label correlation. Extensive experiments demonstrate that the proposed method not only surpasses state-of-the-art approaches in RGB cross-dataset evaluation but also exhibits strong generalization from RGB datasets to NIR datasets. The code is publicly available at https://github.com/keke-nice/Greip

Abstract:
Stationary functional brain networks (sFBNs) and dynamic functional brain networks (dFBNs) derived from resting-state functional MRI (rs-fMRI) characterize the complex interactions of the human brain from different aspects and could offer complementary information for brain disease analysis. Most current studies focus on either sFBN or dFBN analysis alone, thus limiting the performance of brain network analysis. A few works have explored integrating sFBN and dFBN to identify brain diseases and achieved better performance than conventional methods. However, these studies still ignore some valuable discriminative information, such as the distribution information of subjects between and within categories. This paper presents a Double Collaborative Learning Network (DCLNet), which takes advantage of both a collaborative encoder and collaborative contrastive learning to learn the complementary information of sFBN and dFBN and the distribution information of subjects between and within categories for brain disease classification. Specifically, we first construct sFBN and dFBN from rs-fMRI data using traditional correlation-based methods. Then, we build a collaborative encoder to extract brain network features at different levels (i.e., connectivity-based, brain-region-based, and brain-network-based features), and design a prune-graft transformer module to embed the complementary information of the features at each level between the two kinds of FBNs. We also develop a collaborative contrastive learning module to capture the distribution information of subjects between and within different categories, thereby learning more discriminative features of brain networks. We evaluate DCLNet on two real brain disease datasets with rs-fMRI data, with experimental results demonstrating the superiority of the proposed method.

Abstract:
Object tracking is commonly treated as a template matching task. Traditional and deep learning-based methods have achieved high performance in satellite video object tracking (SVOT). However, existing methods still suffer from insufficiently discriminative features, complex approaches to handling occlusion, and excessive hyperparameters. In response to these issues, we propose a simple yet effective Siamese network, termed SiamTITP. A temporal information (TI) submodule is developed, which integrates temporal cues by dynamically updating the template to enhance discriminative features. Furthermore, we propose a structurally simple trajectory prediction (TP) submodule, which uses only a polynomial function fitted to historical results to assist the network in addressing occlusion. To reduce hyperparameters, we forgo feature fusion steps and weighted results, and instead propose an adaptive occlusion judgment metric based on the target size. To validate the efficacy of our approach, we conducted extensive experiments on three large satellite video datasets, namely the SatSOT, SV248S, and OOTB datasets. Code and trained models are publicly available at https://github.com/jiawei-zhou/SiamTITP
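
As a rough illustration of polynomial trajectory prediction of the kind the TP submodule relies on, the sketch below fits one polynomial per coordinate to past target centers and extrapolates one frame ahead. The function name `predict_next_center`, the default degree, and the use of frame indices as the time axis are assumptions, not details taken from the paper.

```python
import numpy as np


def predict_next_center(history, degree=2):
    """Extrapolate the next (x, y) center from past centers with a polynomial fit."""
    history = np.asarray(history, dtype=float)  # shape (T, 2), one row per frame
    t = np.arange(len(history))                 # frame indices 0..T-1
    deg = min(degree, len(history) - 1)         # avoid an under-determined fit
    next_t = len(history)                       # index of the frame to predict
    x_coef = np.polyfit(t, history[:, 0], deg)
    y_coef = np.polyfit(t, history[:, 1], deg)
    return float(np.polyval(x_coef, next_t)), float(np.polyval(y_coef, next_t))


# Usage: a target moving along x = t, y = t^2 for four frames.
past_centers = [(0, 0), (1, 1), (2, 4), (3, 9)]
predicted = predict_next_center(past_centers, degree=2)
```

During an occlusion, a tracker could fall back on such an extrapolated position instead of the (unreliable) matching response; how SiamTITP combines the two is not specified in the abstract.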

Abstract:
Gigapixel whole-slide image (WSI) prediction and region-of-interest localization present considerable challenges due to the diverse range of features both across different slides and within individual slides. Most current methods rely on weakly supervised learning using homogeneous graphs to establish context-aware relevance within slides, often neglecting the rich diversity of heterogeneous information inherent in pathology images. Inspired by the negative sampling strategy of the Determinantal Point Process (DPP) and the hierarchical structure of pathology slides, we introduce the Negative Sample Boosted Hierarchical Heterogeneous Graph Attention Network (NSB-H2GAN). This model addresses the over-smoothing issue typically encountered in classical Graph Convolutional Networks (GCNs) when applied to pathology slides. By incorporating “negative samples” at multiple scales and utilizing hierarchical, heterogeneous feature discrimination, NSB-H2GAN more effectively captures the unique features of each patch, leading to an improved representation of gigapixel WSIs. We evaluated the performance of NSB-H2GAN on three publicly available datasets: CAMELYON16, TCGA-NSCLC and TCGA-COAD. The results show that NSB-H2GAN significantly outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, NSB-H2GAN generates more detailed and interpretable heatmaps, allowing for precise localization of tiny lesions as small as 200 μm × 200 μm that are often missed by the human eye. The robust performance of NSB-H2GAN offers a new paradigm for computer-aided pathology diagnosis and holds great potential for advancing the clinical applications of computational pathology.

Abstract:
The ability to quantify widefield tissue optical properties (OPs, i.e., absorption and scattering) has major implications for the characterization of various physiological and disease processes. However, conventional image processing methods for tissue optical properties are either limited to qualitative analysis or have tradeoffs between speed and accuracy. The key to quantification of optical properties is the extraction of amplitude maps from reflectance images under sinusoidal illumination of different spatial frequencies. The conventional three-phase demodulation (TPD) method has been demonstrated for the mapping of OPs, but it requires as many as 14 measurement images for accurate OP extraction, which leads to limited throughput and hinders practical translation. Although the single-phase demodulation (SPD) method has been proposed to map OPs with a single measurement image, it is typically subject to image artifacts and decreased measurement accuracy. To tackle these challenges, we develop a deep ensemble model (DEM) that can map tissue optical properties with high accuracy in a single snapshot, increasing the measurement speed by 14× compared to the conventional TPD method. The proposed method was validated with measurements on an array of optical phantoms, ex vivo tissues, and in vivo tissues. The errors for OP extraction were 0.83 ± 5.0% for absorption and 0.40 ± 1.9% for reduced scattering, dramatically lower than those of the state-of-the-art SPD method (2.5 ± 15% for absorption and −1.2 ± 11% for reduced scattering). It was further demonstrated that, while trained with data from a single wavelength, the DEM can be directly applied to other wavelengths and effectively obtain optical property and chromophore concentration images of biological tissues. Together, these results highlight the potential of DEM to enable new capabilities for quantitative monitoring of tissue physiological and disease processes.
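
For readers unfamiliar with the amplitude extraction that TPD performs, the sketch below applies the standard three-phase demodulation formula from the spatial frequency domain imaging literature to three frames whose illumination phases are offset by 0, 2π/3, and 4π/3. The abstract does not state the formula, so its use here, the function name, and the synthetic 1D test pattern are assumptions.

```python
import numpy as np


def three_phase_demodulate(i1, i2, i3):
    """Return (AC amplitude, DC level) from three phase-shifted reflectance frames."""
    i1, i2, i3 = (np.asarray(a, dtype=float) for a in (i1, i2, i3))
    # Pairwise differences cancel the DC term and the unknown spatial phase.
    ac = (np.sqrt(2.0) / 3.0) * np.sqrt(
        (i1 - i2) ** 2 + (i2 - i3) ** 2 + (i3 - i1) ** 2
    )
    dc = (i1 + i2 + i3) / 3.0
    return ac, dc


# Synthetic check: a sinusoidal pattern with known amplitude and offset.
x = np.linspace(0.0, 4.0 * np.pi, 64)
true_ac, true_dc = 0.3, 1.0
frames = [true_dc + true_ac * np.cos(x + k * 2.0 * np.pi / 3.0) for k in range(3)]
ac_map, dc_map = three_phase_demodulate(*frames)
```

Because three frames are needed per spatial frequency (and several frequencies and references are typically acquired), the image count grows quickly, which is the throughput limitation that single-snapshot approaches aim to remove.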

Abstract:
Coded Aperture Snapshot Spectral Imaging (CASSI) multiplexes 3D Hyperspectral Images (HSIs) into a 2D sensor to capture dynamic spectral scenes, which, however, sacrifices the spatial information. Dual-Camera Compressive Hyperspectral Imaging (DCCHI) enhances CASSI by incorporating a Panchromatic (PAN) camera to compensate for the loss of spatial information in CASSI. However, the dual-camera structure of DCCHI disrupts the diagonal property of the product of the sensing matrix and its transpose, making it difficult to efficiently and accurately solve the data subproblem in closed-form and thereby hindering the application of model-based methods and Deep Unfolding Networks (DUNs) that rely on such a closed-form solution. To address this issue, we propose an Alternating Direction DUN, named ADRNN, which decouples the imaging model of DCCHI into a CASSI subproblem and a PAN subproblem. The ADRNN alternately solves data terms analytically and a joint prior term in these subproblems. Additionally, we propose a Cross Spectral Transformer (XST) to exploit the joint prior. The XST utilizes cross spectral attention to exploit the correlation between the compressed HSI and the PAN image, and incorporates Grouped-Query Attention (GQA) to alleviate the burden of parameters and computational cost brought by impartially treating the compressed HSI and the PAN image. Furthermore, we built a real DCCHI system and captured large-scale indoor and outdoor scenes for future academic research. Extensive experiments on both simulation and real datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance. The code and datasets have been open-sourced at: https://github.com/ShawnDong98/ADRNN-XST

Abstract:
Hyperspectral imaging technology is considered a new paradigm for high-precision pathological image segmentation due to its ability to obtain spatial and spectral information of the detected object simultaneously. However, because manual annotation is time-consuming and laborious, precise annotations of medical hyperspectral images are difficult to obtain. Therefore, there is an urgent need for a semi-supervised learning framework that can fully utilize unlabeled data for medical hyperspectral image segmentation. In this work, we propose an adversarial consistency constraint learning cross indication network (ACCL-CINet), which achieves accurate pathological image segmentation through adversarial consistency constraint learning training strategies. The ACCL-CINet comprises a contextual encoder and a structural encoder that form the spatial-spectral feature encoding part. The contextual and structural indications are aggregated into features through a cross indication attention module and finally decoded by a pixel decoder to generate prediction results. For the semi-supervised training strategy, first, a pixel perceptual consistency module encourages the two models to generate consistent and low-entropy predictions. Second, a pixel maximum neighborhood probability adversarial constraint strategy is designed, which produces high-quality pseudo labels for cross supervision training. The proposed ACCL-CINet has been rigorously evaluated on both public and private datasets, with experimental results demonstrating that it outperforms state-of-the-art semi-supervised methods. The code is available at: https://github.com/Qugeryolo/ACCL-CINet

Abstract:
Single-shot 3D surface imaging techniques with high accuracy and high resolution are very important in both academia and industry. In this paper, we propose a sparse-to-dense structured light (SL) line-pattern based active stereo vision (ASV) approach to reconstruct 3D shapes robustly at high resolution. We propose a sparse-to-dense stereo matching (SDSM) method to solve the challenging problem of line clustering and line matching. We design the structured light line pattern with four colors, where the spacing between lines varies by color from sparse to dense. Accordingly, the sparse color lines can be clustered and matched first, while the dense color lines can be matched subsequently under the constraint of the clustered and matched sparse color lines. After all the color lines are matched, a spline-function based parallax model (SFPM) is computed from the points on the matched color lines. Then, the depths of the points in the regions between the color lines are computed by the parallax model. Experimental results show that the proposed SDSM-SFPM ASV approach is more robust than existing methods, especially in reconstructing complex 3D shapes.

Abstract:
Online Class-Incremental Learning (OCIL) aims to solve the problem of incrementally learning new classes from a non-i.i.d. and single-pass data stream. Compared to the offline setting, OCIL is much closer to a live learning experience, requiring a higher model update frequency with a smaller computational budget. Due to its one-epoch training constraint, the model is likely to learn non-essential features and encounter the under-fitting issue, which severely affects the model’s stability. In this paper, we investigate how to use hard samples to improve data variability, eventually enhancing feature learning and addressing the under-fitting problem. Specifically, by introducing a scoring function that assesses the sample value, we build an OCIL formulation that simultaneously generates high-value samples and optimizes the OCIL model, improving generalization ability within the constraint of single-epoch training. Empirically, we found that strong data augmentation is a simple but effective way to generate a higher proportion of high-score samples. To make the most of these augmented samples, we design an OCIL model based on mutual learning with two networks of identical structure. Moreover, a collaborative learning mechanism is developed by aligning the features and class probabilities from the two networks to promote their interaction. Extensive experiments on three widely used datasets for OCIL have demonstrated the effectiveness of our method, which obtains superior performance to state-of-the-art methods. The code is available at https://github.com/susususushi/SDA-MCL

Abstract:
The scarcity of data in various scenarios, such as medical, industrial, and autonomous driving applications, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide “crucial information” that targets the downstream model’s weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of the downstream model, which is then fused with the output of the SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On the polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code is publicly available at https://github.com/JJessicaYao/Crucial-diff

Affiliations: College of Electronic and Information Engineering, Tongji University, Shanghai, China; Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece; College of Electronic and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems, the State Key Laboratory of Autonomous Intelligent Unmanned Systems, and the Frontiers Science Center for Intelligent Autonomous Systems (Ministry of Education), Tongji University, Shanghai, China

Abstract:
Despite the impressive performance achieved by data-fusion networks with duplex encoders for visual semantic segmentation, they become ineffective when spatial geometric data are not available. Implicitly infusing the spatial geometric prior knowledge acquired by a data-fusion teacher network into a single-modal student network is a practical, albeit less explored research avenue. This article delves into this topic and resorts to knowledge distillation approaches to address this problem. We introduce the Learning to Infuse “X” (LIX) framework, with novel contributions in both logit distillation and feature distillation aspects. We present a mathematical proof that underscores the limitation of using a single, fixed weight in decoupled knowledge distillation and introduce a logit-wise dynamic weight controller as a solution to this issue. Furthermore, we develop an adaptively-recalibrated feature distillation algorithm, including two novel techniques: feature recalibration via kernel regression and feature consistency quantification via centered kernel alignment. Extensive experiments conducted with intermediate-fusion and late-fusion networks across various public datasets provide both quantitative and qualitative evaluations, demonstrating the superior performance of our LIX framework when compared to other state-of-the-art approaches. Source code is available at https://mias.group/LIX.
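
Centered kernel alignment, which the abstract uses to quantify feature consistency, can be sketched in its linear form as below. Which kernel LIX actually uses is not stated in the abstract, so the linear variant, the function name `linear_cka`, and the random stand-in features are assumptions for illustration only.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, n_features)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


# Usage with stand-in "teacher" and "student" features for the same 50 samples.
rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
student_feat = 3.0 * teacher_feat @ rotation  # rotated and rescaled copy
similarity = linear_cka(teacher_feat, student_feat)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated, rescaled copy scores (numerically) 1, while unrelated features score closer to 0.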

Abstract:
Most remote sensing datasets are annotated with horizontal bounding boxes (HBBs), which conflicts with mainstream oriented object detection methods that require oriented bounding boxes (OBBs). Horizontal box supervised oriented object detection has emerged as a promising solution, but existing methods suffer from two key limitations. First, they apply image-level geometric transformations for consistency learning, which binds object orientation to the global image and limits the model’s ability to learn instance-specific orientation features. Second, they rely on data augmentation for orientation awareness while still using conventional horizontal convolutional neural networks (CNNs) for regression, failing to extract orientation-sensitive features effectively. To address these issues, we propose the Instance-Level Orientation Information Enhanced Detector (ILOEDet), which integrates the Instance-Aware Rotated Convolution Module (IARCM) and an Instance-Level Flip Consistency (IFC) mechanism to improve orientation sensitivity. Specifically, IARCM leverages classification and center-ness scores to select high-quality instances and their predicted angles, guiding a rotated convolution operation to embed instance-level orientation information into the feature maps. Meanwhile, IFC introduces a self-supervised branch that flips individual object instances to decouple their orientation from the image background, enforcing instance-level consistency constraints for more robust orientation learning. Experiments on the DOTA, HRSC2016, and DIOR-R datasets demonstrate the effectiveness of our approach.

Abstract:
Underwater data is inherently scarce and exhibits complex distributions, making it challenging to train high-performance models from scratch. In contrast, in-air models are structurally mature, resource-rich, and offer strong potential for transfer. However, significant discrepancies in visual characteristics and feature distributions between underwater and in-air environments often lead to severe performance degradation when applying in-air models directly. To address this issue, we propose IA2U, a lightweight plugin designed for efficient underwater adaptation without modifying the original model architecture. IA2U can be flexibly integrated into arbitrary in-air networks, offering high generalizability and low deployment costs. Specifically, IA2U incorporates three types of prior knowledge—water type, degradation pattern, and sample semantics—which are embedded into intermediate layers through feature injection and channel-wise modulation to guide the network’s response to underwater-specific features. Furthermore, a multi-scale feature alignment module is introduced to dynamically balance information across different resolution paths, enhancing consistency and contextual representation. Extensive experiments demonstrate that IA2U significantly improves both image enhancement and object detection performance. Specifically, on the UIEB dataset, IA2U boosts Shallow-UWNet by 5.2 dB in PSNR and reduces LPIPS by 52%; on the RUOD dataset, it increases AP by 1.8% when applied to the PAA detector. IA2U provides an effective and scalable solution for building robust underwater perception systems with minimal adaptation costs. Our code is available at https://github.com/zhoujingchun03/IA2U

Abstract:
Few-shot unsupervised domain adaptation (FS-UDA) leverages a limited amount of labeled data from a source domain to enable accurate classification in an unlabeled target domain. Despite recent advancements, current FS-UDA approaches continue to confront a major challenge: models often demonstrate instability when adapted to new FS-UDA tasks and require considerable time investment. To address these challenges, we put forward a novel framework called Enduring and Efficient Meta-Prompt Learning (E2MPL) for FS-UDA. Within this framework, we utilize the pre-trained CLIP model as the backbone for feature learning. Firstly, we design domain-shared prompts, consisting of virtual tokens, which primarily capture meta-knowledge from a wide range of meta-tasks to mitigate the domain gaps. Secondly, we develop a task prompt learning network that adaptively learns task-specific prompts with the goal of achieving fast and stable task generalization. Thirdly, we formulate the meta-prompt learning process as a bilevel optimization problem, consisting of an (outer) meta-prompt learner and an (inner) task-specific classifier and domain adapter. Moreover, the inner objective of each meta-task has a closed-form solution, which enables efficient prompt learning and adaptation to new tasks in a single step. Extensive experimental studies demonstrate the promising performance of our framework on the domain adaptation benchmark dataset DomainNet. Compared with state-of-the-art methods, our approach improves the average accuracy by at least 15 percentage points and reduces the average time by 64.67% in the 5-way 1-shot task; in the 5-way 5-shot task, it achieves at least a 9-percentage-point improvement in average accuracy and reduces the average time by 63.18%. Moreover, our method exhibits more enduring and stable performance than the other methods, i.e., reducing the average IQR value by over 40.80% and 25.35% in the 5-way 1-shot and 5-shot tasks, respectively.

Abstract:
Open World Object Detection (OWOD) aims to adapt object detection to an open-world environment, so as to detect unknown objects and learn knowledge incrementally. Existing OWOD methods typically leverage training sets with a relatively small number of known objects. Due to the absence of generic object knowledge, they fail to comprehensively perceive objects beyond the scope of training sets. Recent advancements in large vision models (LVMs), trained on extensive large-scale data, offer a promising opportunity to harness rich generic knowledge for the fundamental advancement of OWOD. Motivated by the Segment Anything Model (SAM), a prominent LVM lauded for its exceptional ability to segment generic objects, we first demonstrate the feasibility of employing SAM for OWOD and establish the very first SAM-Guided OWOD baseline solution. Subsequently, we identify and address two fundamental challenges in SAM-Guided OWOD and propose a pioneering SAM-Guided Robust Open-world Detector (SGROD) method, which can significantly improve the recall of unknown objects without losing precision on known objects. Specifically, the two challenges in SAM-Guided OWOD are: 1) noisy labels caused by the class-agnostic nature of SAM; 2) precision degradation on known objects when more unknown objects are recalled. For the first problem, we propose a dynamic label assignment (DLA) method that adaptively selects confident labels from SAM during training, markedly reducing the impact of noise. For the second problem, we introduce cross-layer learning (CLL) and SAM-based negative sampling (SNS), which enable SGROD to avoid precision loss by learning robust decision boundaries of objectness and classification. Experiments on public datasets show that SGROD not only improves the recall of unknown objects by a large margin (~20%), but also preserves highly competitive precision on known objects. The code is available at https://github.com/harrylin-hyl/SGROD.

Abstract:
The cross-channel deblurring problem in color image processing is difficult to solve due to the complex coupling and structural blurring of color pixels. To date, few efficient algorithms can reduce color artifacts in the deblurring process. To solve this challenging problem, we present a novel cross-space total variation (CSTV) regularization model for color image deblurring by introducing a quaternion blur operator and a cross-color space regularization functional. The existence and uniqueness of the solution are proved, and a new L-curve method is proposed to balance the regularization terms on different color spaces. The Euler-Lagrange equation is derived to show that CSTV takes into account the coupling of all color channels and the local smoothing within each color channel. A quaternion operator splitting method is proposed for the first time to enhance the color artifact reduction ability of the CSTV regularization model. This strategy also applies to well-known color deblurring models. Numerical experiments on color image databases illustrate the efficiency and effectiveness of the new model and algorithms. The color images they restore successfully maintain the color and spatial information and are of higher quality in terms of PSNR, SSIM, MSE, and CIEDE2000 than the restorations of state-of-the-art methods.

Abstract:
Segmenting objects from cluttered backgrounds in single-channel images, such as marine radar echoes, medical images, and remote sensing images, poses significant challenges due to limited texture, color information, and diverse target types. This paper proposes a novel solution: the Onet, an O-shaped assembly of twin U-Net deep neural networks, designed for unsupervised binary semantic segmentation. The Onet, trained with an intensity-complementary image pair and without the need for annotated labels, maximizes the Jensen-Shannon divergence (JSD) between the densely localized features and the class probability maps. By leveraging the symmetry of U-Net, Onet subtly strengthens the dependence between dense local features, global features, and class probability maps during the training process. The design of the complementary input pair aligns with the theoretical requirement that optimizing JSD needs the class probability of negative samples to accurately estimate the marginal distribution. Compared to the current leading unsupervised segmentation methods, the Onet demonstrates superior performance in target segmentation in marine radar frames and cloud segmentation in remote sensing images. Notably, we found that Onet’s foreground prediction significantly enhances the signal-to-noise ratio (SNR) of targets amidst marine radar clutter. Onet’s source code is publicly accessible at https://github.com/joeyee/Onet.
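
As background for the objective named above, the following sketch computes the Jensen-Shannon divergence between two discrete distributions. It only illustrates the definition of JSD; Onet's actual training objective, which relates dense features to class probability maps, is not reproduced here, and the function name and example distributions are assumptions.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so inputs may be unnormalized histograms
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute zero to the KL sum
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

JSD is symmetric and bounded: it is 0 for identical distributions and reaches ln 2 for distributions with disjoint support, which is why maximizing it pushes the two compared distributions apart.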

Abstract:
Perceptual edge grouping, which organizes cluttered edge pixels into meaningful structures that serve high-level vision tasks, has long been a basic and critical task in computer vision. Existing methods usually perform poorly when coping with the junctions caused by occlusion and noise in natural images. In this paper, we present GPGrouper, a perceptual edge grouping model based on Gestalt theory and the primary visual cortex (V1). Unlike existing methods, GPGrouper leverages the edge representation and grouping matrix (ERGM), a functional structure inspired by V1 mechanisms, to represent edges in a way that effectively reduces grouping errors caused by occlusion between objects. ERGM is trained with natural image contours and further provides a priori guidance for the construction of the edge connection graph (ECG), which helps minimize the impact of noise on grouping. In the experiments, we compared GPGrouper with the state-of-the-art (SOTA) perceptual grouping method on the visual psychology pathfinder challenge. The results demonstrate that GPGrouper outperforms the SOTA method in grouping performance. Furthermore, in grouping experiments involving line segments of varying lengths detected by the Line Segment Detector (LSD), as well as those involving superpixel segmentation results with significant levels of interfering noise produced by the SLIC algorithm, GPGrouper was superior to existing methods in terms of grouping effect and robustness. Moreover, applying the grouping results to the objectness vision task demonstrates that GPGrouper can contribute significantly to high-level visual tasks.

Abstract:
Recently, Magnetic Particle Imaging, an emerging functional imaging modality, has exhibited outstanding spatial-temporal resolution and sensitivity. The general reconstruction pipeline of Magnetic Particle Imaging involves calibrating a System Matrix and then solving an ill-posed inverse problem combined with the measured particle signals. However, the introduction of noise during the System Matrix calibration procedure is inevitable, which degrades the detailed information in the reconstructed images. Therefore, frequency selection methods based on signal-to-noise ratio are commonly adopted. However, these methods lead to a decrease in the available high-frequency components, which damages the spatial resolution. To address this problem, we propose an unsupervised memory-guided denoising framework with unpaired noisy-clean System Matrix components, called U-N2C. Specifically, we design a Pattern Memory Block to memorize System Matrix patterns, directed by a position-aware frequency index embedding. Meanwhile, we devise a Noise Memory Block to implicitly approximate noise distributions. With the guidance of our dual memory blocks, we can disentangle the noise and content of the System Matrix in the latent space. Furthermore, benefiting from the ability to model complex noise, our method can generate pseudo but high-quality noisy-clean pairs and further enhance our denoising capability. Experiments on both synthetic and real noise demonstrate that our U-N2C achieves cutting-edge performance compared to other methods. Moreover, we conduct extensive qualitative and quantitative ablation studies to verify the effectiveness of our method. Our code is available at U-N2C.

Abstract:
To achieve saliency prediction in omnidirectional images (ODIs), the majority of prior works adopt convolutional neural network (CNN)-based saliency models to extract semantic features and predict prominent regions in ODIs. Despite achieving substantial performance gains, these works all employ purely visual computing paradigms and neglect the nature of human visual attention mechanisms. In other words, existing saliency prediction works for ODIs are insufficient to capture the biological characteristics of the visual attention mechanism in the human brain. To establish a more explicit link between saliency prediction performance and brain-like visual attention mechanisms, we simulate the mechanism of human retrospective memory in neuropsychology and propose the IMRE model, a novel iterative memory-retrospective emergence model that can predict and infer salient features by recalling previously learned information. In the IMRE model, we introduce four key modules to simulate the visual attention mechanism of the human brain for predicting human fixations. Firstly, the visual stimulus response module is designed to effectively extract semantic features and capture the intricate relationship between these features, acting as the human visual cortex. Secondly, the retrospective integration module serves to distill valuable information from a fuzzy memory ensemble, resembling the role of the basal ganglia in the neural system. Thirdly, the memory bank module explicitly records and stores subconscious response information and learned knowledge, acting like the hippocampus in the neural system. Lastly, the prospective inference module accurately infers saliency maps from the refined useful information, resembling the role of the prefrontal cortex. During prediction, we utilize the introduced memory bank to retrieve and recall previously learned information, which simulates the process of memory emergence from haziness to clarity.
Such a process aligns with the retrospective memory mechanism of the human brain. To validate the superiority of the proposed model in ODI saliency prediction tasks, we conduct extensive experiments on two benchmark datasets. Experiments show that the IMRE model outperforms other state-of-the-art methods across all benchmark datasets. Importantly, experiments also highlight the IMRE model’s ability to trace back to specific instances during prediction, thereby reducing model inference costs and enhancing interpretability.

Abstract:
Text-based person retrieval is defined as the challenging task of searching for people’s images based on given textual queries in natural language. Conventional methods primarily use deep neural networks to understand the relationship between visual and textual data, creating a shared feature space for cross-modal matching. The absence of awareness regarding variations in feature granularity between the two modalities, coupled with the diverse poses and viewing angles of images corresponding to the same individual, may lead to overlooking significant differences within each modality and across modalities, despite notable enhancements. Furthermore, the inconsistency in caption queries in large public datasets presents an additional obstacle to cross-modality mapping learning. Therefore, we introduce 3RTPR, a novel text-based person retrieval method that integrates a representation fusing mechanism and an adaptive loss refinement algorithm into a dual-encoder branch architecture. Moreover, we propose training two independent models simultaneously, which reciprocally support each other to enhance learning effectiveness. Consequently, our approach encompasses three significant contributions: (i) proposing a fused representation method to generate more discriminative representations for images and captions; (ii) introducing a novel algorithm to adjust loss and prioritize samples that contain valuable information; and (iii) proposing reciprocal learning involving a pair of independent models, which allows us to enhance general retrieval performance. In order to validate our method’s effectiveness, we also demonstrate superior performance over state-of-the-art methods by performing rigorous experiments on three well-known benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.

Abstract:
Point cloud compression significantly reduces data volume but sacrifices reconstruction quality, highlighting the need for advanced quality enhancement techniques. Most existing approaches focus primarily on point-to-point fidelity, often neglecting the importance of perceptual quality as interpreted by the human visual system. To address this issue, we propose a generative adversarial network for point cloud quality enhancement (PCE-GAN), grounded in optimal transport theory, with the goal of simultaneously optimizing both data fidelity and perceptual quality. The generator consists of a local feature extraction (LFE) unit, a global spatial correlation (GSC) unit and a feature squeeze unit. The LFE unit uses dynamic graph construction and a graph attention mechanism to efficiently extract local features, placing greater emphasis on points with severe distortion. The GSC unit uses the geometry information of neighboring patches to construct an extended local neighborhood and introduces a transformer-style structure to capture long-range global correlations. The discriminator computes the deviation between the probability distributions of the enhanced point cloud and the original point cloud, guiding the generator to achieve high quality reconstruction. Experimental results show that the proposed method achieves state-of-the-art performance. Specifically, when applying PCE-GAN to the latest geometry-based point cloud compression (G-PCC) test model, it achieves an average BD-rate of -19.2% compared with the PredLift coding configuration and -18.3% compared with the RAHT coding configuration. Subjective comparisons show a significant improvement in texture clarity and color transitions, revealing finer details and more natural color gradients.

Abstract:
Conventional computer vision pipelines typically treat low-level enhancement and high-level semantic tasks as isolated processes, focusing on optimizing enhancement for perceptual quality rather than computational utility, neglecting semantic task requirements. To bridge this gap, this paper proposes an integrated joint optimization architecture that aligns the objectives of enhancement tasks with the practical needs of semantic tasks. Specifically, the architecture ensures that medical image segmentation (the semantic task) benefits directly from super-resolution pre-processing (the enhancement task). This integrated architecture fundamentally differs from conventional sequential frameworks by enabling joint training of super-resolution and segmentation networks. Guided by its own content reconstruction loss and semantic loss transferred from segmentation, the super-resolution network prioritizes semantically significant regions for segmentation-driven reconstruction. Comprehensive comparative and ablation studies demonstrate that the network, trained jointly, markedly enhances segmentation performance in low-resolution images, even outperforming those directly from referenced high-resolution images. The code is available at https://github.com/kldys/JOANet

Abstract:
Camouflaged object detection (COD) and salient object detection (SOD) are two distinct yet closely related computer vision tasks widely studied during the past decades. Though sharing the same purpose of segmenting an image into binary foreground and background regions, their distinction lies in the fact that COD focuses on concealed objects hidden in the image, while SOD concentrates on the most prominent objects in the image. Building universal segmentation models is currently a hot topic in the community. Previous works achieved good performance on certain tasks by stacking various hand-designed modules and multi-scale features. However, these careful task-specific designs also make them lose their potential as general-purpose architectures. Therefore, we hope to build general architectures that can be applied to both tasks. In this work, we propose a simple yet effective network (SENet) based on the vision Transformer (ViT). By employing a simple asymmetric ViT-based encoder-decoder structure, we obtain competitive results on both tasks, exhibiting greater versatility than meticulously crafted designs. To enhance the performance of universal architectures on both tasks, we propose some general methods targeting common difficulties of the two tasks. First, we use image reconstruction as an auxiliary task during training to increase the difficulty of training, forcing the network to have a better perception of the image as a whole to help with segmentation tasks. In addition, we propose a local information capture module (LICM) to make up for the limitations of the patch-level attention mechanism in pixel-level COD and SOD tasks and a dynamic weighted loss (DW loss) to solve the problem that small target samples are more difficult to locate and segment in both tasks. Finally, we also conduct a preliminary exploration of joint training, trying to use one model to complete two tasks simultaneously.
Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method. The code is available at https://github.com/linuxsino/SENet.

Abstract:
Vision-based Bird’s Eye View (BEV) representation is an emerging perception formulation for autonomous driving. The core challenge is to construct BEV space with multi-camera features, which is a one-to-many ill-posed problem. Diving into all previous BEV representation generation methods, we found that most of them fall into two types: modeling depths in image views or modeling heights in the BEV space, mostly in an implicit way. In this work, we propose to explicitly model heights in the BEV space, which needs no extra data like LiDAR and can fit arbitrary camera rigs and types compared to modeling depths. Theoretically, we give proof of the equivalence between height-based methods and depth-based methods. Considering the equivalence and some advantages of modeling heights, we propose HeightFormer, which models heights and uncertainties in a self-recursive way. Without any extra data, the proposed HeightFormer could estimate heights in BEV accurately. Benchmark results show that the performance of HeightFormer achieves SOTA compared with those camera-only methods.
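
The link between heights in BEV space and depths in image views can be illustrated with a simplified pinhole model. The sketch below assumes a forward-facing camera with no rotation, mounted `cam_height` metres above a flat ground plane, with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. It only illustrates the geometry and is neither the equivalence proof nor the HeightFormer implementation.

```python
def bev_to_pixel(x, z, height, cam_height, fx, fy, cx, cy):
    """Project a BEV cell at lateral offset x and forward distance z (metres),
    with a given height above the ground, into pixel coordinates (u, v).

    Camera frame convention assumed here: x right, y down, z forward.
    """
    u = fx * x / z + cx
    v = fy * (cam_height - height) / z + cy
    return u, v


def height_from_depth(v, depth, cam_height, fy, cy):
    """Recover the height above ground of a point seen at image row v,
    given its depth along the optical axis."""
    return cam_height - (v - cy) * depth / fy


# Hypothetical camera: 1.5 m above ground, 1000 px focal length.
params = dict(cam_height=1.5, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)

u, v = bev_to_pixel(x=2.0, z=10.0, height=0.5, **params)
recovered = height_from_depth(v, depth=10.0, cam_height=1.5, fy=1000.0, cy=360.0)
print(u, v, recovered)
```

Under these assumptions, fixing the BEV cell and choosing a height determines the pixel, while fixing the pixel and choosing a depth determines the height, so each quantity can be converted into the other.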

Abstract:
This paper proposes a new effective and efficient plug-and-play backbone for video-based person re-identification (ReID). Conventional video-based ReID methods typically use CNN or transformer backbones to extract deep features for every position in every sampled video frame. Here, we argue that this exhaustive feature extraction could be unnecessary, since we find that different frames in a ReID video often exhibit small differences and contain many similar regions due to the relatively slight movements of human beings. Inspired by this, a more selective, efficient paradigm is explored in this paper. Specifically, we introduce a patch selection mechanism to reduce computational cost by choosing only the crucial and non-repetitive patches for feature extraction. Additionally, we present a novel network structure that generates and utilizes pseudo frame global context to address the issue of incomplete views resulting from sparse inputs. By incorporating these new designs, our backbone can achieve both high performance and low computational cost. Extensive experiments on multiple datasets show that our approach reduces the computational cost by 74% compared to ViT-B and 28% compared to ResNet50, while the accuracy is on par with ViT-B and outperforms ResNet50 significantly.
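
One simple way to picture selecting only non-repetitive patches is frame differencing. The sketch below keeps a patch of the current frame only when it differs enough from the same patch in the previous frame. The function name `changed_patches`, the mean-absolute-difference criterion, and the threshold are all assumptions chosen for illustration; the abstract does not specify how the proposed patch selection mechanism scores patches.

```python
def changed_patches(prev_frame, curr_frame, patch=2, threshold=10.0):
    """Return top-left (row, col) origins of patches in curr_frame whose mean
    absolute difference from prev_frame exceeds threshold.

    Frames are equally sized 2D lists of grayscale values. Rows and columns
    that do not fill a whole patch at the border are ignored.
    """
    rows, cols = len(curr_frame), len(curr_frame[0])
    keep = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            total = sum(
                abs(curr_frame[i][j] - prev_frame[i][j])
                for i in range(r, r + patch)
                for j in range(c, c + patch)
            )
            if total / (patch * patch) > threshold:
                keep.append((r, c))
    return keep


# Toy 4x4 frames: only the bottom-right 2x2 patch changes between frames.
previous = [[0] * 4 for _ in range(4)]
current = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 50, 50], [0, 0, 50, 50]]
print(changed_patches(previous, current))
```

In this toy case only one of the four patches would be passed on for feature extraction, which is the kind of saving the selective paradigm aims for.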

Abstract:
Knowledge graphs (KGs) represent known entities and their relationships using triplets, but this method cannot represent relationships between facts, limiting their expressiveness. Recently, the Bi-level Knowledge Graph (Bi-level KG) has addressed this issue by modeling facts as nodes and establishing relationships between these facts, introducing two new tasks: triplet prediction and conditional link prediction. Existing methods enhance triplets through data augmentation and represent facts using entity representations. However, these methods do not address the isolated nodes at the structure level, nor do they effectively capture the information of facts at the feature level. To address these two issues, we design a data augmentation method that identifies isolated nodes by detecting anomalous structures and features in the graph. Subsequently, we perform similar subgraph matching for each isolated node to construct potential facts. To enrich the features of facts, we design a weighted combination initialization method for facts and introduce a new relation $\widetilde{R}$ to connect facts with related entities. This approach allows for the co-training of fact and entity representations during the training process. Extensive experiments validate the effectiveness of our data augmentation and co-training methods. Our model achieves optimal performance in triplet prediction and conditional link prediction tasks.

Abstract:
High-resolution satellite imagery with dense temporal series is crucial for long-term surface change monitoring. Spatiotemporal fusion seeks to reconstruct remote sensing image sequences with both high spatial and temporal resolutions by leveraging prior information from multiple satellite platforms. However, significant radiometric discrepancies and large spatial resolution variations between images acquired from different satellite sensors, coupled with the limited availability of prior data, present major challenges to accurately reconstructing missing data using existing methods. To address these challenges, this paper introduces GCM-PDA, a novel generative compensation model with progressive difference attenuation for spatiotemporal fusion of remote sensing images. The proposed model integrates multi-scale image decomposition within a progressive fusion framework, enabling the efficient extraction and integration of information across scales. Additionally, GCM-PDA employs domain adaptation techniques to mitigate radiometric inconsistencies between heterogeneous images. Notably, this study pioneers the use of style transformation in spatiotemporal fusion to achieve spatial-spectral compensation, effectively overcoming the constraints of limited prior image information. Experimental results demonstrate that GCM-PDA not only achieves competitive fusion performance but also exhibits strong robustness across diverse conditions.

Abstract:
Developing a unified model for surface anomaly detection remains challenging due to significant variations across product categories. Recent feature editing methods, as a branch of image reconstruction, mitigate the over-generalization of auto-encoders that leads to accurate anomaly reconstruction. However, these methods are only suited for texture-category products and have significant limitations in being generalized to other categories. In this article, we propose a multi-category anomaly editing network with a dual-branch training approach: one branch processes defect-free images (normal branch), while the other handles synthetic anomaly images (anomaly branch). Specifically, the paired samples are first fed into the multi-category anomaly feature editing based auto-encoder (MCAFE-AE) to perform image reconstruction and inpainting. In the normal branch, we propose a dual-entropy constrained deep embedded clustering module (DEC-DECM) to promote a more compact and orderly distribution of normal latent features, while avoiding trivial clustering solutions. Based on the clustering results, we further design a patch-based adaptive thresholding (PAT) strategy to adaptively calculate the threshold representing the central boundary of the cluster center for each local patch, thereby enabling the model to detect anomalies. Then, in the anomaly branch, we propose a multi-category anomaly feature editing module (MCAFEM) to identify anomalies in synthetic images and apply a category-oriented feature editing strategy to transform detected anomaly features into normal ones, thereby suppressing the reconstruction of anomalies. After completing the image reconstruction and inpainting, the input images from both branches and their respective output images are concatenated and fed into the correlation exploration and voxel-level attention based prediction network (CEVA-Net) for anomaly segmentation. 
The network is integrated with our proposed correlation-dependency exploration and voxel-level attention refinement module (CDE-VARM) and generates precise anomaly maps under the guidance of the bidirectional-path feature fusion (BPFF) and deep supervised learning (DSL). Extensive experiments on three datasets show that our method achieves state-of-the-art performance.

Abstract:
RGBT tracking draws increasing attention because of its robustness in multi-modal warranting (MMW) scenarios, such as nighttime and adverse weather conditions, where relying on a single sensing modality fails to ensure stable tracking results. However, existing benchmarks predominantly contain videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This weakens the representativeness of existing benchmarks in severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark considering the modality validity, MV-RGBT, captured specifically from MMW scenarios where either RGB (extreme illumination) or TIR (thermal truncation) modality is invalid. Hence, it is further divided into two subsets according to the valid modality, offering a new compositional perspective for evaluation and providing valuable insights for future designs. Moreover, MV-RGBT is the most diverse benchmark of its kind, featuring 36 different object categories captured across 19 distinct scenes. Furthermore, considering severe imaging conditions in MMW scenarios, a new problem is posed in RGBT tracking, named ‘when to fuse’, to stimulate the development of fusion strategies for such scenarios. To facilitate its discussion, we propose a new solution with a mixture of experts, named MoETrack, where each expert generates independent tracking results along with a confidence score. Extensive results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and elicit the conclusion that fusion is not always beneficial, especially in MMW scenarios. Besides, MoETrack achieves state-of-the-art results on several benchmarks, including MV-RGBT, GTOT, and LasHeR. Source codes and benchmarks are available at https://github.com/Zhangyong-Tang/MVRGBT

Abstract:
In this paper, we introduce MaeFuse, a novel autoencoder model designed for Infrared and Visible Image Fusion (IVIF). Existing approaches for image fusion often rely on training combined with downstream tasks to obtain high-level visual information, which is effective in emphasizing target objects and delivering impressive results in visual quality and task-specific applications. Instead of being driven by downstream tasks, MaeFuse utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates omni feature extraction for low-level reconstruction and high-level vision tasks, to obtain perception-friendly features at low cost. In order to eliminate the domain gap between different modal features and the block effect caused by the MAE encoder, we further develop a guided training strategy. This strategy is designed to ensure that the fusion layer seamlessly adjusts to the feature space of the encoder, gradually enhancing the fusion performance. The proposed method can facilitate the comprehensive integration of feature vectors from both infrared and visible modalities, thus preserving the rich details inherent in each modality. MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets. The code is available at https://github.com/Henry-Lee-real/MaeFuse.

Affiliations: College of Electronics and Information Engineering, Tongji University, Shanghai, China; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China; Department of Informatics, University of Thessaloniki, Thessaloniki, Greece; College of Electronics and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and the Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai, China

Abstract:
Stereo matching has emerged as a cost-effective solution for road surface 3D reconstruction, garnering significant attention towards improving both computational efficiency and accuracy. This article introduces decisive disparity diffusion (D3Stereo), marking the first exploration of dense deep feature matching that adapts pre-trained deep convolutional neural networks (DCNNs) to previously unseen road scenarios. A pyramid of cost volumes is initially created using various levels of learned representations. Subsequently, a novel recursive bilateral filtering algorithm is employed to aggregate these costs. A key innovation of D3Stereo lies in its alternating decisive disparity diffusion strategy, wherein intra-scale diffusion is employed to complete sparse disparity images, while inter-scale inheritance provides valuable prior information for higher resolutions. Extensive experiments conducted on our created UDTIRI-Stereo and Stereo-Road datasets underscore the effectiveness of D3Stereo strategy in adapting pre-trained DCNNs and its superior performance compared to all other explicit programming-based algorithms designed specifically for road surface 3D reconstruction. Additional experiments conducted on the Middlebury dataset with backbone DCNNs pre-trained on the ImageNet database further validate the versatility of D3Stereo strategy in tackling general stereo matching problems. Our source code and supplementary material are publicly available at https://mias.group/D3-Stereo.

Abstract:
Multispectral imaging aims at recording images in different spectral bands. This is extremely beneficial in diverse discrimination applications, for example in agriculture, recycling or healthcare. One approach for snapshot multispectral imaging, which is capable of recording multispectral videos, is to use camera arrays, where each camera records a different spectral band. Since the cameras are at different spatial positions, a registration procedure is necessary to map every camera to the same view. In this paper, we present a multispectral snapshot image registration method with three novel components. First, a cross spectral disparity estimation network is introduced, which is trained on a popular stereo database using pseudo spectral data augmentation. Subsequently, this disparity estimation is used to accurately detect occlusions by warping the disparity map in a layer-wise manner. Finally, these detected occlusions are reconstructed by a learned deep guided neural network, which leverages the structure from other spectral components. It is shown that each element of this registration process as well as the final result is superior to the current state of the art. In terms of PSNR, our registration achieves an improvement of over 3 dB. At the same time, the runtime is decreased by a factor of over 3 on a CPU. Additionally, the registration is executable on a GPU, where the runtime can be decreased by a factor of 113. The source code and the data are available at https://github.com/FAU-LMS/MSIR.
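
To put the reported PSNR gain of over 3 dB in context, the following minimal sketch computes PSNR from the mean squared error of flat pixel lists. The function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are assumptions made for illustration; the code is not taken from the MSIR repository.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example: the second estimate has half the MSE of the first.
reference = [100, 100, 100, 100]
estimate_a = [110, 110, 110, 110]  # MSE = 100
estimate_b = [110, 110, 100, 100]  # MSE = 50

gain_db = psnr(reference, estimate_b) - psnr(reference, estimate_a)
print(round(gain_db, 2))
```

Because PSNR equals 10 log10(peak^2 / MSE), halving the mean squared error raises PSNR by 10 log10(2), which is about 3.01 dB. A gain of over 3 dB therefore corresponds to cutting the MSE by more than half.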

Abstract:
Fine-grained object detection (FGOD) fundamentally comprises two primary tasks: object detection and fine-grained classification. In natural scenes, most FGOD methods benefit from higher instance resolution and less environmental variation, attributes more commonly associated with the latter task. In this paper, we propose a unified paradigm named Detector with Classifier2 (DC2), which explicitly considers the end-to-end integration of object detection and fine-grained classification tasks rather than prioritizing one aspect. Initially, our detection sub-network is restricted to determining only whether a proposal belongs to a coarse category and does not delve into the specific sub-categories. Moreover, in order to reduce redundant pixel-level calculation, we propose an instance-level feature enhancement (IFE) module to model the semantic similarities among proposals, which shows great potential for locating more instances in remote sensing images (RSIs). After obtaining the coarse detection predictions, we further construct a classification sub-network, which is built on top of the former branch to determine the specific sub-categories of the aforementioned predictions. Importantly, the detection network operates on the complete image, while the classification network conducts secondary modeling for the detected regions. These operations can be regarded as the extraction of global contextual information and local intrinsic cues for each instance. Therefore, we propose a multi-stream feature aggregation (MSFA) module to integrate global-stream semantic information and local-stream discriminative cues. Our whole DC2 network follows an end-to-end learning fashion, which effectively exploits the internal correlation between the detection and fine-grained classification networks. We evaluate the performance of our DC2 network on two benchmark datasets, SAT-MTB and HRSC2016.
Importantly, our method achieves new state-of-the-art results compared with recent works (approximately 7% mAP gains on SAT-MTB) and improves the baseline by a significant margin (43.2% vs. 36.7%) without any complicated post-processing strategies. Source code for the proposed method is available at https://github.com/zhengshangdong/DC2

Abstract:
Deep reinforcement learning-based object detection approaches center around a pivotal concept: hierarchically scaling image segments that harbor more intricate details. Compared with traditional object detection approaches, this strategy significantly reduces the number of region proposals, which is essential for curtailing the computational overhead. However, common deep reinforcement learning-based approaches suffer from a significant defect in terms of precision. This issue arises from inadequate representation of image states and the unstable learning ability of the agent. To address these issues, we present LHAR-RLD. First, we design the Low-dimensional RepVGG (LDR) feature extractor to reduce memory consumption and the difficulty of fitting downstream networks. Second, we propose the Hybrid DQN (HDQN) to enhance the agent’s ability to determine the state-action of images in complex environments. Then, the Adaptive Dynamic Reward Function (ADR) is crafted to dynamically adjust the reward based on shifts within the agent’s exploration environment. Finally, the ROI Align-based bounding box regression network (RABRNet) is proposed, which further regresses the localization results of reinforcement learning to improve the detection precision. Our method achieves 74.4% mAP on VOC2007, 76.2% mAP on COCO2017, and 75.2% Precision on the SF dataset, with 1.43G FLOPs. The precision exceeds that of advanced deep reinforcement learning approaches, and the computational cost is far lower than that of both these approaches and mainstream object detection methods. This method facilitates highly accurate object localization with minimal computational demands, which makes it well suited to resource-constrained devices.

Abstract:
Near-eye light field displays offer natural 3D visual experiences for AR/VR users by projecting light rays onto the retina as if the light rays emanated from a real object. Such displays normally take four-dimensional light field data as input. Given that a sizeable amount of existing 3D content is in the form of stereo images, we propose a practical approach that generates light field data from such content at minimal computational cost while maintaining a reasonable image quality. The perceptual quality of the light field is ensured by making the baseline of the light field subviews consistent with that of the micro-projectors of the light field display and by compensating for the optical artifacts of the light field display through digital rectification. The effectiveness and efficiency of the proposed approach are verified through both quantitative and qualitative experiments. The results demonstrate that our light field converter works for real-world light field displays.