TIP2024

Abstract:
Recent years have witnessed the incredible performance boost of data-driven deep visual object trackers. Despite the success, these trackers require millions of sequential manual labels on videos for supervised training, implying the heavy burden of human annotating. This raises a crucial question: how to train a powerful tracker from abundant videos using limited manual annotations? In this paper, we challenge the conventional belief that frame-by-frame labeling is indispensable, and show that providing a small number of annotated bounding boxes in each video is sufficient for training a strong tracker. To facilitate that, we design a novel SParsely-supervised Object Tracking (SPOT) framework. It regards the sparsely annotated boxes as anchors and progressively explores in the temporal span to discover unlabeled target snapshots. Under the teacher-student paradigm, SPOT leverages the unique transitive consistency inherent in the tracking task as supervision, extracting knowledge from both anchor snapshots and unlabeled target snapshots. We also utilize several effective training strategies, i.e., IoU filtering, asymmetric augmentation, and temporal calibration to further improve the training robustness of SPOT. The experimental results demonstrate that, given less than 5 labels for each video, trackers trained via SPOT perform on par with their fully-supervised counterparts. Moreover, our SPOT exhibits two desirable properties: 1) SPOT enables us to fully exploit large-scale video datasets by efficiently allocating sparse labels to more videos even under a limited labeling budget; 2) when equipped with a target discovery module, SPOT can even learn from purely unlabeled videos for performance gain. We hope this work could inspire the community to rethink the current annotation principles and make a step towards practical label-efficient deep tracking.

Abstract:
We tackle the problem of establishing dense correspondences between a pair of images in an efficient way. Most existing dense matching methods use 4D convolutions to filter incorrect matches, but 4D convolutions are highly inefficient due to their quadratic complexity. Besides, these methods learn features with fixed convolutions which cannot make learnt features robust to different challenge scenarios. To deal with these issues, we propose an Efficient Dynamic Correspondence Network (EDCNet) by jointly equipping pre-separate convolution (Psconv) and dynamic convolution (Dyconv) to establish dense correspondences in a coarse-to-fine manner. The proposed EDCNet enjoys several merits. First, two well-designed modules including a neighbourhood aggregation (NA) module and a dynamic feature learning (DFL) module are combined elegantly in the coarse-to-fine architecture, which is efficient and effective to establish both reliable and accurate correspondences. Second, the proposed NA module maintains linear complexity, showing its high efficiency. And our proposed DFL module has better flexibility to learn features robust to different challenges. Extensive experimental results show that our algorithm performs favorably against state-of-the-art methods on three challenging datasets including HPatches, Aachen Day-Night and InLoc.

Abstract:
Clustering is a fundamental and important step in many image processing tasks, such as face recognition and image segmentation. The performance of clustering can be largely enhanced if relevant weak supervision information is appropriately exploited. To achieve this goal, in this paper, we propose the Compound Weakly Supervised Clustering (CSWC) method. Concretely, CSWC incorporates two types of widely available and easily accessed weak supervision information from the label and feature aspects, respectively. To be specific, at the label level, the pairwise constraints are utilized as a kind of typical weak label supervision information. At the feature level, the partial instances collected from multiple perspectives have internal consistency and they are regarded as weak structure supervision information. To achieve a more confident clustering partition, we learn a unified graph with its similarity matrix to incorporate the above two types of weak supervision. On one hand, this similarity matrix is constructed by self-expression across the partial instances collected from multiple perspectives. On the other hand, the pairwise constraints, i.e., must-links and cannot-links, are considered by formulating a regularizer on the similarity matrix. Finally, the clustering results can be directly obtained according to the learned graph, without performing additional clustering techniques. Besides evaluating CSWC on 7 benchmark datasets, we also apply it to the application of face clustering in video data since it has vast application potentiality. Experimental results demonstrate the effectiveness of our algorithm in both incorporating compound weak supervision and identifying faces in real applications.

Abstract:
Structure-from-Motion (SfM) aims to recover 3D scene structures and camera poses based on the correspondences between input images, and thus the ambiguity caused by duplicate structures (i.e., different structures with strong visual resemblance) always results in incorrect camera poses and 3D structures. To deal with the ambiguity, most existing studies resort to additional constraint information or implicit inference by analyzing two-view geometries or feature points. In this paper, we propose to exploit high-level information in the scene, i.e., the spatial contextual information of local regions, to guide the reconstruction. Specifically, a novel structure is proposed, namely, track-community, in which each community consists of a group of tracks and represents a local segment in the scene. A community detection algorithm is performed on the track-graph to partition the scene into segments. Then, the potential ambiguous segments are detected by analyzing the neighborhood of tracks and corrected by checking the pose consistency. Finally, we perform partial reconstruction on each segment and align them with a novel bidirectional consistency cost function which considers both 3D-3D correspondences and pairwise relative camera poses. Experimental results demonstrate that our approach can robustly alleviate reconstruction failure resulting from visually indistinguishable structures and accurately merge the partial reconstructions.

Abstract:
Action quality assessment (AQA) is to assess how well an action is performed. Previous works perform modelling by only the use of visual information, ignoring audio information. We argue that although AQA is highly dependent on visual information, the audio is useful complementary information for improving the score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow and audio information, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of with three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the modality-specific information from the modality-specific branches. To build the bridge between modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that using an invariant multimodal fusion policy may lead to suboptimal results, so as to take the potential diversity in different parts of an action into consideration. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies in different parts of an action. This module consists of several FusionNets for exploring different multimodal fusion strategies and a PolicyNet for deciding which FusionNets are enabled. Third, a module called Cross-modal Feature Decoder is designed to transfer cross-modal features generated by Adaptive Fusion Module to the mixed-modality branch. Our extensive experiments validate the efficacy of the proposed method, and our method achieves state-of-the-art performance on two public datasets. Code is available at https://github.com/qinghuannn/PAMFN.

Abstract:
Action tube detection is a challenging task as it requires not only to locate action instances in each frame, but also link them in time. Existing action tube detection methods often employ multi-stage pipelines with complex designs and time-consuming linking procedure. In this paper, we present a simple end-to-end action tube detection method, termed as Sparse Tube Detector (STDet). Unlike those dense action detectors, our core idea is to use a set of learnable tube queries and directly decode them into action tubes (i.e., a set of tracked boxes with action label) from video content. This sparse detection paradigm shares several advantages. First, the large number of hand-crafted anchor candidates in dense action detectors is greatly reduced to a small number of learnable tubes, which results in a more efficient detection framework. Second, our learnable tube queries directly attend the whole video content, which endows our method with the capacity of capturing long-range information for action detection. Finally, our action detector is an end-to-end tube detection without requiring the linking procedure, which directly and explicitly predicts the action boundary instead of depending on the linking strategy. Extensive experiments shows that our STDet outperforms the previous state-of-the-art methods on two challenging untrimmed video action detection datasets of UCF101-24 and MultiSports. We hope our method will be an simple end-to-end tube detection baseline and can inspire new ideas in this direction.

Abstract:
This paper investigates a novel unpaired video dehazing framework, which can be a good candidate in practice by relieving pressure from collecting paired data. In such a paradigm, two key issues including 1) temporal consistency uninvolved in single image dehazing, and 2) better dehazing ability need to be considered for satisfied performance. To handle the mentioned problems, we alternatively resort to introducing depth information to construct additional regularization and supervision. Specifically, we attempt to synthesize realistic motions with depth information to improve the effectiveness and applicability of traditional temporal losses, and thus better regularizing the spatiotemporal consistency. Moreover, the depth information is also considered in terms of adversarial learning. For haze removal, the depth information guides the local discriminator to focus on regions where haze residuals are more likely to exist. The dehazing performance is consequently improved by more pertinent guidance from our depth-aware local discriminator. Extensive experiments are conducted to validate our effectiveness and superiority over other competitors. To the best of our knowledge, this study is the initial foray into the task of unpaired video dehazing. Our code is available at https://github.com/YaN9-Y/DUVD.

Abstract:
Detecting ellipses poses a challenging low-level task indispensable to many image analysis applications. Existing ellipse detection methods commonly encounter two fundamental issues. First, inferior detection accuracy could be incurred on a small ellipse than that on a large one; this introduces the scale issue. Second, inferior detection accuracy could be yielded along the minor axis than along the major one of the same ellipse; this leads to the anisotropy issue. To address these issues simultaneously, a novel anisotropic scale-invariant (ASI) ellipse detection methodology is proposed. Our basic idea is to perform ellipse detection in a transformed image space referred to as the ellipse normalization (EN) space, in which the desired ellipse from the original image is ‘normalized’ to the unit circle. With the establishment of the EN-space, an analytical ellipse fitting scheme and a set of distance measures are developed. Theoretical justifications are then derived to prove that both our ellipse fitting scheme and distance measures are invariant to anisotropic scaling, and thus each ellipse can be detected with the same accuracy regardless of its size and ellipticity. By incorporating these components into two recent state-of-the-art algorithms, two ASI ellipse detectors are finally developed and exploited to verify the effectiveness of our proposed methodology.

Abstract:
Texture similarity plays important roles in texture analysis and material recognition. However, perceptually-consistent fine-grained texture similarity prediction is still challenging. The discrepancy between the texture similarity data obtained using algorithms and human visual perception has been demonstrated. This dilemma is normally attributed to the texture representation and similarity metric utilised by the algorithms, which are inconsistent with human perception. To address this challenge, we introduce a Perception-Aware Texture Similarity Prediction Network (PATSP-Net). This network comprises a Bilinear Lateral Attention Transformer network (BiLAViT) and a novel loss function, namely, RSLoss. The BiLAViT contains a Siamese Feature Extraction Subnetwork (SFEN) and a Metric Learning Subnetwork (MLN), designed on top of the mechanisms of human perception. On the other hand, the RSLoss measures both the ranking and the scaling differences. To our knowledge, either the BiLAViT or the RSLoss has not been explored for texture similarity tasks. The PATSP-Net performs better than, or at least comparably to, its counterparts on three data sets for different fine-grained texture similarity prediction tasks. We believe that this promising result should be due to the joint utilization of the BiLAViT and RSLoss, which is able to learn the perception-aware texture representation and similarity metric.

Abstract:
Transformers show a great impact on visual tracking thanks to their powerful representation learning capabilities. As the capacity of the model grows, the speed of the tracker tends to decrease gradually. Our work focuses on dealing with massively redundant information in tracking sequences with the Saliency Region Tracker (SRTrack). SRTrack is a heuristic two-stage tracker consisting of a lightweight tracking stage and a saliency stage. The former can handle simple tracking sequences while the latter is designed to perform delicate tracking on challenging frames with more discriminative features. However, the two-stage design leads to feature extrapolation, creating inconsistencies between training and inference features. In order to mitigate this problem, we develop an attention scaling factor that guarantees model robustness while yielding a slight performance gain. Our SRTrack achieves a state-of-the-art 0.699 AUC running at 61 FPS on LaSOT. Several experiments on large benchmarks demonstrate the high efficiency and accuracy of SRTrack.

Abstract:
This paper focuses on the facial micro-expression (FME) generation task, which has potential application in enlarging digital FME datasets, thereby alleviating the lack of training data with labels in existing micro-expression datasets. Despite obvious progress in the image animation task, FME generation remains challenging because existing image animation methods can hardly encode subtle and short-term facial motion information. To this end, we present a facial-prior-guided FME generation framework that takes advantage of facial priors for facial motion generation. Specifically, we first estimate the geometric locations of action units (AUs) with detected facial landmarks. We further calculate an adaptive weighted prior (AWP) map, which alleviates the estimation error of AUs while efficiently capturing subtle facial motion patterns. To achieve smooth and realistic synthesis results, we use our proposed facial prior module to guide motion representation and generation modules in mainstream image animation frameworks. Extensive experiments on three benchmark datasets consistently show that our proposed facial prior module can be adopted in image animation frameworks and significantly improve their performance on micro-expression generation. Moreover, we use the generation technique to enlarge existing datasets, thereby improving the performance of general action recognition backbones on the FME recognition task. Our code is available at https://github.com/sysu19351158/FPB-FOMM.

Abstract:
Contrastive learning has revolutionized the field of computer vision, learning rich representations from unlabeled data, which generalize well to diverse vision tasks. Consequently, it has become increasingly important to explain these approaches and understand their inner workings mechanisms. Given that contrastive models are trained with interdependent and interacting inputs and aim to learn invariance through data augmentation, the existing methods for explaining single-image systems (e.g., image classification models) are inadequate as they fail to account for these factors and typically assume independent inputs. Additionally, there is a lack of evaluation metrics designed to assess pairs of explanations, and no analytical studies have been conducted to investigate the effectiveness of different techniques used to explaining contrastive learning. In this work, we design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images. We further adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations and evaluate our proposed methods with these metrics. Finally, we present a thorough analysis of visual explainability methods for contrastive learning, establish their correlation with downstream tasks and demonstrate the potential of our approaches to investigate their merits and drawbacks.

Abstract:
This paper presents Slime, a novel non-deep image matching framework which models the scene as rough local overlapping planes. This intermediate representation sits in-between the local affine approximation of the keypoint patches and the global matching based on both spatial and similarity constraints, providing a progressive pruning of the correspondences, as planes are easier to handle with respect to general scenes. Slime decomposes the images into overlapping regions at different scales and computes loose planar homographies. Planes are mutually extended by compatible matches and the images are split into fixed tiles, with only the best homographies retained for each pair of tiles. Stable matches are identified according to the consensus of the admissible stereo configurations provided by pairwise homographies. Within tiles, the rough planes are then merged according to their overlap in terms of matches and further consistent correspondences are extracted. The whole process only involves homography constraints. As a result, both the coverage and the stability of correct matches over the scene are amplified, together with the ability to spot matches in challenging scenes, allowing traditional hybrid matching pipelines to make up lost ground against recent end-to-end deep matching methods. In addition, the paper gives a thorough comparative analysis of recent state-of-the-art in image matching represented by end-to-end deep networks and hybrid pipelines. The evaluation considers both planar and non-planar scenes, taking into account critical and challenging scenarios including abrupt temporal image changes and strong variations in relative image rotations. According to this analysis, although the impressive progress done in this field, there is still a wide room for improvements to be investigated in future research.

Abstract:
Taking advantages of the quaternion representation of the color image, this paper proposes a quaternion perceptual seamline detection model to generate the seamline in the quaternion domain. It considers seamline detection as a quaternion-domain color image labeling problem and minimizes the local-area quaternion perceptual difference cost to obtain the optimal seamline. To assess seamline quality effectively, we develop a quaternion perceptual seamline quality measure. Based on the proposed quaternion perceptual seamline detection model and quality measure, we further propose a general framework for automatic quaternion-domain color image stitching (AQCIS). To the best of our knowledge, this is the first attempt to perform color image stitching completely in the quaternion domain. Meanwhile, AQCIS introduces the joint optimization strategy of local alignment and seamline in an iterative fashion. Extensive experiments on challenging datasets demonstrate that our AQCIS achieves superior performance for color image stitching in comparison with state-of-the-art methods.

Abstract:
Bimodal objects, such as the checkerboard pattern used in camera calibration, markers for object tracking, and text on road signs, to name a few, are prevalent in our daily lives and serve as a visual form to embed information that can be easily recognized by vision systems. While binarization from intensity images is crucial for extracting the embedded information in the bimodal objects, few previous works consider the task of binarization of blurry images due to the relative motion between the vision sensor and the environment. The blurry images can result in a loss in the binarization quality and thus degrade the downstream applications where the vision system is in motion. Recently, neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first deblur and then binarize the images in a real-time manner. In this work, we propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target’s properties to perform inference independently in both event space and image space and merge the results from both domains to generate a sharp binary image. We also develop an efficient integration method to propagate this binary image to high frame rate binary video. Finally, we develop a novel method to naturally fuse events and images for unsupervised threshold identification. The proposed method is evaluated in publicly available and our collected data sequence, and shows the proposed method can outperform the SOTA methods to generate high frame rate binary video in real-time on CPU-only devices.

Abstract:
Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress for RIS, including different fusion method designs. In this work, we explore an essential question, “What if the text description is wrong or misleading?” For example, the described objects are not in the image. We term such a sentence as a negative sentence. However, existing solutions for RIS cannot handle such a setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the negative sentence inputs besides the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences and propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design can be easily extended to our R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at https://github.com/jianzongwu/robust-ref-seg.

Abstract:
Depth information opens up new opportunities for video object segmentation (VOS) to be more accurate and robust in complex scenes. However, the RGBD VOS task is largely unexplored due to the expensive collection of RGBD data and time-consuming annotation of segmentation. In this work, we first introduce a new benchmark for RGBD VOS, named DepthVOS, which contains 350 videos (over 55k frames in total) annotated with masks and bounding boxes. We futher propose a novel, strong baseline model - Fused Color-Depth Network (FusedCDNet), which can be trained solely under the supervision of bounding boxes, while being used to generate masks with a bounding box guideline only in the first frame. Thereby, the model possesses three major advantages: a weakly-supervised training strategy to overcome the high-cost annotation, a cross-modal fusion module to handle complex scenes, and weakly-supervised inference to promote ease of use. Extensive experiments demonstrate that our proposed method performs on par with top fully-supervised algorithms. We will open-source our project on https://github.com/yjybuaa/depthvos/ to facilitate the development of RGBD VOS.

Abstract:
Video restoration aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires to utilize temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle with this by exploiting a sliding window strategy or a recurrent architecture, which are restricted by frame-by-frame restoration. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction ability. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal reciprocal self attention (TRSA) and parallel warping. TRSA divides the video into small clips, on which reciprocal attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16dB) on fourteen benchmark datasets. The codes are available at https://github.com/JingyunLiang/VRT.

Abstract:
Cross-modal retrieval (e.g., query a given image to obtain a semantically similar sentence, and vice versa) is an important but challenging task, as the heterogeneous gap and inconsistent distributions exist between different modalities. The dominant approaches struggle to bridge the heterogeneity by capturing the common representations among heterogeneous data in a constructed subspace which can reflect the semantic closeness. However, insufficient consideration is taken into the fact that learned latent representations are actually heavily entangled with those semantic-unrelated features, which obviously further compounds the challenges of cross-modal retrieval. To alleviate the difficulty, this work makes an assumption that the data are jointly characterized by two independent features: semantic-shared and semantic-unrelated representations. The former presents characteristics of consistent semantics shared by different modalities, while the latter reflects the characteristics with respect to the modality yet unrelated to semantics, such as background, illumination, and other low-level information. Therefore, this paper aims to disentangle the shared semantics from the entangled features, andthus the purer semantic representation can promote the closeness of paired data. Specifically, this paper designs a novel Semantics Disentangling approach for Cross-Modal Retrieval (termed as SDCMR) to explicitly decouple the two different features based on variational auto-encoder. Next, the reconstruction is performed by exchanging shared semantics to ensure the learning of semantic consistency. Moreover, a dual adversarial mechanism is designed to disentangle the two independent features via a pushing-and-pulling strategy. Comprehensive experiments on four widely used datasets demonstrate the effectiveness and superiority of the proposed SDCMR method by achieving a new bar on performance when compared against 15 state-of-the-art methods.

Abstract:
Gaze estimation is an important fundamental task in computer vision and medical research. Existing works have explored various effective paradigms and modules for precisely predicting eye gazes. However, the uncertainty for gaze estimation, e.g., input uncertainty and annotation uncertainty, have been neglected in previous research. Existing models use a deterministic function to estimate the gaze, which cannot reflect the actual situation in gaze estimation. To address this issue, we propose a probabilistic framework for gaze estimation by modeling the input uncertainty and annotation uncertainty. We first utilize probabilistic embeddings to model the input uncertainty, representing the input image as a Gaussian distribution in the embedding space. Based on the input uncertainty modeling, we give an instance-wise uncertainty estimation to measure the confidence of prediction results, which is critical in practical applications. Then, we propose a new label distribution learning method, probabilistic annotations, to model the annotation uncertainty, representing the raw hard labels as Gaussian distributions. In addition, we develop an Embedding Distribution Smoothing (EDS) module and a hard example mining method to improve the consistency between embedding distribution and label distribution. We conduct extensive experiments, demonstrating that the proposed approach achieves significant improvements over baseline and state-of-the-art methods on two widely used benchmark datasets, GazeCapture and MPIIFaceGaze, as well as our collected dataset using mobile devices.

Abstract:
Transformer-based instance-level recognition has attracted increasing research attention recently due to the superior performance. However, although attempts have been made to encode masks as embeddings into Transformer-based frameworks, how to combine mask embeddings and spatial information for a transformer-based approach is still not fully explored. In this paper, we revisit the design of mask-embedding-based pipelines and propose an Instance Segmentation TRansformer (ISTR) with Mask Meta-Embeddings (MME), leveraging the strengths of transformer models in encoding embedding information and incorporating spatial information from mask embeddings. ISTR incorporates a recurrent refining head that consists of a Dynamic Box Predictor (DBP), a Mask Information Generator (MIG), and a Mask Meta-Decoder (MMD). To improve the quality of mask embeddings, MME interprets the mask encoding-decoding processes as a mutual information maximization problem, which unifies the objective functions of different decoding schemes such as Principal Component Analysis (PCA) and Discrete Cosine Transform (DCT) with a meta-formulation. Under the meta-formulation, a learnable Spatial Mask Tuner (SMT) is further proposed, which fuses the spatial and embedding information produced from MIG and can significantly boost the segmentation performance. The resulting varieties, i.e., ISTR-PCA, ISTR-DCT, and ISTR-SMT, demonstrate the effectiveness and efficiency of incorporating mask embeddings with the query-based instance segmentation pipelines. On the COCO dataset, ISTR surpasses all predominant mask-embedding-based models by a large margin, and achieves competitive performance compared to concurrent state-of-the-art models. On the Cityscapes dataset, ISTR also outperforms several strong baselines. Our code has been made available at: https://github.com/hujiecpp/ISTR.

Abstract:
Depth estimation is a fundamental task in many vision applications. With the popularity of omnidirectional cameras, it becomes a new trend to tackle this problem in the spherical space. In this paper, we propose a learning-based method for predicting dense depth values of a scene from a monocular omnidirectional image. An omnidirectional image has a full field-of-view, providing much more complete descriptions of the scene than perspective images. However, fully-convolutional networks that most current solutions rely on fail to capture rich global contexts from the panorama. To address this issue and also the distortion of equirectangular projection in the panorama, we propose Cubemap Vision Transformers (CViT), a new transformer-based architecture that can model long-range dependencies and extract distortion-free global features from the panorama. We show that cubemap vision transformers have a global receptive field at every stage and can provide globally coherent predictions for spherical signals. As a general architecture, it removes any restriction that has been imposed on the panorama in many other monocular panoramic depth estimation methods. To preserve important local features, we further design a convolution-based branch in our pipeline (dubbed GLPanoDepth) and fuse global features from cubemap vision transformers at multiple scales. This global-to-local strategy allows us to fully exploit useful global and local features in the panorama, achieving state-of-the-art performance in panoramic depth estimation.

Abstract:
The ability to detect and track the dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to track objects belonging to the pre-defined closed-set categories. Recently, Generic MOT (GMOT) is proposed to track interested objects beyond pre-defined categories and it can be divided into Open-Vocabulary MOT (OVMOT) and Template-Image-based MOT (TIMOT). Taking the consideration that the expensive well pre-trained (vision-)language model and fine-grained category annotations are required to train OVMOT models, in this paper, we focus on TIMOT and propose a simple but effective method, Siamese-DETR. Only the commonly used detection datasets (e.g., COCO) are required for training. Different from existing TIMOT methods, which train a Single Object Tracking (SOT) based detector to detect interested objects and then apply a data association based MOT tracker to get the trajectories, we leverage the inherent object queries in DETR variants. Specifically: 1) The multi-scale object queries are designed based on the given template image, which are effective for detecting different scales of objects with the same category as the template image; 2) A dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, which takes full advantage of provided annotations; 3) The online tracking pipeline is simplified through a tracking-by-query manner by incorporating the tracked boxes in the previous frame as additional query boxes. The complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on GMOT-40 dataset by a large margin.

Abstract:
Consensus representation learning is one of the most popular approaches in the field of multi-view clustering. However, most of the existing methods cannot learn discriminative representations with a clustering-friendly structure since these methods ignore the separation among clusters and the compactness within each cluster. To tackle this issue, we propose a new deep multi-view clustering network with a dual contrastive mechanism to learn clustering-friendly representations. Specifically, our method employs dual contrasting losses: a dynamic cluster diffusion loss to maximize the distance between different clusters and a reliable neighbor-guided positive alignment loss to enhance compactness within each cluster. Our approach includes several key components: view-specific encoders to extract high-level features from each view, and an adaptive feature fusion strategy to obtain consensus representations across multiple views. The dynamic cluster diffusion module ensures inter-cluster separation by maximizing distances between different clusters in the consensus feature space. Simultaneously, the reliable neighbor-guided positive alignment module improves within-cluster compactness through a pseudo-label and nearest neighbor structure-driven contrastive loss. Experimental results on several datasets show that our method can acquire clustering-friendly representations with both good properties of inter-cluster separation and within-cluster compactness, and outperforms the existing state-of-the-art approaches in clustering performance. Our source code is available at https://github.com/tweety1028/DCMVC.

Abstract:
The proliferation of Few-Shot Class Incremental Learning (FSCIL) methodologies has highlighted the critical challenge of maintaining robust anti-amnesia capabilities in FSCIL learners. In this paper, we present a novel conceptualization of anti-amnesia in terms of mathematical generalization, leveraging the Neural Tangent Kernel (NTK) perspective. Our method focuses on two key aspects: ensuring optimal NTK convergence and minimizing NTK-related generalization loss, which serve as the theoretical foundation for cross-task generalization. To achieve global NTK convergence, we introduce a principled meta-learning mechanism that guides optimization within an expanded network architecture. Concurrently, to reduce the NTK-related generalization loss, we systematically optimize its constituent factors. Specifically, we initiate self-supervised pre-training on the base session to enhance NTK-related generalization potential. These self-supervised weights are then carefully refined through curricular alignment, followed by the application of dual NTK regularization tailored specifically for both convolutional and linear layers. Through the combined effects of these measures, our network acquires robust NTK properties, ensuring optimal convergence and stability of the NTK matrix and minimizing the NTK-related generalization loss, significantly enhancing its theoretical generalization. On popular FSCIL benchmark datasets, our NTK-FSCIL surpasses contemporary state-of-the-art approaches, elevating end-session accuracy by 2.9% to 9.3%.

Abstract:
Diffusion models have demonstrated their great ability to generate high-quality images for various tasks. With such a strong performance, diffusion models can potentially pose a severe threat to both humans and deep learning models, e.g., DNNs and MLLMs. However, their abilities as adversaries have not been well explored. Among different adversarial scenarios, the no-box adversarial attack is the most practical one, as it assumes that the attacker has no access to the training dataset or the target model. Existing works still require some data from the training dataset, which may not be feasible in real-world scenarios. In this paper, we investigate the adversarial capabilities of diffusion models by conducting no-box attacks solely using data generated by diffusion models. Specifically, our attack method generates a synthetic dataset using diffusion models to train a substitute model. We then employ a classification diffusion model to fine-tune the substitute model, considering model uncertainty and incorporating noise augmentation. Finally, we sample adversarial examples from the diffusion models using the average approximation over the diffusion substitute model with multiple inferences. Extensive experiments on the ImageNet dataset demonstrate that the proposed attack method achieves state-of-the-art performance in both no-box attack and black-box attack scenarios.

Abstract:
Nowadays, data in the real world often comes from multiple sources, but most existing multi-view K -Means perform poorly on linearly non-separable data and require initializing the cluster centers and calculating the mean, which causes the results to be unstable and sensitive to outliers. This paper proposes an efficient multi-view K -Means to solve the above-mentioned issues. Specifically, our model avoids the initialization and computation of clusters centroid of data. Additionally, our model use the Butterworth filters function to transform the adjacency matrix into a distance matrix, which makes the model is capable of handling linearly inseparable data and insensitive to outliers. To exploit the consistency and complementarity across multiple views, our model constructs a third tensor composed of discrete index matrices of different views and minimizes the tensor’s rank by tensor Schatten p -norm. Experiments on two artificial datasets verify the superiority of our model on linearly inseparable data, and experiments on several benchmark datasets illustrate the performance.

Abstract:
Recent developments in the field of non-local attention (NLA) have led to a renewed interest in self-similarity-based single image super-resolution (SISR). Researchers usually use the NLA to explore non-local self-similarity (NSS) in SISR and achieve satisfactory reconstruction results. However, a surprising phenomenon that the reconstruction performance of the standard NLA is similar to that of the NLA with randomly selected regions prompted us to revisit NLA. In this paper, we first analyzed the attention map of the standard NLA from different perspectives and discovered that the resulting probability distribution always has full support for every local feature, which implies a statistical waste of assigning values to irrelevant non-local features, especially for SISR which needs to model long-range dependence with a large number of redundant non-local features. Based on these findings, we introduced a concise yet effective soft thresholding operation to obtain high-similarity-pass attention (HSPA), which is beneficial for generating a more compact and interpretable distribution. Furthermore, we derived some key properties of the soft thresholding operation that enable training our HSPA in an end-to-end manner. The HSPA can be integrated into existing deep SISR models as an efficient general building block. In addition, to demonstrate the effectiveness of the HSPA, we constructed a deep high-similarity-pass attention network (HSPAN) by integrating a few HSPAs in a simple backbone. Extensive experimental results demonstrate that HSPAN outperforms state-of-the-art approaches on both quantitative and qualitative evaluations. Our code and a pre-trained model were uploaded to GitHub (https://github.com/laoyangui/HSPAN) for validation.

Abstract:
We present a plug-and-play Image Signal Processor (ISP) for image enhancement to better produce diverse image styles than the previous works. Our proposed method, ContRollable Image Signal Processor (CRISP), explicitly controls the parameters of the ISP that determine output image styles. ISP parameters for high-quality (HQ) image styles are encoded into low-dimensional latent codes, allowing fast and easy style adjustments. We empirically show that CRISP covers a wide range of image styles with high efficiency. On the MIT-Adobe FiveK dataset, CRISP can very closely estimate the reference styles produced by human experts and achieves better MOS with diverse image styles. Compared with the state-of-the-art method, our ISP comprises only 19 parameters, allowing CRISP to have 2× smaller parameters and 100× reduced FLOPs for an image output. CRISP outperforms previous works in PSNR and FLOPs with several scenarios for style adjustments.

Abstract:
In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by taking depths as an additional channel of RGB images or learning 2D affinities to perform depth propagation. However, they fail to exploit 3D structures implied in the depth channel, thereby losing the informative 3D knowledge that provides important priors to distinguish the foreground and background features. Moreover, since these methods rely on the color textures of 2D images, it is challenging for them to handle poor-texture regions without the guidance of explicit 3D cues. To address this, we disentangle the hierarchical 3D scene-level structure from the RGB-D input and construct a pathway to make sharp depth boundaries and object shape outlines accessible to 2D features. Specifically, we extract 2D and 3D features from depth inputs and the back-projected point clouds respectively by building a two-stream network. To leverage 3D structures, we construct several cross-modal transformers to adaptively propagate multi-scale 3D structural features to the 2D stream, energizing 2D features with priors of object shapes and local geometries. Experimental results show that our SCMT achieves state-of-the-art performance on three popular outdoor (KITTI) and indoor (VOID and NYU) benchmarks.

Abstract:
This paper presents a novel fine-grained task for traffic accident analysis. Accident detection in surveillance or dashcam videos is a common task in the field of traffic accident analysis by using videos. However, common accident detection does not analyze the specific particulars of the accident, only identifies the accident’s existence or occurrence time in a video. In this paper, we define the novel fine-grained accident detection task which contains fine-grained accident classification, temporal-spatial occurrence region localization, and accident severity estimation. A transformer-based framework combining the RGB and optical flow information of videos is proposed for fine-grained accident detection. Additionally, we introduce a challenging Fine-grained Accident Detection (FAD) database that covers multiple tasks in surveillance videos which places more emphasis on the overall perspective. Experimental results demonstrate that our model could effectively extract the video features for multiple tasks, indicating that current traffic accident analysis has limitations in dealing with the FAD task and that further research is indeed needed.

Abstract:
Domain adaptation leverages labeled data from a source domain to learn an accurate classifier for an unlabeled target domain. Since the data collected in practical applications usually contain noise, the weakly-supervised domain adaptation algorithm has attracted widespread attention from researchers that tolerates the source domain with label noises or/and features noises. Several weakly-supervised domain adaptation methods have been proposed to mitigate the difficulty of obtaining the high-quality source domains that are highly related to the target domain. However, these methods assume to obtain the accurate noise rate in advance to reduce the negative transfer caused by noises in source domain, which limits the application of these methods in the real world where the noise rate is unknown. Meanwhile, since source data usually comes from multiple domains, the naive application of single-source domain adaptation algorithms may lead to sub-optimal results. We hence propose a universal and scalable weakly-supervised domain adaptation method called PDCAS to ease restraints of such assumptions and make it more general. Specifically, PDCAS includes two stages: progressive distillation and domain alignment. In progressive distillation stage, we iteratively distill out potentially clean samples whose annotated labels are highly consistent with the prediction of model and correct labels for noisy source samples. This process is non-supervision by exploiting intrinsic similarity to measure and extract initial corrected samples. In domain alignment stage, we consider Class-Aligned Sampling which balances the samples for both source and target domains along with the global feature distributions to alleviate the shift of label distributions. Finally, we apply PDCAS in multi-source noisy scenario and propose a novel multi-source weakly-supervised domain adaptation method called MSPDCAS, which shows the scalability of our framework. Extensive experiments on Office-31 and Office-Home datasets demonstrate the effectiveness and robustness of our method compared to state-of-the-art methods.

Abstract:
We present ReGO (Reference-Guided Outpainting), a new method for the task of sketch-guided image outpainting. Despite the significant progress made in producing semantically coherent content, existing outpainting methods often fail to deliver visually appealing results due to blurry textures and generative artifacts. To address these issues, ReGO leverages neighboring reference images to synthesize texture-rich results by transferring pixels from them. Specifically, an Adaptive Content Selection (ACS) module is incorporated into ReGO to facilitate pixel transfer for texture compensating of the target image. Additionally, a style ranking loss is introduced to maintain consistency in terms of style while preventing the generated part from being influenced by the reference images. ReGO is a model-agnostic learning paradigm for outpainting tasks. In our experiments, we integrate ReGO with three state-of-the-art outpainting models to evaluate its effectiveness. The results obtained on three scenery benchmarks, i.e. NS6K, NS8K and SUN Attribute, demonstrate the superior performance of ReGO compared to prior art in terms of texture richness and authenticity. Our code is available at https://github.com/wangyxxjtu/ReGO-Pytorch.

Abstract:
Open-set object detection (OSOD) aims to detect the known categories and reject unknown objects in a dynamic world, which has achieved significant attention. However, previous approaches only consider this problem in data-abundant conditions, while neglecting the few-shot scenes. In this paper, we seek a solution for the generalized few-shot open-set object detection (G-FOOD), which aims to avoid detecting unknown classes as known classes with a high confidence score while maintaining the performance of few-shot detection. The main challenge for this task is that few training samples induce the model to overfit on the known classes, resulting in a poor open-set performance. We propose a new G-FOOD algorithm to tackle this issue, named Few-shOt Open-set Detector (FOOD), which contains a novel class weight sparsification classifier (CWSC) and a novel unknown decoupling learner (UDL). To prevent over-fitting, CWSC randomly sparses parts of the normalized weights for the logit prediction of all classes, and then decreases the co-adaptability between the class and its neighbors. Alongside, UDL decouples training the unknown class and enables the model to form a compact unknown decision boundary. Thus, the unknown objects can be identified with a confidence probability without any threshold, prototype, or generation. We compare our method with several state-of-the-art OSOD methods in few-shot scenes and observe that our method improves the F-score of unknown classes by 4.80%-9.08% across all shots in VOC-COCO dataset settings. Source code is available on-line at https://github.com/binyisu/food.

Affiliations: School of Computer Science and Engineering (Southeast University), the National Center of Technology Innovation for EDA, and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education (Southeast University), Nanjing, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Computer Science and Engineering, and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China

Abstract:
Deep learning has made significant advancements in supervised learning. However, models trained in this setting often face challenges due to domain shift between training and test sets, resulting in a significant drop in performance during testing. To address this issue, several domain generalization methods have been developed to learn robust and domain-invariant features from multiple training domains that can generalize well to unseen test domains. Data augmentation plays a crucial role in achieving this goal by enhancing the diversity of the training data. In this paper, inspired by the observation that normalizing an image with different statistics generated by different batches with various domains can perturb its feature, we propose a simple yet effective method called NormAUG (Normalization-guided Augmentation). Our method includes two paths: the main path and the auxiliary (augmented) path. During training, the auxiliary path includes multiple sub-paths, each corresponding to batch normalization for a single domain or a random combination of multiple domains. This introduces diverse information at the feature level and improves the generalization of the main path. Moreover, our NormAUG method effectively reduces the existing upper boundary for generalization based on theoretical perspectives. During the test stage, we leverage an ensemble strategy to combine the predictions from the auxiliary path of our model, further boosting performance. Extensive experiments are conducted on multiple benchmark datasets to validate the effectiveness of our proposed method.

Abstract:
The problem of sketch semantic segmentation is far from being solved. Despite existing methods exhibiting near-saturating performances on simple sketches with high recognisability, they suffer serious setbacks when the target sketches are products of an imaginative process with high degree of creativity. We hypothesise that human creativity, being highly individualistic, induces a significant shift in distribution of sketches, leading to poor model generalisation. Such hypothesis, backed by empirical evidences, opens the door for a solution that explicitly disentangles creativity while learning sketch representations. We materialise this by crafting a learnable creativity estimator that assigns a scalar score of creativity to each sketch. It follows that we introduce CreativeSeg, a learning-to-learn framework that leverages the estimator in order to learn creativity-agnostic representation, and eventually the downstream semantic segmentation task. We empirically verify the superiority of CreativeSeg on the recent “Creative Birds” and “Creative Creatures” creative sketch datasets. Through a human study, we further strengthen the case that the learned creativity score does indeed have a positive correlation with the subjective creativity of human. Codes are available at https://github.com/PRIS-CV/Sketch-CS.

Abstract:
In this paper, we propose an anycost network quantization method for efficient image super-resolution with variable resource budgets. Conventional quantization approaches acquire discrete network parameters for deployment with fixed complexity constraints, while image super-resolution networks are usually applied on mobile devices with frequently modified resource budgets due to the change of battery levels or computing chips. Hence, exhaustively optimizing quantized networks with each complexity constraint results in unacceptable training costs. On the contrary, we construct a hyper-network whose parameters can efficiently adapt to different resource budgets with negligible finetuning cost, so that the image super-resolution networks can be feasibly deployed in diversified devices with variable resource budgets. Specifically, we dynamically search the optimal bitwidth for each patch in convolution according to feature maps and complexity constraints, which aims to achieve the best efficiency-accuracy trade-off in image super-resolution given the resource budget. To acquire the hyper-network that can be efficiently adapted to different bitwidth settings, we actively sample the patch-wise bitwidth during training and adaptively ensemble gradients from hyper-network in different precision for faster convergence and higher generalization ability. Compared with existing quantization methods, experimental results demonstrate that our method significantly reduces the cost of adapting models in new resource budgets with comparable efficiency-accuracy trade-offs.

Abstract:
Deep cross-modal hashing retrieval has recently made significant progress. However, existing methods generally learn hash functions with pairwise or triplet supervisions, which involves learning the relevant information by splicing partial similarity between data pairs; notably, this approach only captures the data similarity locally and incompletely, resulting in sub-optimal retrieval performance. In this paper, we propose a novel Multi-Relational Deep Hashing (MRDH) approach, which can fully bridge the modality gap by comprehensively modeling the similarity relationship between data in different modalities. In more detail, to investigate the inter-modal relationships, we constrain the consistency of cross-modal pairwise similarities to maintain the semantic similarity across modalities. Moreover, to further capture complete similarity information, we design a new similarity metric, which we term cross-modal global similarity, by encouraging hash codes of similar data pairs from different modalities to approach a common center and hash codes for dissimilar pairs to converge to different centers. Adopting this approach enables our model to generate more discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of our method on cross-modal hashing retrieval.

Abstract:
Weakly supervised object detection (WSOD) aims to train detectors using only image-category labels. Current methods typically first generate dense class-agnostic proposals and then select objects based on the classification scores of these proposals. These methods mainly focus on selecting the proposal having high Intersection-over-Union with the true object location, while ignoring the problem of misclassification, which occurs when some proposals exhibit semantic similarities with objects from other categories due to viewing perspective and background interference. We observe that the positive class that is misclassified typically has the following two characteristics: 1) It is usually misclassified as one or a few specific negative classes, and the scores of these negative classes are high; 2) Compared to other negative classes, the score of the positive class is relatively high. Based on these two characteristics, we propose misclassification correction (MCC) and misclassification tolerance (MCT) respectively. In MCC, we establish a misclassification memory bank to record and summarize the class-pairs with high frequencies of potential misclassifications in the early stage of training, that is, cases where the score of a negative class is significantly higher than that of the positive class. In the later stage of training, when such cases occur and correspond to the summarized class-pairs, we select the top-scoring negative class proposal as the positive training example. In MCT, we decrease the loss weights of misclassified classes in the later stage of training to avoid them dominating training and causing misclassification of objects from other classes that are semantically similar to them during inference. Extensive experiments on the PASCAL VOC and MS COCO demonstrate our method can alleviate the problem of misclassification and achieve the state-of-the-art results.

Abstract:
Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small and high-level features extracted by convolutional neural networks are less discriminative to represent small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model’s ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The proposed two modules are plug-and-play, incorporating the proposed modules into existing models can potentially boost their performance in these scenarios.

Abstract:
Unsupervised Domain Adaptation (UDA) is quite challenging due to the large distribution discrepancy between the source domain and the target domain. Inspired by diffusion models which have strong capability to gradually convert data distributions across a large gap, we consider to explore the diffusion technique to handle the challenging UDA task. However, using diffusion models to convert data distribution across different domains is a non-trivial problem as the standard diffusion models generally perform conversion from the Gaussian distribution instead of from a specific domain distribution. Besides, during the conversion, the semantics of the source-domain data needs to be preserved to classify correctly in the target domain. To tackle these problems, we propose a novel Domain-Adaptive Diffusion (DAD) module accompanied by a Mutual Learning Strategy (MLS), which can gradually convert data distribution from the source domain to the target domain while enabling the classification model to learn along the domain transition process. Consequently, our method successfully eases the challenge of UDA by decomposing the large domain gap into small ones and gradually enhancing the capacity of classification model to finally adapt to the target domain. Our method outperforms the current state-of-the-arts by a large margin on three widely used UDA datasets.

Abstract:
Privacy is a crucial concern in collaborative machine vision where a part of a Deep Neural Network (DNN) model runs on the edge, and the rest is executed on the cloud. In such applications, the machine vision model does not need the exact visual content to perform its task. Taking advantage of this potential, private information could be removed from the data insofar as it does not significantly impair the accuracy of the machine vision system. In this paper, we present an autoencoder-style network integrated within an object detection pipeline, which generates a latent representation of the input image that preserves task-relevant information while removing private information. Our approach employs an adversarial training strategy that not only removes private information from the bottleneck of the autoencoder but also promotes improved compression efficiency for feature channels coded by conventional codecs like VVC-Intra. We assess the proposed system using a realistic evaluation framework for privacy, directly measuring face and license plate recognition accuracy. Experimental results show that our proposed method is able to reduce the bitrate significantly at the same object detection accuracy compared to coding the input images directly, while keeping the face and license plate recognition accuracy on the images recovered from the bottleneck features low, implying strong privacy protection. Our code is available at https://github.com/bardia-az/ppa-code.

Abstract:
Defect detection from images is a crucial and challenging topic of industry scenarios due to the scarcity and unpredictability of anomalous samples. However, existing defect detection methods exhibit low detection performance when it comes to small-size defects. In this work, we propose a Cross-Attention Regression Flow (CARF) framework to model a compact distribution of normal visual patterns for separating outliers. To retain rich scale information of defects, we build an interactive cross-attention pattern flow module to jointly transform and align distributions of multi-layer features, which is beneficial for detecting small-size defects that may be annihilated in high-level features. To handle the complexity of multi-layer feature distributions, we introduce a layer-conditional autoregression module to improve the fitting capacity of data likelihoods on multi-layer features. By transforming the multi-layer feature distributions into a latent space, we can better characterize normal visual patterns. Extensive experiments on four public datasets and our collected industrial dataset demonstrate that the proposed CARF outperforms state-of-the-art methods, particularly in detecting small-size defects.

Abstract:
Most few-shot learning methods employ either adaptive approaches or parameter amortization techniques. However, their reliance on pre-trained models presents a significant vulnerability. When an attacker’s trigger activates a hidden backdoor, it may result in the misclassification of images, profoundly affecting the model’s performance. In our research, we explore adaptive defenses against backdoor attacks for few-shot learning. We introduce a specialized stochastic process tailored to task characteristics that safeguards the classification model against attack-induced incorrect feature extraction. This process functions during forward propagation and is thus termed an “in-process defense.” Our method employs an adaptive strategy, effectively generating task-level representations, enabling rapid adaptation to pre-trained models, and proving effective in few-shot classification scenarios for countering backdoor attacks. We apply latent stochastic processes to approximate task distributions and derive task-level representations from the support set. This task-level representation guides feature extraction, leading to backdoor trigger mismatching and forming the foundation of our parameter defense strategy. Benchmark tests on Meta-Dataset reveal that our approach not only withstands backdoor attacks but also shows an improved adaptation in addressing few-shot classification tasks.

Abstract:
Toward desirable saliency prediction, the types and numbers of inputs for a salient object detection (SOD) algorithm may dynamically change in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of inputs, failing to be generalized to other types of inputs. Consequentially, more types of SOD algorithms need to be prepared in advance for handling different types of inputs, raising huge hardware and research costs. Differently, in this paper, we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristics of AM SOD are that the modality types and modality numbers will be arbitrary or dynamically changed. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depths, or even any combination of them. While, the latter indicates that the inputs may have arbitrary modality numbers as the input type is changed, e.g. single-modality RGB image, dual-modality RGB-Depth (RGB-D) images or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e. a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing some modality indicators, which will generate some weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection. Our code and AM-XD dataset will be released on https://github.com/nexiakele/AMSODFirst.

Abstract:
Low-rank tensor completion (LRTC) has shown promise in processing incomplete visual data, yet it often overlooks the inherent local smooth structures in images and videos. Recent advances in LRTC, integrating total variation regularization to capitalize on the local smoothness, have yielded notable improvements. Nonetheless, these methods are limited to exploiting local smoothness within the original data space, neglecting the latent factor space of tensors. More seriously, there is a lack of theoretical backing for the role of local smoothness in enhancing recovery performance. In response, this paper introduces an innovative tensor completion model that concurrently leverages the global low-rank structure of the original tensor and the local smooth structure of its factor tensors. Our objective is to learn a low-rank tensor that decomposes into two factor tensors, each exhibiting sufficient local smoothness. We propose an efficient alternating direction method of multipliers to optimize our model. Further, we establish generalization error bounds for smooth factor-based tensor completion methods across various decomposition frameworks. These bounds are significantly tighter than existing baselines. We conduct extensive inpainting experiments on color images, multispectral images, and videos, which demonstrate the efficacy and superiority of our method. Additionally, our approach shows a low sensitivity to hyper-parameter settings, enhancing its convenience and reliability for practical applications.

Abstract:
Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, with an emphasis on RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our proposed RDVS, highlight the superiority of DCTNet+ over 19 VSOD models and 14 RGB-D SOD models. Additionally, insightful ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth into VSOD. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.

Abstract:
Motion deblurring is a highly ill-posed problem due to the significant loss of motion information in the blurring process. Complementary informative features from auxiliary sensors such as event cameras can be explored for guiding motion deblurring. The event camera can capture rich motion information asynchronously with microsecond accuracy. In this paper, a novel frame-event fusion framework is proposed for event-driven motion deblurring (FEF-Deblur), which can sufficiently explore long-range cross-modal information interactions. Firstly, different modalities are usually complementary and also redundant. Cross-modal fusion is modeled as complementary-unique features separation-and-aggregation, avoiding the modality redundancy. Unique features and complementary features are first inferred with parallel intra-modal self-attention and inter-modal cross-attention respectively. After that, a correlation-based constraint is designed to act between unique and complementary features to facilitate their differentiation, which assists in cross-modal redundancy suppression. Additionally, spatio-temporal dependencies among neighboring inputs are crucial for motion deblurring. A recurrent cross attention is introduced to preserve inter-input attention information, in which the current spatial features and aggregated temporal features are attending to each other by establishing the long-range interaction between them. Extensive experiments on both synthetic and real-world motion deblurring datasets demonstrate our method outperforms state-of-the-art event-based and image/video-based methods.

Abstract:
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.

Abstract:
Although many advanced works have achieved significant progress for face recognition with deep learning and large-scale face datasets, low-quality face recognition remains a challenging problem in real-word applications, especially for unconstrained surveillance scenes. We propose a texture-guided (TG) transfer learning approach under the knowledge distillation scheme to improve low-quality face recognition performance. Unlike existing methods in which distillation loss is built on forward propagation; e.g., the output logits and intermediate features, in this study, the backward propagation gradient texture is used. More specifically, the gradient texture of low-quality images is forced to be aligned to that of its high-quality counterpart to reduce the feature discrepancy between the high- and low-quality images. Moreover, attention is introduced to derive a soft-attention (SA) version of transfer learning, termed as SA-TG, to focus on informative regions. Experiments on the benchmark low-quality face DB’s TinyFace and QMUL-SurFace confirmed the superiority of the proposed method, especially more than 6.6% Rank1 accuracy improvement is achieved on TinyFace.

Abstract:
This paper presents a novel method for supervised multi-view representation learning, which projects multiple views into a latent common space while preserving the discrimination and intrinsic structure of each view. Specifically, an apriori discriminant similarity graph is first constructed based on labels and pairwise relationships of multi-view inputs. Then, view-specific networks progressively map inputs to common representations whose affinity approximates the constructed graph. To achieve graph consistency, discrimination, and cross-view invariance, the similarity graph is enforced to meet the following constraints: 1) pairwise relationship should be consistent between the input space and common space for each view; 2) within-class similarity is larger than any between-class similarity for each view; 3) the inter-view samples from the same (or different) classes are mutually similar (or dissimilar). Consequently, the intrinsic structure and discrimination are preserved in the latent common space using an apriori approximation schema. Moreover, we present a sampling strategy to approach a sub-graph sampled from the whole similarity structure instead of approximating the graph of the whole dataset explicitly, thus benefiting lower space complexity and the capability of handling large-scale multi-view datasets. Extensive experiments show the promising performance of our method on five datasets by comparing it with 18 state-of-the-art methods.

Abstract:
Image retouching, aiming to regenerate the visually pleasing renditions of given images, is a subjective task where the users are with different aesthetic sensations. Most existing methods adopt a deterministic model to learn the retouching style from a specific expert, making it less flexible to meet diverse subjective preferences. Besides, the intrinsic diversity of an expert due to the targeted processing of different images is also deficiently described. To circumvent such issues, we propose to learn diverse image retouching with normalizing flow-based architectures. Unlike current flow-based methods which directly generate the output image, we argue that learning in a one-dimensional style space could 1) disentangle the retouching styles from the image content, 2) lead to a stable style presentation form, and 3) avoid the spatial disharmony effects. For obtaining meaningful image tone style representations, a joint-training pipeline is delicately designed, which is composed of a style encoder, a conditional RetouchNet, and the image tone style normalizing flow (TSFlow) module. In particular, the style encoder predicts the target style representation of an input image, which serves as the conditional information in the RetouchNet for retouching, while the TSFlow maps the style representation vector into a Gaussian distribution in the forward pass. After training, the TSFlow can generate diverse image tone style vectors by sampling from the Gaussian distribution. Extensive experiments on MIT-Adobe FiveK and PPR10K datasets show that our proposed method performs favorably against state-of-the-art methods and is effective in generating diverse results to satisfy different human aesthetic preferences. Source codeterministic and pre-trained models are publicly available at https://github.com/SSRHeart/TSFlow.

Abstract:
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning. However, due to limited satellite coverage or communication disruptions, UAVs may lose signals for positioning. In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs. However, most of the existing datasets are developed for the geo-localization task of the objects captured by UAVs, rather than UAV self-positioning. Furthermore, the existing UAV datasets apply discrete sampling to synthetic data, such as Google Maps, neglecting the crucial aspects of dense sampling and the uncertainties commonly experienced in practical scenarios. To address these issues, this paper presents a new dataset, DenseUAV, that is the first publicly available dataset tailored for the UAV self-positioning task. DenseUAV adopts dense sampling on UAV images obtained in low-altitude urban areas. In total, over 27K UAV- and satellite-view images of 14 university campuses are collected and annotated. In terms of methodology, we first verify the superiority of Transformers over CNNs for the proposed task. Then we incorporate metric learning into representation learning to enhance the model’s discriminative capacity and to reduce the modality discrepancy. Besides, to facilitate joint learning from both the satellite and UAV views, we introduce a mutually supervised learning approach. Last, we enhance the Recall@K metric and introduce a new measurement, SDM@K, to evaluate both the retrieval and localization performance for the proposed task. As a result, the proposed baseline method achieves a remarkable Recall@1 score of 83.01% and an SDM@1 score of 86.50% on DenseUAV. The dataset and code have been made publicly available on https://github.com/Dmmm1997/DenseUAV.

Abstract:
Most researchers focus on designing accurate crowd counting models with heavy parameters and computations but ignore the resource burden during the model deployment. A real-world scenario demands an efficient counting model with low-latency and high-performance. Knowledge distillation provides an elegant way to transfer knowledge from a complicated teacher model to a compact student model while maintaining accuracy. However, the student model receives the wrong guidance with the supervision of the teacher model due to the inaccurate information understood by the teacher in some cases. In this paper, we propose a dual-knowledge distillation (DKD) framework, which aims to reduce the side effects of the teacher model and transfer hierarchical knowledge to obtain a more efficient counting model. First, the student model is initialized with global information transferred by the teacher model via adaptive perspectives. Then, the self-knowledge distillation forces the student model to learn the knowledge by itself, based on intermediate feature maps and target map. Specifically, the optimal transport distance is utilized to measure the difference of feature maps between the teacher and the student to perform the distribution alignment of the counting area. Extensive experiments are conducted on four challenging datasets, demonstrating the superiority of DKD. When there are only approximately 6% of the parameters and computations from the original models, the student model achieves a faster and more accurate counting performance as the teacher model even surpasses it.

Abstract:
Class-Incremental Unsupervised Domain Adaptation (CI-UDA) requires the model can continually learn several steps containing unlabeled target domain samples, while the source-labeled dataset is available all the time. The key to tackling CI-UDA problem is to transfer domain-invariant knowledge from the source domain to the target domain, and preserve the knowledge of the previous steps in the continual adaptation process. However, existing methods introduce much biased source knowledge for the current step, causing negative transfer and unsatisfying performance. To tackle these problems, we propose a novel CI-UDA method named Pseudo-Label Distillation Continual Adaptation (PLDCA). We design Pseudo-Label Distillation module to leverage the discriminative information of the target domain to filter the biased knowledge at the class- and instance-level. In addition, Contrastive Alignment is proposed to reduce domain discrepancy by aligning the class-level feature representation of the confident target samples and the source domain, and exploit the robust feature representation of the unconfident target samples at the instance-level. Extensive experiments demonstrate the effectiveness and superiority of PLDCA. Code is available at code.

Abstract:
A simple yet effective semi-supervised method is proposed in this paper based on consistency regularization for crowd counting, and a hybrid perturbation strategy is used to generate strong, diverse perturbations, and enhance unlabeled images information mining. The conventional CNN-based counting methods are sensitive to texture perturbation and imperceptible noises raised by adversarial attack, therefore, the hybrid strategy is proposed to combine a spatial texture transformation and an adversarial perturbation module to perturb the unlabeled data in the semantic and non-semantic spaces, respectively. Moreover, a cross-distribution normalization technique is introduced to address the model optimization failure caused by BN layer in the strong perturbation, and to stabilize the optimization of the learning model. Extensive experiments have been conducted on the datasets of ShanghaiTech, UCF-QNRF, NWPU-Crowd, and JHU-Crowd++. The results demonstrate that the proposed semi-supervised counting method performs better over the state-of-the-art methods, and it shows better robustness to various perturbations.

Abstract:
To manipulate large-scale data, anchor-based multi-view clustering methods have grown in popularity owing to their linear complexity in terms of the number of samples. However, these existing approaches pay less attention to two aspects. 1) They target at learning a shared affinity matrix by using the local information from every single view, yet ignoring the global information from all views, which may weaken the ability to capture complementary information. 2) They do not consider the removal of feature redundancy, which may affect the ability to depict the real sample relationships. To this end, we propose a novel fast multi-view clustering method via pick-and-place transform learning named PPTL, which could capture insightful global features to characterize the sample relationships quickly. Specifically, PPTL first concatenates all the views along the feature direction to produce a global matrix. Considering the redundancy of the global matrix, we design a pick-and-place transform with \ell _2,p -norm regularization to abandon the poor features and consequently construct a compact global representation matrix. Thus, by conducting anchor-based subspace clustering on the compact global representation matrix, PPTL can learn a consensus skinny affinity matrix with a discriminative clustering structure. Numerous experiments performed on small-scale to large-scale datasets demonstrate that our method is not only faster but also achieves superior clustering performance over state-of-the-art methods across a majority of the datasets.

Abstract:
The mainstream of image and sentence matching studies currently focuses on fine-grained alignment of image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not simply come from alignments between individual regions and words but from alignments between the phrases they form respectively. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching by modeling the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled manner for training and inferencing, which is able to release the trade-off for bi-directional retrieval, where image-to-sentence matching is executed in textual semantic space and sentence-to-image matching is executed in visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin, and can compete with some methods introducing external knowledge.

Abstract:
Prompt learning stands out as one of the most efficient approaches for adapting powerful vision-language foundational models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, despite its success in achieving remarkable performance on in-domain data, prompt learning still faces the significant challenge of effectively generalizing to novel classes and domains. Some existing methods address this concern by dynamically generating distinct prompts for different domains. Yet, they overlook the inherent potential of prompts to generalize across unseen domains. To address these limitations, our study introduces an innovative prompt learning paradigm, called MetaPrompt, aiming to directly learn domain invariant prompt in few-shot scenarios. To facilitate learning prompts for image and text inputs independently, we present a dual-modality prompt tuning network comprising two pairs of coupled encoders. Our study centers on an alternate episodic training algorithm to enrich the generalization capacity of the learned prompts. In contrast to traditional episodic training algorithms, our approach incorporates both in-domain updates and domain-split updates in a batch-wise manner. For in-domain updates, we introduce a novel asymmetric contrastive learning paradigm, where representations from the pre-trained encoder assume supervision to regularize prompts from the prompted encoder. To enhance performance on out-of-domain distribution, we propose a domain-split optimization on visual prompts for cross-domain tasks or textual prompts for cross-class tasks during domain-split updates. Extensive experiments across 11 datasets for base-to-new generalization and 4 datasets for domain generalization exhibit favorable performance. Compared with the state-of-the-art method, MetaPrompt achieves an absolute gain of 1.02% on the overall harmonic mean in base-to-new generalization and consistently demonstrates superiority over all benchmarks in domain generalization.

Abstract:
Gait recognition aims at identifying the pedestrians at a long distance by their biometric gait patterns. It is inherently challenging due to the various covariates and the properties of silhouettes (textureless and colorless), which result in two kinds of pair-wise hard samples: the same pedestrian could have distinct silhouettes (intra-class diversity) and different pedestrians could have similar silhouettes (inter-class similarity). In this work, we propose to solve the hard sample issue with a Memory-augmented Progressive Learning network (GaitMPL), including Dynamic Reweighting Progressive Learning module (DRPL) and Global Structure-Aligned Memory bank (GSAM). Specifically, DRPL reduces the learning difficulty of hard samples by easy-to-hard progressive learning. GSAM further augments DRPL with a structure-aligned memory mechanism, which maintains and models the feature distribution of each ID. Experiments on two commonly used datasets, CASIA-B and OU-MVLP, demonstrate the effectiveness of GaitMPL. On CASIA-B, we achieve the state-of-the-art performance, i.e., 88.0% on the most challenging condition (Clothing) and 93.3% on the average condition, which outperforms the other methods by at least 3.8% and 1.4%, respectively. Code will be available at https://github.com/WhiteDOU/GaitMPL https://github.com/WhiteDOU/GaitMPL

Abstract:
Face aging tasks aim to simulate changes in the appearance of faces over time. However, due to the lack of data on different ages under the same identity, existing models are commonly trained using mapping between age groups. This makes it difficult for most existing aging methods to accurately capture the correspondence between individual identities and aging features, leading to generating faces that do not match the real aging appearance. In this paper, we re-annotate the CACD2000 dataset and propose a consensus-agent deep reinforcement learning method to solve the aforementioned problem. Specifically, we define two agents, the aging process agent and the aging personalization agent, and model the task of matching aging features as a Markov decision process. The aging process agent simulates the aging process of an individual, while the aging personalization agent calculates the difference between the aging appearance of an individual and the average aging appearance. The two agents iteratively adjust the matching degree between the target aging feature and the current identity through a form of synergistic cooperation. Extensive experimental results on four face aging datasets show that our model achieves convincing performance compared to the current state-of-the-art methods.

Abstract:
Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach uses hypernetworks to generate per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We also employ a curriculum learning strategy to train the network more robustly. Our comprehensive experimental evaluations across various benchmark datasets reveal that HyperE2VID not only surpasses current state-of-the-art methods in terms of reconstruction quality but also achieves this with fewer parameters, reduced computational requirements, and accelerated inference times.

Abstract:
Mirror detection is a challenging task since mirrors do not possess a consistent visual appearance. Even the Segment Anything Model (SAM), which boasts superior zero-shot performance, cannot accurately detect the position of mirrors. Existing methods determine the position of the mirror under hypothetical conditions, such as the correspondence between objects inside and outside the mirror, and the semantic association between the mirror and surrounding objects. However, these assumptions do not apply to all scenarios. For instance, there may be no corresponding real objects to the reflected objects in the scene, or it may be challenging to extract meaningful semantic associations in complex scenes. On the other hand, humans can easily recognize mirrors through the specular texture caused by materials. To mine mirror features in more general scenes, we propose a Cross-Space-Frequency Window Transformer (CSFwinformer) to extract spatial and frequency features for texture analysis. Specifically, we design a Spatial-Frequency Window Alignment module (SFWA) to calculate spatial-frequency feature affinities and learn the difference between mirror and non-mirror textures. We then propose a Dilated Window Attention (DWA) to extract global features to complement the limitation of window alignment. Besides, we propose a Cross-Modality Context Contrast module (CMCC) to fuse cross-modality features and global features, which enables information flow between different windows to take full advantage of cross-modality information. Extensive experiments show that our method performs favorably against state-of-the-art methods on three mirror detection benchmarks and significantly improved SAM performance on mirror detection. The code is available at https://github.com/wangsen99/CSFwinformer.

Abstract:
Spline functions have received widespread attention in the fields of image sampling and reconstruction. To enhance the performance of splines in reconstruction and reduce the computational burden of solving large linear equations, we propose a family of generalized cardinal polishing splines (GCP-splines) and provide a system of linear equations to obtain the expressions of GCP-splines. First, we propose a cardinal polishing spline basis function with high-precision. Then, we propose a class of GCP-splines and give a general theory of GCP-splines. To calculate the expressions of GCP-splines, we adopt a system of linear equations to obtain the time shifts operator and the convolutional coefficients based on the search spacing and number of terms. Finally, we propose continuous and discrete interpolation models based on GCP-splines, and demonstrate several valuable properties, such as order of approximation and the Riesz basis. To evaluate the performance of GCP-splines, we conduct several experiments on test images from different modalities. The experimental results demonstrate that the GCP-splines for image interpolation and image denoising have better performance and outperform other methods.

Abstract:
Zero-shot learning (ZSL) recognizes unseen images by sharing semantic knowledge transferred from seen images, encouraging the investigation of associations between semantic and visual information. Prior works have been devoted to the alignment of global visual features with semantic information, i.e., attribute vectors, or further mining the local part regions related to each attribute and then simply concatenating them for category decisions. Although effective, these works ignore intrinsic interactions between local parts and the whole object, which enables a more discriminative and representative knowledge transfer for ZSL. In this paper, we propose a Part-Object Progressive Refinement Network (POPRNet), where discriminative and transferable semantics are progressively refined by the cooperation between parts and the whole object. Specifically, POPRNet incorporates discriminative part semantics and object-centric semantics guided by semantic intensity to improve cross-domain transferability. To achieve part-object learning, a semantic-augment transformer (SaT) is proposed to model the part-object relation at the part-level via an encoder and at the object-level via a decoder, generating a comprehensive semantic representation to boost discriminability and transferability. By introducing the prototype updating module embedded with the prototype selection layers, the discriminative ability of the updated category prototype is enhanced to further improve the recognition performance of ZSL. Extensive experiments are conducted to demonstrate the superiority and competitiveness of our proposed POPRNet method on three public benchmark datasets. The code is available at https://github.com/ManLiuCoder/POPRNet.

Abstract:
3D shape segmentation is a fundamental and crucial task in the field of image processing and 3D shape analysis. To segment 3D shapes using data-driven methods, a fully labeled dataset is usually required. However, obtaining such a dataset can be a daunting task, as manual face-level labeling is both time-consuming and labor-intensive. In this paper, we present a semi-supervised framework for 3D shape segmentation that uses a small, fully labeled set of 3D shapes, as well as a weakly labeled set of 3D shapes with sparse scribble labels. Our framework first employs an auxiliary network to generate initial fully labeled segmentation labels for the sparsely labeled dataset, which helps in training the primary network. During training, the self-refine module uses increasingly accurate predictions of the primary network to improve the labels generated by the auxiliary network. Our proposed method achieves better segmentation performance than previous semi-supervised methods, as demonstrated by extensive benchmark tests, while also performing comparably to supervised methods.

Abstract:
Existing approaches towards anomaly detection (AD) often rely on a substantial amount of anomaly-free data to train representation and density models. However, large anomaly-free datasets may not always be available before the inference stage; in which case an anomaly detection model must be trained with only a handful of normal samples, a.k.a. few-shot anomaly detection (FSAD). In this paper, we propose a novel methodology to address the challenge of FSAD which incorporates two important techniques. Firstly, we employ a model pre-trained on a large source dataset to initialize model weights. Secondly, to ameliorate the covariate shift between source and target domains, we adopt contrastive training to fine-tune on the few-shot target domain data. To learn suitable representations for the downstream AD task, we additionally incorporate cross-instance positive pairs to encourage a tight cluster of the normal samples, and negative pairs for better separation between normal and synthesized negative samples. We evaluate few-shot anomaly detection on 3 controlled AD tasks and 4 real-world AD tasks to demonstrate the effectiveness of the proposed method.

Abstract:
Low-rank tensor representation with the tensor nuclear norm has been rising in popularity in multi-view subspace clustering (MVSC), in which the tensor nuclear norm is commonly implemented using discrete Fourier transform (DFT). Unfortunately, existing DFT-oriented MVSC methods may provide unsatisfactory results since (1) DFT exploits complex arithmetic in the Fourier domain, usually resulting in high tubal tensor rank, and (2) local structural information is rarely considered. To solve these problems, in this paper, we propose a novel double discrete cosine transform (DCT)-oriented multi-view subspace clustering (D2CTMSC) method, in which the first DCT aims to derive the tensor nuclear norm without complex arithmetic while the second DCT aims to explore the local structure of the self-representation tensor, such that the essential low-rankness and sparsity embedding in multi-view features can be thoroughly exploited. Moreover, we design an effective alternating iteration strategy to solve the proposed model. Experimental results on four types of multi-view datasets (News stories, Face images, Scene images, and Generic objects) demonstrate the superiority of the D2CTMSC method compared with DFT-based methods and other state-of-the-art clustering methods.

Abstract:
Recent long-tailed classification methods generally adopt the two-stage pipeline and focus on learning the classifier to tackle the imbalanced data in the second stage via re-sampling or re-weighting, but the classifier is easily prone to overconfidence in head classes. Data augmentation is a natural way to tackle this issue. Existing augmentation methods either perform low-level transformations or apply the same semantic transformation for all instances. However, meaningful augmentations for different instances should be different. In this paper, we propose feature-level augmentation (FLA) and pixel-level augmentation (PLA) learning methods for long-tailed image classification. In the first stage, the feature space is learned from the original imbalanced data. In the second stage, we model the semantic within-class transformation range for each instance by a specific Gaussian distribution and design a semantic transformation generator (STG) to predict the distribution from the instance itself. We train STG by constructing ground-truth distributions for instances of head classes in the feature space. In the third stage, for FLA, we generate instance-specific transformations by STG to obtain feature augmentations of tail classes for fine-tuning the classifier. For PLA, we use STG to guide pixel-level augmentations for fine-tuning the backbone. The proposed augmentation strategy can be combined with different existing long-tail classification methods. Extensive experiments on five imbalanced datasets show the effectiveness of our method.

Abstract:
Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle with dealing with diverse cross-modal attack circumstances and manipulating imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at https://github.com/xandery-geek/BadCM.

Abstract:
Unsupervised object discovery (UOD) refers to the task of discriminating the whole region of objects from the background within a scene without relying on labeled datasets, which benefits the task of bounding-box-level localization and pixel-level segmentation. This task is promising due to its ability to discover objects in a generic manner. We roughly categorize existing techniques into two main directions, namely the generative solutions based on image resynthesis, and the clustering methods based on self-supervised models. We have observed that the former heavily relies on the quality of image reconstruction, while the latter shows limitations in effectively modeling semantic correlations. To directly target at object discovery, we focus on the latter approach and propose a novel solution by incorporating weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images, which is achieved by fine-tuning the feature encoder of a self-supervised model, namely DINO, via WCL. Subsequently, we introduce Principal Component Analysis (PCA) to localize object regions. The principal projection direction, corresponding to the maximal eigenvalue, serves as an indicator of the object region(s). Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of our proposed solution. The source code and experimental results are publicly available via our project page at https://github.com/npucvr/WSCUOD.git.

Abstract:
Modern detectors commonly employ classification scores to reflect the localization quality of detection results. However, there exists an inconsistency between them, misguiding the selection of high-quality predictions and providing unreliable results for downstream applications. In this paper, we find that the root of this confidence inconsistency lies in the inaccurate IoU estimation and the spatial misalignment of the learned features between the classification and localization tasks. Therefore, a Confidence-Consistent Detector (CCDet) which includes the Distribution-based IoU Prediction (DIP) and Consistency-aware label assignment (CLA), is proposed. DIP provides more stable and accurate IoU estimation by learning the probability distribution over the IoU range and employing the expectation as the predicted IoU. CLA adopts both the prediction performance and consistency degree of samples as assignment metrics to select positives, which guides the classification and localization tasks to promote similar feature distribution. Comprehensive experiments demonstrate that CCDet can effectively mitigate the confidence inconsistency between classification and localization, and achieve stable improvement across different baselines. On the test-dev set of MS COCO, CCDet acquires a single-model single-scale AP of 50.1%, surpassing most of the existing object detectors.

Abstract:
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.

Abstract:
Existing image restoration models are typically designed for specific tasks and struggle to generalize to out-of-sample degradations not encountered during training. While zero-shot methods can address this limitation by fine-tuning model parameters on testing samples, their effectiveness relies on predefined natural priors and physical models of specific degradations. Nevertheless, determining out-of-sample degradations faced in real-world scenarios is always impractical. As a result, it is more desirable to train restoration models with inherent generalization ability. To this end, this work introduces the Out-of-Sample Restoration (OSR) task, which aims to develop restoration models capable of handling out-of-sample degradations. An intuitive solution involves pre-translating out-of-sample degradations to known degradations of restoration models. However, directly translating them in the image space could lead to complex image translation issues. To address this issue, we propose a model reprogramming framework, which translates out-of-sample degradations by quantum mechanic and wave functions. Specifically, input images are decoupled as wave functions of amplitude and phase terms. The translation of out-of-sample degradation is performed by adapting the phase term. Meanwhile, the image content is maintained and enhanced in the amplitude term. By taking these two terms as inputs, restoration models are able to handle out-of-sample degradations without fine-tuning. Through extensive experiments across multiple evaluation cases, we demonstrate the effectiveness and flexibility of our proposed framework. Our codes are available at https://github.com/ddghjikle/Out-of-sample-restoration.

Abstract:
Single image super-resolution (SISR) aims to reconstruct a high-resolution image from its low-resolution observation. Recent deep learning-based SISR models show high performance at the expense of increased computational costs, limiting their use in resource-constrained environments. As a promising solution for computationally efficient network design, network quantization has been extensively studied. However, existing quantization methods developed for SISR have yet to effectively exploit image self-similarity, which is a new direction for exploration in this study. We introduce a novel method called reference-based quantization for image super-resolution (RefQSR) that applies high-bit quantization to several representative patches and uses them as references for low-bit quantization of the rest of the patches in an image. To this end, we design dedicated patch clustering and reference-based quantization modules and integrate them into existing SISR network quantization methods. The experimental results demonstrate the effectiveness of RefQSR on various SISR networks and quantization methods.

Abstract:
Multi-view clustering (MVC) has attracted broad attention due to its capacity to exploit consistent and complementary information across views. This paper focuses on a challenging issue in MVC called the incomplete continual data problem (ICDP). Specifically, most existing algorithms assume that views are available in advance and overlook the scenarios where data observations of views are accumulated over time. Due to privacy considerations or memory limitations, previous views cannot be stored in these situations. Some works have proposed ways to handle this problem, but all of them fail to address incomplete views. Such an incomplete continual data problem (ICDP) in MVC is difficult to solve since incomplete information with continual data increases the difficulty of extracting consistent and complementary knowledge among views. We propose Fast Continual Multi-View Clustering with Incomplete Views (FCMVC-IV) to address this issue. Specifically, the method maintains a scalable consensus coefficient matrix and updates its knowledge with the incoming incomplete view rather than storing and recomputing all the data matrices. Considering that the given views are incomplete, the newly collected view might contain samples that have yet to appear; two indicator matrices and a rotation matrix are developed to match matrices with different dimensions. In addition, we design a three-step iterative algorithm to solve the resultant problem with linear complexity and proven convergence. Comprehensive experiments conducted on various datasets demonstrate the superiority of FCMVC-IV over the competing approaches. The code is publicly available at https://github.com/wanxinhang/FCMVC-IV.

Abstract:
Neural Architecture Search (NAS) has emerged as a promising tool in the field of AutoML for designing more accurate and efficient architectures. The majority of NAS works employ a weight-sharing technique to reduce the search cost by sharing the weights of a supernet, which is a composite of all architectures produced from the search space. Nonetheless, this method has a significant drawback in that negative interference may arise when candidate architectures share the same weights. This issue becomes even more severe in multi-task searches, where a supernet is shared across tasks. To address this problem, we propose a task-aware nested search for multiple tasks that generates task-specific search spaces and architectures using a search-in-search approach consisting of space-search and architecture-search phases. In the space-search phase, we discover an optimal subspace in a task-aware manner by utilizing the proposed search space generator based on the global search space. On top of each subspace, we search for a promising architecture in the architecture-search phase. This method can mitigate search interference by adaptively sharing weights of the supernet by the generated subspace. The experimental results on various vision benchmarks (CityScapes, NYUv2, and Tiny-Taskonomy) show that the proposed method achieves outstanding performance over existing methods in terms of task accuracy, model parameters, and latency.

Abstract:
Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroy the discriminative visual cues residing in the subtle regions. In this paper, we propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance on several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. Source code is available at https://github.com/LeapLabTHU/LearnableISDA.

Abstract:
Accurately detecting the lanes plays a significant role in various autonomous and assistant driving scenarios. It is a highly structured task as lanes in the 3D world are continuous and parallel to each other. While most existing methods focus on how to inject structural priors into the representation of each lane, we propose a StructLane method to further leverage the structural relations among lanes for more accurate and robust lane detection. To achieve this, we explicitly encode the structural relations using a set of relational templates in a learned structural space. We then employ the attention mechanism to enable interactions between templates and image features to incorporate structural relational priors. Our StructLane can be applied to existing lane detection methods as a plug-and-play module to improve their performance. Extensive experiments on the widely used CULane, TuSimple, and LLAMAS datasets demonstrate that StructLane consistently improves the performance of state-of-the-art models across all datasets and backbones. Visualization results also demonstrate the robustness of our StructLane compared with existing methods due to the leverage of structural relations. Codes will be released at https://github.com/lqzhao/StructLane.

Abstract:
Incomplete multiview clustering (IMVC) aims to reveal the underlying structure of incomplete multiview data by partitioning data samples into clusters. Several graph-based methods exhibit a strong ability to explore high-order information among multiple views using low-rank tensor learning. However, spectral embedding fusion of multiple views is ignored in low-rank tensor learning. In addition, addressing missing instances or features is still an intractable problem for most existing IMVC methods. In this paper, we present a unified spectral embedding tensor learning (USETL) framework that integrates the spectral embedding fusion of multiple similarity graphs and spectral embedding tensor learning for IMVC. To remove redundant information from the original incomplete multiview data, spectral embedding fusion is performed by introducing spectral rotations at two different data levels, i.e., the spectral embedding feature level and the clustering indicator level. The aim of introducing spectral embedding tensor learning is to capture consistent and complementary information by seeking high-order correlations among multiple views. The strategy of removing missing instances is adopted to construct multiple similarity graphs for incomplete multiple views. Consequently, this strategy provides an intuitive and feasible way to construct multiple similarity graphs. Extensive experimental results on multiview datasets demonstrate the effectiveness of the two spectral embedding fusion methods within the USETL framework.

Abstract:
Real-world data often follows a long-tailed distribution, where a few head classes occupy most of the data and a large number of tail classes only contain very limited samples. In practice, deep models often show poor generalization performance on tail classes due to the imbalanced distribution. To tackle this, data augmentation has become an effective way by synthesizing new samples for tail classes. Among them, one popular way is to use CutMix that explicitly mixups the images of tail classes and the others, while constructing the labels according to the ratio of areas cropped from two images. However, the area-based labels entirely ignore the inherent semantic information of the augmented samples, often leading to misleading training signals. To address this issue, we propose a Contrastive CutMix (ConCutMix) that constructs augmented samples with semantically consistent labels to boost the performance of long-tailed recognition. Specifically, we compute the similarities between samples in the semantic space learned by contrastive learning, and use them to rectify the area-based labels. Experiments show that our ConCutMix significantly improves the accuracy on tail classes as well as the overall performance. For example, based on ResNeXt-50, we improve the overall accuracy on ImageNet-LT by 3.0% thanks to the significant improvement of 3.3% on tail classes. We highlight that the improvement also generalizes well to other benchmarks and models. Our code and pretrained models are available at https://github.com/PanHaulin/ConCutMix.

Affiliations: Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education and Jiangsu Key Laboratory of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, China; Institute for Infocomm Research (I²R) and the Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore; Centre for Frontier A Research (CFAR), Institute of High Performance Computing (IHPC), Agency for Science Technology and Research (A*STAR), Fusionopolis, Singapore

Abstract:
Data-free knowledge distillation aims to learn a small student network from a large pre-trained teacher network without the aid of original training data. Recent works propose to gather alternative data from the Internet for training student network. In a more realistic scenario, the data on the Internet contains two types of label noise, namely: 1) closed-set label noise, where some examples belong to the known categories but are mislabeled; and 2) open-set label noise, where the true labels of some mislabeled examples are outside the known categories. However, the latter is largely ignored by existing works, leading to limited student network performance. Therefore, this paper proposes a novel data-free knowledge distillation paradigm by utilizing a webly-collected dataset under universal label noise, which means both closed-set and open-set label noise should be tackled. Specifically, we first split the collected noisy dataset into clean set, closed noisy set, and open noisy set based on the prediction uncertainty of various data types. For the closed-set noisy examples, their labels are refined by teacher network. Meanwhile, a noise-robust hybrid contrastive learning is performed on the clean set and refined closed noisy set to encourage student network to learn the categorical and instance knowledge inherited by teacher network. For the open-set noisy examples unexplored by previous work, we regard them as unlabeled and conduct self-supervised learning on them to enrich the supervision signal for student network. Intensive experimental results on image classification tasks demonstrate that our approach can achieve superior performance to state-of-the-art data-free knowledge distillation methods.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) aims at incrementally learning new knowledge from limited training examples without forgetting previous knowledge. However, we observe that existing methods face a challenge known as supervision collapse, where the model disproportionately emphasizes class-specific features of base classes at the detriment of novel class representations, leading to restricted cognitive capabilities. To alleviate this issue, we propose a new framework, Model aTtention Expansion for Few-Shot Class-Incremental Learning (MTE-FSCIL), aimed at expanding the model attention fields to improve transferability without compromising the discriminative capability for base classes. Specifically, the framework adopts a dual-stage training strategy, comprising pre-training and meta-training stages. In the pre-training stage, we present a new regularization technique, named the Reserver (RS) loss, to expand the global perception and reduce over-reliance on class-specific features by amplifying feature map activations. During the meta-training stage, we introduce the Repeller (RP) loss, a novel pair-based loss that promotes variation in representations and improves the model’s recognition of sample uniqueness by scattering intra-class samples within the embedding space. Furthermore, we propose a Transformational Adaptation (TA) strategy to enable continuous incorporation of new knowledge from downstream tasks, thus facilitating cross-task knowledge transfer. Extensive experimental results on mini-ImageNet, CIFAR100, and CUB200 datasets demonstrate that our proposed framework consistently outperforms the state-of-the-art methods.

Abstract:
High perceptual quality and low distortion degree are two important goals in image restoration tasks such as super-resolution (SR). Most of the existing SR methods aim to achieve these goals by minimizing the corresponding yet conflicting losses, such as the \ell _1 loss and the adversarial loss. Unfortunately, the commonly used gradient-based optimizers, such as Adam, are hard to balance these objectives due to the opposite gradient decent directions of the contradictory losses. In this paper, we formulate the perception-distortion trade-off in SR as a multi-objective optimization problem and develop a new optimizer by integrating the gradient-free evolutionary algorithm (EA) with gradient-based Adam, where EA and Adam focus on the divergence and convergence of the optimization directions respectively. As a result, a population of optimal models with different perception-distortion preferences is obtained. We then design a fusion network to merge these models into a single stronger one for an effective perception-distortion trade-off. Experiments demonstrate that with the same backbone network, the perception-distortion balanced SR model trained by our method can achieve better perceptual quality than its competitors while attaining better reconstruction fidelity. Codes and models can be found at https://github.com/csslc/EA-Adam.

Abstract:
Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ CDI, have shown promise for low-dose imaging, but they rely on a time series of measurements, making them unsuitable for single-image applications. Similarly, data-driven phase retrieval techniques are not easily adaptable to data-scarce situations. Zero-shot deep learning methods based on pre-trained and implicit generative priors have been effective in various imaging tasks but have shown limited success in PR. In this work, we propose low-dose deep image prior (LoDIP), which combines in-situ CDI with the power of implicit generative priors to address single-image low-dose phase retrieval. Quantitative evaluations demonstrate LoDIP’s superior performance in this task and its applicability to real experimental scenarios.

Abstract:
In this paper, we propose an effective plug-and-play module called structural relation network (SRN) to model structural dependencies in 3D point clouds for feature representation. Existing network architectures such as PointNet++ and RS-CNN capture local structures individually and ignore the inner interactions between different sub-clouds. Motivated by the fact that structural relation modeling plays critical roles for humans to understand 3D objects, our SRN exploits local information by modeling structural relations in 3D spaces. For a given sub-cloud of point sets, SRN firstly extracts its geometrical and locational relations with the other sub-clouds and maps them into the embedding space, then aggregates both relational features with the other sub-clouds. As the variation of semantics embedded in different sub-clouds is ignored by SRN, we further extend SRN to enable dynamic message passing between different sub-clouds. We propose a graph-based structural relation network (GSRN) where sub-clouds and their pairwise relations are modeled as nodes and edges respectively, so that the node features are updated by the messages along the edges. Since the node features might not be well preserved when acquiring the global representation, we propose a Combined Entropy Readout (CER) function to adaptively aggregate them into the holistic representation, so that GSRN simultaneously models the local-local and local-global region-wise interaction. The proposed SRN and GSRN modules are simple, interpretable, and do not require any additional supervision signals, which can be easily equipped with the existing networks. Experimental results on the benchmark datasets (ScanObjectNN, ModelNet40, ShapeNet Part, S3DIS, ScanNet and SUN-RGBD) indicate promising boosts on the tasks of 3D point cloud classification, segmentation and object detection.

Abstract:
Image restoration (IR) via deep learning has been vigorously studied in recent years. However, due to the ill-posed nature of the problem, it is challenging to recover the high-quality image details from a single distorted input especially when images are corrupted by multiple distortions. In this paper, we propose a multi-stage IR approach for progressive restoration of multi-degraded images via transferring similar edges/textures from the reference image. Our method, called a Reference-based Image Restoration Transformer (Ref-IRT), operates via three main stages. In the first stage, a cascaded U-Transformer network is employed to perform the preliminary recovery of the image. The proposed network consists of two U-Transformer architectures connected by feature fusion of the encoders and decoders, and the residual image is estimated by each U-Transformer in an easy-to-hard and coarse-to-fine fashion to gradually recover the high-quality image. The second and third stages perform texture transfer from a reference image to the preliminarily-recovered target image to further enhance the restoration performance. To this end, a quality-degradation-restoration method is proposed for more accurate content/texture matching between the reference and target images, and a texture transfer/reconstruction network is employed to map the transferred features to the high-quality image. Experimental results tested on three benchmark datasets demonstrate the effectiveness of our model as compared with other state-of-the-art multi-degraded IR methods. Our code and dataset are available at https://vinelab.jp/refmdir/.

Abstract:
The potential vulnerability of deep neural networks and the complexity of pedestrian images, greatly limits the application of person re-identification techniques in the field of smart security. Current attack methods often focus on generating carefully crafted adversarial samples or only disrupting the metric distances between targets and similar pedestrians. However, both aspects are crucial for evaluating the security of methods adapted for person re-identification tasks. For this reason, we propose an image-level adaptive adversarial ranking method that comprehensively considers two aspects to adapt to changes in pedestrians in the real world and effectively evaluate the robustness of models in adversarial environments. To generate more refined adversarial samples, our image representation enhancement module leverages channel-wise information entropy, assigning varying weights to different channels to produce images with richer information content, along with a generative adversarial network to create adversarial samples. Subsequently, for adaptive perturbation of ranking, the adaptive weight confusion ranking loss is presented to calculate the weights of distances between positive or negative samples and query samples. It endeavors to push positive samples away from query samples and bring negative samples closer, thereby interfering with the ranking of system. Notably, this method requires no additional hyperparameter tuning or extra data training, making it an adaptive attack strategy. Experimental results on large-scale datasets such as Market1501, CUHK03, and DukeMTMC demonstrate the effectiveness of our method in attacking ReID systems.

Abstract:
Symmetric Positive Definite (SPD) matrices have received wide attention in machine learning due to their intrinsic capacity to encode underlying structural correlation in data. Many successful Riemannian metrics have been proposed to reflect the non-Euclidean geometry of SPD manifolds. However, most existing metric tensors are fixed, which might lead to sub-optimal performance for SPD matrix learning, especially for deep SPD neural networks. To remedy this limitation, we leverage the commonly encountered pullback techniques and propose Adaptive Log-Euclidean Metrics (ALEMs), which extend the widely used Log-Euclidean Metric (LEM). Compared with the previous Riemannian metrics, our metrics contain learnable parameters, which can better adapt to the complex dynamics of Riemannian neural networks with minor extra computations. We also present a complete theoretical analysis to support our ALEMs, including algebraic and Riemannian properties. The experimental and theoretical results demonstrate the merit of the proposed metrics in improving the performance of SPD neural networks. The efficacy of our metrics is further showcased on a set of recently developed Riemannian building blocks, including Riemannian batch normalization, Riemannian Residual blocks, and Riemannian classifiers.

Abstract:
In halftone-driven imaging pipelines focus is often placed on halftone pattern design as the main contributor to overall output quality. However, for sequential or cumulative imaging technologies, such as multi-pass printing, an important element is also pattern partitioning – how the overall halftone pattern is divided among the different partial imaging events such as printing passes. Partitioning is usually designed agnostically of the halftone pattern, making it impossible to optimize for the joint effect of halftone and partitioning. However, even a good halftone pattern coupled with a good partitioning scheme does not guarantee well partitioned halftones and can impact image quality attributes. In this paper a novel approach called PARAOMASKING is presented that benefits from the pattern-determinism of PARAWACS halftoning and proposes a partitioning scheme for multi-pass printing such that optimality is also obtained for partitioned halftones. Results – both digital and printed – show how it can lead to significant improvements in partial pattern quality and overall pattern quality. Consequently, output attributes such as grain, coalescence and pattern robustness are improved. The focus here is on blue-noise pattern preservation but the approach can also be extended to other objectives, e.g., maximizing per-pass clustering.

Abstract:
The true label plays an important role in semi-supervised medical image segmentation (SSMIS) because it can provide the most accurate supervision information when the label is limited. The popular SSMIS method trains labeled and unlabeled data separately, and the unlabeled data cannot be directly supervised by the true label. This limits the contribution of labels to model training. Is there an interactive mechanism that can break the separation between two types of data training to maximize the utilization of true labels? Inspired by this, we propose a novel consistency learning framework based on the non-parametric distance metric of boundary-aware prototypes to alleviate this problem. This method combines CNN-based linear classification and nearest neighbor-based non-parametric classification into one framework, encouraging the two segmentation paradigms to have similar predictions for the same input. More importantly, the prototype can be clustered from both labeled and unlabeled data features so that it can be seen as a bridge for interactive training between labeled and unlabeled data. When the prototype-based prediction is supervised by the true label, the supervisory signal can simultaneously affect the feature extraction process of both data. In addition, boundary-aware prototypes can explicitly model the differences in boundaries and centers of adjacent categories, so pixel-prototype contrastive learning is introduced to further improve the discriminability of features and make them more suitable for non-parametric distance measurement. Experiments show that although our method uses a modified lightweight UNet as the backbone, it outperforms the comparison method using a 3D VNet with more parameters.

Abstract:
Due to the advancement of deep learning, the performance of salient object detection (SOD) has been significantly improved. However, deep learning-based techniques require a sizable amount of pixel-wise annotations. To relieve the burden of data annotation, a variety of deep weakly-supervised and unsupervised SOD methods have been proposed, yet the performance gap between them and fully supervised methods remains significant. In this paper, we propose a novel, cost-efficient salient object detection framework, which can adapt models from synthetic data to real-world data with the help of a limited number of actively selected annotations. Specifically, we first construct a synthetic SOD dataset by copying and pasting foreground objects into pure background images. With the masks of foreground objects taken as the ground-truth saliency maps, this dataset can be used for training the SOD model initially. However, due to the large domain gap between synthetic images and real-world images, the performance of the initially trained model on the real-world images is deficient. To transfer the model from the synthetic dataset to the real-world datasets, we further design an uncertainty-aware active domain adaptive algorithm to generate labels for the real-world target images. The prediction variances against data augmentations are utilized to calculate the superpixel-level uncertainty values. For those superpixels with relatively low uncertainty, we directly generate pseudo labels according to the network predictions. Meanwhile, we select a few superpixels with high uncertainty scores and assign labels to them manually. This labeling strategy is capable of generating high-quality labels without incurring too much annotation cost. Experimental results on six benchmark SOD datasets demonstrate that our method outperforms the existing state-of-the-art weakly-supervised and unsupervised SOD methods and is even comparable to the fully supervised ones. Code will be released at: https://github.com/czh-3/UADA.

Abstract:
Knowledge distillation aims to achieve model compression by transferring knowledge from complex teacher models to lightweight student models. To reduce reliance on pre-trained teacher models, self-distillation methods utilize knowledge from the model itself as additional supervision. However, their performance is limited by the same or similar network architecture between the teacher and student. In order to increase architecture variety, we propose a new self-distillation framework called restructured self-distillation (RSD), which involves restructuring both the teacher and student networks. The self-distilled model is expanded into a multi-branch topology to create a more powerful teacher. During training, diverse student sub-networks are generated by randomly discarding the teacher’s branches. Additionally, the teacher and student models are linked by a randomly inserted feature mixture block, introducing additional knowledge distillation in the mixed feature space. To avoid extra inference costs, the branches of the teacher model are then converted back to its original structure equivalently. Comprehensive experiments have demonstrated the effectiveness of our proposed framework for most architectures on CIFAR-10/100 and ImageNet datasets. Code is available at https://github.com/YujieZheng99/RSD.

Abstract:
This article presents an algorithmic framework for real-time condition monitoring and state forecasting using multivariate data demonstrated on thermal imagery data of a ship’s engine. The proposed method aims to improve the accuracy, efficiency, and robustness of condition monitoring and state predictions by identifying the most informative sampling locations of high-dimensional datasets and extracting the underlying dynamics of the system. The method is based on a combination of Proper Orthogonal Decomposition (POD), Optimal Sampling Location (OSL), and Dynamic Mode Decomposition (DMD), allowing the identification of key features in the system’s behavior and predicting future states. Based on thermal imagery data, it is shown how thermal areas of interest can be classified via POD. By extracting the POD modes of the data, dimensions can be drastically reduced and via OSL, optimal sampling locations are found. In addition, nonlinear kernel-based Support Vector Regression (SVR) is used to build models between the optimal locations, enabling the imputation of erroneous data to improve the overall robustness. To build predictive data-driven models, DMD is applied on the subspace obtained by OSL, which leads to an intensive lower demand of computational resources, making the proposed method real-time applicable. Furthermore, an unsupervised approach for anomaly detection is proposed using OSL. The anomaly detection framework is coupled with the state prediction framework, which extends the capabilities to real-time anomaly predictions. In summary, this study proposes a robust predictive condition monitoring framework for real-time risk assessment.

Abstract:
Recent progress in single-image super-resolution (SISR) has achieved remarkable performance, yet the computational costs of these methods remain a challenge for deployment on resource-constrained devices. In particular, transformer-based methods, which leverage self-attention mechanisms, have led to significant breakthroughs but also introduce substantial computational costs. To tackle this issue, we introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR), offering an effective and efficient solution for lightweight image super-resolution. The proposed method inherits the advantages of both convolution-based and transformer-based approaches. Specifically, CFSR utilizes large kernel convolutions as a feature mixer to replace the self-attention module, efficiently modeling long-range dependencies and extensive receptive fields with minimal computational overhead. Furthermore, we propose an edge-preserving feed-forward network (EFN) designed to achieve local feature aggregation while effectively preserving high-frequency information. Extensive experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance compared to existing lightweight SR methods. When benchmarked against state-of-the-art methods such as ShuffleMixer, the proposed CFSR achieves a gain of 0.39 dB on the Urban100 dataset for the x2 super-resolution task while requiring 26% and 31% fewer parameters and FLOPs, respectively. The code and pre-trained models are available at https://github.com/Aitical/CFSR.

Abstract:
In the practical application of image generation, dealing with long-tailed data distributions is a common challenge for diffusion-based generative models. To tackle this issue, we investigate the head-class accumulation effect in diffusion models’ latent space, particularly focusing on its correlation to the noise sampling strategy. Our experimental analysis indicates that employing a consistent sampling distribution for the noise prior across all classes leads to a significant bias towards head classes in the noise sampling distribution, which results in poor quality and diversity of the generated images. Motivated by this observation, we propose a novel sampling strategy named Bias-aware Prior Adjusting (BPA) to debias diffusion models in the class-imbalanced scenario. With BPA, each class is automatically assigned an adaptive noise sampling distribution prior during training, effectively mitigating the influence of class imbalance on the generation process. Extensive experiments on several benchmarks demonstrate that images generated using our proposed BPA showcase elevated diversity and superior quality.

Abstract:
In an effort to improve the efficiency and scalability of single-image super-resolution (SISR) applications, we introduce AnySR, to rebuild existing arbitrary-scale SR methods into any-scale, any-resource implementation. As a contrast to off-the-shelf methods that solve SR tasks across various scales with the same computing costs, our AnySR innovates in: 1) building arbitrary-scale tasks as any-resource implementation, reducing resource requirements for smaller scales without additional parameters; 2) enhancing any-scale performance in a feature-interweaving fashion, inserting scale pairs into features at regular intervals and ensuring correct feature/scale processing. The efficacy of our AnySR is fully demonstrated by rebuilding most existing arbitrary-scale SISR methods and validating on five popular SISR test datasets. The results show that our AnySR implements SISR tasks in a computing-more-efficient fashion, and performs on par with existing arbitrary-scale SISR methods. For the first time, we realize SISR tasks as not only any-scale in literature, but also as any-resource. Our code is available at https://github.com/CrispyFeSo4/AnySR.

Abstract:
Due to different installation angles, heights, and positions of the camera installation in real-world scenes, it is difficult for crowd counting models to work in unseen surveillance scenes. In this paper, we are interested in accurate crowd counting based on the data collected by any surveillance camera, that is to count the crowd from any scene given only one annotated image from that scene. To this end, we firstly pose crowd counting as a one-shot learning task. Through the metric-learning, we propose a simple yet effective method that firstly estimates crowd characteristics and then transfers them to guide the model to count the crowd. Specifically, to fully capture these crowd characteristics of the target scene, we devise the Multi-Prototype Learner to learn the prototypes of foreground and density from the limited support image using the Expectation-Maximization algorithm. To learn the adaptation capability for any unseen scene, estimated multi prototypes are proposed to guide the crowd counting of query images in a local-to-global way. CNN is utilized to activate the local features. And transformer is introduced to correlate global features. Extensive experiments on three surveillance datasets suggest that our method outperforms the SOTA methods in the few-shot crowd counting.

Abstract:
Numerous methods for unsupervised domain adaptation (UDA) have been proposed in semantic segmentation, achieving remarkable improvements. These methods are categorized into an adversarial learning-based approach that utilizes an additional discriminator and image translation model, and a self-supervised approach that uses a teacher model to generate pseudo labels. Among them, the self-supervised UDA approaches based on a self-training show excellent adaptability in semantic segmentation. However, erroneous estimates of the pseudo ground truths (PGTs) used in the self-training may often lead to inaccurate updates in the teacher model. Although several attempts have been made to address this issue, the teacher model updated through exponential moving average (EMA) still has a risk of propagating inaccuracies from the PGTs. Inspired by the fact that UDA shares similar principles with knowledge distillation (KD), we revisit the self-training based UDA approach from the perspective of KD and propose a novel UDA approach that employs two different teacher models. Specifically, we utilize both an EMA-updated teacher model to generate PGTs and a frozen teacher model pretrained with source data to transfer knowledge on a feature space. Since the frozen teacher model has no constraint on the model architecture unlike the EMA updated teacher model, we can effectively leverage a better representation power from the larger frozen teacher. Extensive experiments on various backbones (DeepLab-V2 and DAFormer) and scenarios (GTA 5~\rightarrow Cityscapes and SYNTHIA \rightarrow Cityscapes) show that the proposed method improves segmentation performance in the target domain with its scalability. In particular, our method achieves comparable or better performance than state-of-the-arts even with a lightweight backbone.

Abstract:
Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches are only capable of handling a specific type of degradation, which is often insufficient in real-world scenarios, such as rainy-snowy or rainy-hazy weather. Towards being able to address these situations, we propose a multi-weather Transformer, or MWFormer for short, which is a holistic vision Transformer that aims to solve multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing, in response to multiple possible weather. Moreover, MWFormer allows for a novel way of tuning, during application, to either a single type of weather restoration or to hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements compared to existing state-of-the-art methods, without requiring much computational cost. Moreover, we demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: https://github.com/taco-group/MWFormer.

Abstract:
Remote Photoplethysmography (rPPG) has been attracting increasing attention due to its potential in a wide range of application scenarios such as physical training, clinical monitoring, and face anti-spoofing. On top of conventional solutions, deep-learning approach starts to dominate in rPPG estimation and achieves top-level performance. However, most of them try to integrate preprocessing steps such as the ROI selection into an end-to-end network, which may diverge the attention and also limit the generalization in other scenarios with different input skin regions. In this work, we focus on learning the intrinsic rPPG feature and design a lightweight but effective rPPG estimation network based on spatiotemporal convolution. To further improve the robustness, on top of the basic design we propose the Noise-Disentangled DeeprPPG (ND-DeeprPPG) by disentangling the environmental noise from the raw rPPG feature with an adversarial canonical correlation analysis learning strategy. Background regions are employed as a reference to guide the noise disentangling in a self-supervised manner. Extensive experiments show that our ND-DeeprPPG not only outperforms the state-of-the-arts on heart rate estimation but also exhibits promising robustness in cross-skin-region, cross-dataset scenarios and other rPPG-based tasks.

Abstract:
Text-based Person Search (TBPS) aims to retrieve images of target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them crossing modality. Existing methods utilize external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, in order to obtain the local textual representation, we group textual features from the channel dimension based on the semantic cues of language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, including a vision-language similarity transfer and a class probability transfer, to adaptively propagate information of the vision-guided textual features to semantic-group textual features. With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features without external tools and complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate its superiority over state-of-the-art methods.

Abstract:
Recent years have witnessed the superiority of deep learning-based algorithms in the field of HSI classification. However, a prerequisite for the favorable performance of these methods is a large number of refined pixel-level annotations. Due to atmospheric changes, sensor differences, and complex land cover distribution, pixel-level labeling of high-dimensional hyperspectral image (HSI) is extremely difficult, time-consuming, and laborious. To overcome the above hurdle, an Image-To-pixEl Representation (ITER) approach is proposed in this paper. To the best of our knowledge, this is the first time that image-level annotation is introduced to predict pixel-level classification maps for HSI. The proposed model is along the lines of subject modeling to boundary refinement, corresponding to pseudo-label generation and pixel-level prediction. Concretely, in the pseudo-label generation part, the spectral/spatial activation, spectral-spatial alignment loss, and geographic element enhancement are sequentially designed to locate discriminate regions of each category, optimize multi-domain class activation map (CAM) collaborative training, and refine labels, respectively. For the pixel-level prediction portion, a high frequency-aware self-attention in a high-enhanced transformer is put forward to achieve detailed feature representation. With the two-stage pipeline, ITER explores weakly supervised HSI classification with image-level tags, bridging the gap between image-level annotation and dense prediction. Extensive experiments in three benchmark datasets with state-of-the-art (SOTA) works show the performance of the proposed approach.

Abstract:
Image outpainting gains increasing attention since it can generate the complete scene from a partial view, providing a valuable solution to construct 360° panoramic images. As image outpainting suffers from the intrinsic issue of unidirectional completion flow, previous methods convert the original problem into inpainting, which allows a bidirectional flow. However, we find that inpainting has its own limitations and is inferior to outpainting in certain situations. The question of how they may be combined for the best of both has as yet remained under-explored. In this paper, we provide a deep analysis of the differences between inpainting and outpainting, which essentially depends on how the source pixels contribute to the unknown regions under different spatial arrangements. Motivated by this analysis, we present a Cylin-Painting framework that involves meaningful collaborations between inpainting and outpainting and efficiently fuses the different arrangements, with a view to leveraging their complementary benefits on a seamless cylinder. Nevertheless, straightforwardly applying the cylinder-style convolution often generates visually unpleasing results as it discards important positional information. To address this issue, we further present a learnable positional embedding strategy to incorporate the missing component of positional encoding into the cylinder convolution, which significantly improves the panoramic results. It is noted that while developed for image outpainting, the proposed algorithm can be effectively extended to other panoramic vision tasks, such as object detection, depth estimation, and image super-resolution. Code will be made available at https://github.com/KangLiao929/Cylin-Painting.

Abstract:
Self-supervised depth estimation methods can achieve competitive performance using only unlabeled monocular videos, but they suffer from the uncertainty of jointly learning depth and pose without any ground truths of both tasks. Supervised framework provides robust and superior performance but is limited by the scope of the labeled data. In this paper, we introduce SENSE, a novel learning paradigm for self-supervised monocular depth estimation that progressively evolves the prediction result using supervised learning, but without requiring labeled data. The key contribution of our approach stems from the novel use of the pseudo labels – the noisy depth estimation from the self-supervised methods. We surprisingly find that a fully supervised depth estimation network trained using the pseudo labels can produce even better results than its “ground truth”. To push the envelope further, we then evolve the self-supervised backbone by replacing its depth estimation branch with that fully supervised network. Based on this idea, we devise a comprehensive training pipeline that alternatively enhances the two key branches (depth and pose estimation) of the self-supervised backbone network. Our proposed approach can effectively ease the difficulty of multi-task training in self-supervised depth estimation. Experimental results have shown that our proposed approach achieves state-of-the-art results on the KITTI dataset.

Abstract:
Early action prediction (EAP) aims to recognize human actions from a part of action execution in ongoing videos, which is an important task for many practical applications. Most prior works treat partial or full videos as a whole, ignoring rich action knowledge hidden in videos, i.e., semantic consistencies among different partial videos. In contrast, we partition original partial or full videos to form a new series of partial videos and mine the Action-Semantic Consistent Knowledge (ASCK) among these new partial videos evolving in arbitrary progress levels. Moreover, a novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework is proposed for EAP. Firstly, we use a two-stream pre-trained model to extract features of videos. Secondly, we treat the RGB or flow features of the partial videos as nodes and their action semantic consistencies as edges. Next, we build a bi-directional semantic graph for the teacher network and a single-directional semantic graph for the student network to model rich ASCK among partial videos. The MSE and MMD losses are incorporated as our distillation loss to enrich the ASCK of partial videos from the teacher to the student network. Finally, we obtain the final prediction by summering the logits of different subnetworks and applying a softmax layer. Extensive experiments and ablative studies have been conducted, demonstrating the effectiveness of modeling rich ASCK for EAP. With the proposed RACK, we have achieved state-of-the-art performance on three benchmarks. The code is available at https://github.com/lily2lab/RACK.git.

Abstract:
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR@50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms.

Abstract:
Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. This paper proposes a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame using the previous frame by employing a novel feature space inter-prediction network. The proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. The proposed method introduces a novel predictor network for motion compensation in the feature domain to map the latent representation of the previous frame to the coordinates of the current frame to predict the current frame’s feature embedding. The framework transmits the residual of the predicted features and the actual features by compressing them using a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. The proposed framework is compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). The proposed method achieves more than 88% BD-Rate (Bjøntegaard Delta Rate) reduction against G-PCCv20 Octree, more than 56% BD-Rate savings against G-PCCv20 Trisoup, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC. These significant performance gains are cross-checked and verified in the MPEG working group.

Abstract:
This research investigates the potential of in vivo learning to enhance visual representation learning for image-based person re-identification (re-ID). Compared to traditional self-supervised learning (which require external data), the introduced in vivo learning utilizes supervisory labels generated from pedestrian images to improve re-ID accuracy without relying on external data sources. Three carefully designed in vivo learning tasks, leveraging statistical regularities within images, are proposed without the need for laborious manual annotations. These tasks enable feature extractors to learn more comprehensive and discriminative person representations by jointly modeling various aspects of human biological structure information, contributing to enhanced re-ID performance. Notably, the method seamlessly integrates with existing re-ID frameworks, requiring minimal modifications and no additional data beyond the existing training set. Extensive experiments on diverse datasets, including Market1501, CUHK03-NP, Celeb-reID, Celeb-reid-light, PRCC, and LTCC, demonstrate substantial enhancements in rank-1 precision compared to state-of-the-art methods.

Abstract:
Recent learning-based methods demonstrate their strong ability to estimate depth for multi-view stereo reconstruction. However, most of these methods directly extract features via regular or deformable convolutions, and few works consider the alignment of the receptive fields between views while constructing the cost volume. Through analyzing the constraint and inference of previous MVS networks, we find that there are still some shortcomings that hinder the performance. To deal with the above issues, we propose an Epipolar-Guided Multi-View Stereo Network with Interval-Aware Label (EI-MVSNet), which includes an epipolar-guided volume construction module and an interval-aware depth estimation module in a unified architecture for MVS. The proposed EI-MVSNet enjoys several merits. First, in the epipolar-guided volume construction module, we construct cost volume with features from aligned receptive fields between different pairs of reference and source images via epipolar-guided convolutions, which take rotation and scale changes into account. Second, in the interval-aware depth estimation module, we attempt to supervise the cost volume directly and make depth estimation independent of extraneous values by perceiving the upper and lower boundaries, which can achieve fine-grained predictions and enhance the reasoning ability of the network. Extensive experimental results on two standard benchmarks demonstrate that our EI-MVSNet performs favorably against state-of-the-art MVS methods. Specifically, our EI-MVSNet ranks 1_st on both intermediate and advanced subsets of the Tanks and Temples benchmark, which verifies the high precision and strong robustness of our model.

Abstract:
Geodesic models are known as an efficient tool for solving various image segmentation problems. Most of existing approaches only exploit local pointwise image features to track geodesic paths for delineating the objective boundaries. However, such a segmentation strategy cannot take into account the connectivity of the image edge features, increasing the risk of shortcut problem, especially in the case of complicated scenario. In this work, we introduce a new image segmentation model based on the minimal geodesic framework in conjunction with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme. Specifically, the adaptive cut can disconnect the image domain such that the target contours are imposed to pass through this cut only once. The boundary proposals are comprised of precomputed image edge segments, providing the connectivity information for our segmentation model. These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal paths-based image segmentation approaches.

Abstract:
Limited-angle tomographic reconstruction is one of the typical ill-posed inverse problems, leading to edge divergence with degraded image quality. Recently, deep learning has been introduced into image reconstruction and achieved great results. However, existing deep reconstruction methods have not fully explored data consistency, resulting in poor performance. In addition, deep reconstruction methods are still mathematically inexplicable and unstable. In this work, we propose an iterative residual optimization network (IRON) for limited-angle tomographic reconstruction. First, a new optimization objective function is established to overcome false negative and positive artifacts induced by limited-angle measurements. We integrate neural network priors as a regularizer to explore deep features within residual data. Furthermore, the block-coordinate descent is employed to achieve a novel iterative framework. Second, a convolution assisted transformer is carefully elaborated to capture both local and long-range pixel interactions simultaneously. Regarding the visual transformer, the multi-head attention is further redesigned to reduce computational costs and protect reconstructed image features. Third, based on the relative error convergence property of the convolution assisted transformer, a mathematical convergence analysis is also provided for our IRON. Both numerically simulated and clinically collected real cardiac datasets are employed to validate the effectiveness and advantages of the proposed IRON. The results show that IRON outperforms other state-of-the-art methods.

Abstract:
Due to many unmarked data, there has been tremendous interest in developing unsupervised feature selection methods, among which graph-guided feature selection is one of the most representative techniques. However, the existing feature selection methods have the following limitations: (1) All of them only remove redundant features shared by all classes and neglect the class-specific properties; thus, the selected features cannot well characterize the discriminative structure of the data. (2) The existing methods only consider the relationship between the data and the corresponding neighbor points by Euclidean distance while neglecting the differences with other samples. Thus, existing methods cannot encode discriminative information well. (3) They adaptively learn the graph in the original or embedding space. Thus, the learned graph cannot characterize the data’s cluster structure. To solve these limitations, we present a novel unsupervised discriminative feature selection via contrastive graph learning, which integrates feature selection and graph learning into a uniform framework. Specifically, our model adaptively learns the affinity matrix, which helps characterize the data’s intrinsic and cluster structures in the original space and the contrastive learning. We minimize \ell _1,2 -norm regularization on the projection matrix to preserve class-specific features and remove redundant features shared by all classes. Thus, the selected features encode discriminative information well and characterize the discriminative structure of the data. Generous experiments indicate that our proposed model has state-of-the-art performance.

Abstract:
The majority of existing works explore Unsupervised Domain Adaptation (UDA) with an ideal assumption that samples in both domains are available and complete. In real-world applications, however, this assumption does not always hold. For instance, data-privacy is becoming a growing concern, the source domain samples may be not publicly available for training, leading to a typical Source-Free Domain Adaptation (SFDA) problem. Traditional UDA methods would fail to handle SFDA since there are two challenges in the way: the data incompleteness issue and the domain gaps issue. In this paper, we propose a visually SFDA method named Adversarial Style Matching (ASM) to address both issues. Specifically, we first train a style generator to generate source-style samples given the target images to solve the data incompleteness issue. We use the auxiliary information stored in the pre-trained source model to ensure that the generated samples are statistically aligned with the source samples, and use the pseudo labels to keep semantic consistency. Then, we feed the target domain samples and the corresponding source-style samples into a feature generator network to reduce the domain gaps with a self-supervised loss. An adversarial scheme is employed to further expand the distributional coverage of the generated source-style samples. The experimental results verify that our method can achieve comparative performance even compared with the traditional UDA methods with source samples for training.

Abstract:
Video inpainting gains an increasing amount of attention ascribed to its wide applications in intelligent video editing. However, despite tremendous progress made in RGB video inpainting, the existing RGB-D video inpainting models are still incompetent to inpaint real-world RGB-D videos, as they simply fuse color and depth via explicit feature concatenation, neglecting the natural modality gap. Moreover, current RGB-D video inpainting datasets are synthesized with homogeneous and delusive RGB-D data, which is far from real-world application and cannot provide comprehensive evaluation. To alleviate these problems and achieve real-world RGB-D video inpainting, on one hand, we propose a Mutually-guided Color and Depth Inpainting Network (MCD-Net), where color and depth are reciprocally leveraged to inpaint each other implicitly, mitigating the modality gap and fully exploiting cross-modal association for inpainting. On the other hand, we build a Video Inpainting with Depth (VID) dataset to supply diverse and authentic RGB-D video data with various object annotation masks to enable comprehensive evaluation for RGB-D video inpainting under real-world scenes. Experimental results on the DynaFill benchmark and our collected VID dataset demonstrate our MCD-Net not only yields the state-of-the-art quantitative performance but successfully achieves high-quality RGB-D video inpainting under real-world scenes. All resources are available at https://github.com/JCATCV/MCD-Net.

Abstract:
Effectively summarizing and re-expressing video content by natural languages in a more human-like fashion is one of the key topics in the field of multimedia content understanding. Despite good progress made in recent years, existing efforts usually overlooked the emotions in user-generated videos, thus making the generated sentence a bit boring and soulless. To fill the research gap, this paper presents a novel emotional video captioning framework in which we design a Vision-based Emotion Interpretation Network to effectively capture the emotions conveyed in videos and describe the visual content in both factual and emotional languages. Specifically, we first model the emotion distribution over an open psychological vocabulary to predict the emotional state of videos. Then, guided by the discovered emotional state, we incorporate visual context, textual context, and visual-textual relevance into an aggregated multimodal contextual vector to enhance video captioning. Furthermore, we optimize the network in a new emotion-fact coordinated way that involves two losses— Emotional Indication Loss and Factual Contrastive Loss, which penalize the error of emotion prediction and visual-textual factual relevance, respectively. In other words, we innovatively introduce emotional representation learning into an end-to-end video captioning network. Extensive experiments on public benchmark datasets, EmVidCap and EmVidCap-S, demonstrate that our method can significantly outperform the state-of-the-art methods by a large margin. Quantitative ablation studies and qualitative analyses clearly show that our method is able to effectively capture the emotions in videos and thus generate emotional language sentences to interpret the video content.

Abstract:
The image-level label has prevailed in weakly supervised semantic segmentation tasks due to its easy availability. Since image-level labels can only indicate the existence or absence of specific categories of objects, visualization-based techniques have been widely adopted to provide object location clues. Considering class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization. However, without proper constraints, the expanded activation will easily intrude into the background region. In this paper, we propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation of attention expansion. Specifically, we propose a CAM-driven reconstruction module to directly reconstruct the input image from deep CAM features, which constrains the diffusion of last-layer object attention by preserving the coarse spatial structure of the image content. Moreover, we propose an activation self-modulation module to refine CAMs with finer spatial structure details by enhancing regional consistency. Without external saliency models to provide background clues, our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively, demonstrating the superiority of our proposed approach. The source codes and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/SSC.

Abstract:
Hashing and quantization have greatly succeeded by benefiting from deep learning for large-scale image retrieval. Recently, deep product quantization methods have attracted wide attention. However, representation capability of codewords needs to be further improved. Moreover, since the number of codewords in the codebook depends on experience, representation capability of codewords is usually imbalanced, which leads to redundancy or insufficiency of codewords and reduces retrieval performance. Therefore, in this paper, we propose a novel deep product quantization method, named Entropy Optimized deep Weighted Product Quantization (EOWPQ), which not only encodes samples into the weighted codewords in a new flexible manner but also balances the codeword assignment, improving while balancing representation capability of codewords. Specifically, we encode samples using the linear weighted sum of codewords instead of a single codeword as traditionally. Meanwhile, we establish the linear relationship between the weighted codewords and semantic labels, which effectively maintains semantic information of codewords. Moreover, in order to balance the codeword assignment, that is, avoiding some codewords representing most samples or some codewords representing very few samples, we maximize the entropy of the coding probability distribution and obtain the optimal coding probability distribution of samples by utilizing optimal transport theory, which achieves the optimal assignment of codewords and balances representation capability of codewords. The experimental results on three benchmark datasets show that EOWPQ can achieve better retrieval performance and also show the improvement of representation capability of codewords and the balance of codeword assignment.

Abstract:
Compared with other objects, smoke semantic segmentation (SSS) is more difficult and challenging due to some special characteristics of smoke, such as non-rigid, translucency, variable mode and so on. To achieve accurate positioning of smoke in real complex scenes and promote the development of intelligent fire detection, we propose a Smoke-Aware Global-Interactive Non-local Network (SAGINN) for SSS, which harness the power of both convolution and transformer to capture local and global information simultaneously. Non-local is a powerful means for modeling long-range context dependencies, however, friendliness to single-scale low-resolution features limits its potential to produce high-quality representations. Therefore, we propose a Global-Interactive Non-local (GINL) module, leveraging global interaction between multi-scale key information to improve the robustness of feature representations. To solve the interference of smoke-like objects, a Pyramid High-level Semantic Aggregation (PHSA) module is designed, where the learned high-level category semantics from classification aids model by providing additional guidance to correct the wrong information in segmentation representations at the image level and alleviate the inter-class similarity problem. Besides, we further propose a novel loss function, termed Smoke-aware loss (SAL), by assigning different weights to different objects contingent on their importance. We evaluate our SAGINN on extensive synthetic and real data to verify its generalization ability. Experimental results show that SAGINN achieves 83% average mIoU on the three testing datasets (83.33%, 82.72% and 82.94%) of SYN70K with an accuracy improvement of about 0.5%, 0.002 mMse and 0.805~F_\beta on SMOKE5K, which can obtain more accurate location and finer boundaries of smoke, achieving satisfactory results on smoke-like objects.

Abstract:
Few-shot action recognition aims to recognize new unseen categories with only a few labeled samples of each class. However, it still suffers from the limitation of inadequate data, which easily leads to the overfitting and low-generalization problems. Therefore, we propose a cross-modal contrastive learning network (CCLN), consisting of an adversarial branch and a contrastive branch, to perform effective few-shot action recognition. In the adversarial branch, we elaborately design a prototypical generative adversarial network (PGAN) to obtain synthesized samples for increasing training samples, which can mitigate the data scarcity problem and thereby alleviate the overfitting problem. When the training samples are limited, the obtained visual features are usually suboptimal for video understanding as they lack discriminative information. To address this issue, in the contrastive branch, we propose a cross-modal contrastive learning module (CCLM) to obtain discriminative feature representations of samples with the help of semantic information, which can enable the network to enhance the feature learning ability at the class-level. Moreover, since videos contain crucial sequences and ordering information, thus we introduce a spatial-temporal enhancement module (SEM) to model the spatial context within video frames and the temporal context across video frames. The experimental results show that the proposed CCLN outperforms the state-of-the-art few-shot action recognition methods on four challenging benchmarks, including Kinetics, UCF101, HMDB51 and SSv2.

Abstract:
Food computing brings various perspectives to computer vision like vision-based food analysis for nutrition and health. As a fundamental task in food computing, food detection needs Zero-Shot Detection (ZSD) on novel unseen food objects to support real-world scenarios, such as intelligent kitchens and smart restaurants. Therefore, we first benchmark the task of Zero-Shot Food Detection (ZSFD) by introducing FOWA dataset with rich attribute annotations. Unlike ZSD, fine-grained problems in ZSFD like inter-class similarity make synthesized features inseparable. The complexity of food semantic attributes further makes it more difficult for current ZSD methods to distinguish various food categories. To address these problems, we propose a novel framework ZSFDet to tackle fine-grained problems by exploiting the interaction between complex attributes. Specifically, we model the correlation between food categories and attributes in ZSFDet by multi-source graphs to provide prior knowledge for distinguishing fine-grained features. Within ZSFDet, Knowledge-Enhanced Feature Synthesizer (KEFS) learns knowledge representation from multiple sources (e.g., ingredients correlation from knowledge graph) via the multi-source graph fusion. Conditioned on the fusion of semantic knowledge representation, the region feature diffusion model in KEFS can generate fine-grained features for training the effective zero-shot detector. Extensive evaluations demonstrate the superior performance of our method ZSFDet on FOWA and the widely-used food dataset UECFOOD-256, with significant improvements by 1.8% and 3.7% ZSD mAP compared with the strong baseline RRFS. Further experiments on PASCAL VOC and MS COCO prove that enhancement of the semantic knowledge can also improve the performance on general ZSD. Code and dataset are available at https://github.com/LanceZPF/KEFS.

Abstract:
The domain generalization approach seeks to develop a universal model that performs well on unknown target domains with the aid of diverse source domains. Data augmentation has proven to be an effective method to enhance domain generalization in computer vision. Recently, semantic-level based data augmentation has yielded remarkable results. However, these methods focus on sampling semantic directions on feature space from intra-class and intra-domain, limiting the diversity of the source domain. To address this issue, we propose a novel approach called Inter-Class and Inter-Domain Semantic Augmentation (CDSA) for domain generalization. We first introduce a sampling-based method called CrossSmooth to obtain semantic directions from inter-class. Then, CrossVariance obtains the styles of different domains by sampling semantic directions. Our experiments on four well-known domain generalization benchmark datasets (Digits-DG, PACS, Office-Home, and DomainNet) demonstrate the effectiveness of our approach. We also validate our approach on commonly-used semantic segmentation datasets, namely GTAV, SYNTHIA, Cityscapes, Mapillary, and BDDS which also show significant improvements.

Abstract:
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples in support images. Although progress has been made recently by combining prototype-based metric learning, existing methods still face two main challenges. First, various intra-class objects between the support and query images or semantically similar inter-class objects can seriously harm the segmentation performance due to their poor feature representations. Second, the latent novel classes are treated as the background in most methods, leading to a learning bias, whereby these novel classes are difficult to correctly segment as foreground. To solve these problems, we propose a dual-branch learning method. The class-specific branch encourages representations of objects to be more distinguishable by increasing the inter-class distance while decreasing the intra-class distance. In parallel, the class-agnostic branch focuses on minimizing the foreground class feature distribution and maximizing the features between the foreground and background, thus increasing the generalizability to novel classes in the test stage. Furthermore, to obtain more representative features, pixel-level and prototype-level semantic learning are both involved in the two branches. The method is evaluated on PASCAL- 5^i~1 -shot, PASCAL- 5^i~5 -shot, COCO- 20^i~1 -shot, and COCO- 20^i~5 -shot, and extensive experiments show that our approach is effective for few-shot semantic segmentation despite its simplicity.

Abstract:
Non-uniqueness and instability are characteristic features of image reconstruction methods. As a result, it is necessary to develop regularization methods that can be used to compute reliable approximate solutions. A regularization method provides a family of stable reconstructions that converge to a specific solution of the noise-free problem as the noise level tends to zero. The standard regularization technique is defined by a variational image reconstruction that minimizes a data discrepancy augmented by a regularizer. The actual numerical implementation makes use of iterative methods, often involving proximal mappings of the regularizer. In recent years, Plug-and-Play (PnP) image reconstruction has been developed as a new powerful generalization of variational methods based on replacing proximal mappings by more general image denoisers. While PnP iterations yield excellent results, neither stability nor convergence in the sense of regularization have been studied so far. In this work, we extend the idea of PnP by considering families of PnP iterations, each accompanied by its own denoiser. As our main theoretical result, we show that such PnP reconstructions lead to stable and convergent regularization methods. This shows for the first time that PnP is as mathematically justified for robust image reconstruction as variational methods.

Abstract:
Traditional CNN-based pipelines for panoptic segmentation decompose the task into two subtasks, i.e., instance segmentation and semantic segmentation. In this way, they extract information with multiple branches, perform two subtasks separately and finally fuse the results. However, excessive feature extraction and complicated processes make them time-consuming. We propose IDNet to decompose panoptic segmentation at information level. IDNet only extracts two kinds of information and directly completes panoptic segmentation task, saving the efforts to extract extra information and to fuse subtasks. By decomposing panoptic segmentation into category information and location information and recomposing them with a serial pipeline, the process for panoptic segmentation is simplified greatly and unified with regard to stuff and things. We also adopt two correction losses specially designed for our serial pipeline, guaranteeing the overall predicting performance. As a result, IDNet strikes a better balance between effectiveness and efficiency, achieving the fastest inference speed of 24.2 FPS at a resolution of 800×1333 on a Tesla V100 GPU and a PQ of 43.8, which is comparable in one-stage CNN-based methods. The code will be released at https://github.com/AronLin/IDNet.

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, the National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Alibaba Group, Hangzhou, China; College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhejiang Key Laboratory of E-Commerce, Zhoushan, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract:
Current research on cross-modal retrieval is mostly English-oriented, as the availability of a large number of English-oriented human-labeled vision-language corpora. In order to break the limit of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the translated sentences from MT are generally imperfect in describing the corresponding visual contents. Improperly assuming the pseudo-parallel data are correctly correlated will make the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. Besides, our proposed method also shows a good expansibility to cross-lingual image-text baselines and a decent generalization on out-of-domain data.

Abstract:
As an important and challenging problem in computer vision, scene graph generation (SGG) aims to find out the underlying semantic relationships among objects from a given image for scene understanding. Usually, prevalent SGG approaches adopt a learning pipeline with the assumption that there exists only a single relationship for a particular object pair. Considering the common phenomenon that a pair of objects can be attached by multiple relationships, we propose a multi-label scene graph generation pipeline with multi-grained features (MLMG-SGG), which formulates the relationship detection as a multi-label classification problem during training while generating multigraphs at inference time. In order to better model the fine-grained relationships, the proposed pipeline encodes the feature representation of SGG on different spatial scales by a specially designed Multi-Grained Module (MGM), resulting in the multi-grained (i.e., object-level and region-level) features of objects. Experimental results over the benchmark dataset demonstrate the significant performance gain of the proposed pipeline used as a plug-in for the state-of-the-art methods.

Abstract:
Group activity recognition aims to identify a consistent group activity from different actions performed by respective individuals. Most existing methods focus on learning the interaction between each two individuals (i.e., second-order interaction). In this work, we argue that the second-order interactive relation is insufficient to address this task. We propose a third-order active factor graph network, which models the third-order interaction in each pair of three active individuals. At first, to alleviate the noisy individual actions, we select active individuals by measuring each individual’s influence. The individuals with the top-k largest influence weights are selected as active individuals. Then, for each three-individuals pair, we build a new factor node and contact the factor node with these individual nodes. In other words, we extend the base second-order interactive graph to a new third-order interactive graph, which is defined as factor graph. Next, we design a two-branch factor graph network, in which one branch is to consider all individuals (denoted as full factor graph) and the other one takes the active individuals into consideration (denoted as active factor graph). We leverage both the active and full factor graphs comprehensively for group activity recognition. Besides, to enforce group consistency, a consistency-aware reasoning module is designed with two penalty terms, which describe the inconsistency between individual actions and group activity respectively. Extensive experiments demonstrate that our method achieves state-of-the-art performance on four benchmark datasets, i.e., Volleyball, Collective Activity, Collective Activity Extended, and SoccerNet-v3 datasets. Visualization results further validate the interpretability of our method.

Abstract:
Attributed to the development of deep networks and abundant data, automatic face recognition (FR) has quickly reached human-level capacity in the past few years. However, the FR problem is not perfectly solved in case of large poses and uncontrolled occlusions. In this paper, we propose a novel bypass enhanced representation learning (BERL) method to improve face recognition under unconstrained scenarios. The proposed method integrates self-supervised learning and supervised learning together by attaching two auxiliary bypasses, a 3D reconstruction bypass and a blind inpainting bypass, to assist robust feature learning for face recognition. Among them, the 3D reconstruction bypass enforces the face recognition network to encode pose independent 3D facial information, which enhances the robustness to various poses. The blind inpainting bypass enforces the face recognition network to capture more facial context information for face inpainting, which enhances the robustness to occlusions. The whole framework is trained in end-to-end manner with two self-supervised tasks above and the classic supervised face identification task. During inference, the two auxiliary bypasses can be detached from the face recognition network, avoiding any additional computational overhead. Extensive experimental results on various face recognition benchmarks show that, without any cost of extra annotations and computations, our method outperforms state-of-the-art methods. Moreover, the learnt representations can also well generalize to other face-related downstream tasks such as the facial attribute recognition with limited labeled data.

Abstract:
This paper presents a deep learning-based spectral demosaicing technique trained in an unsupervised manner. Many existing deep learning-based techniques relying on supervised learning with synthetic images, often underperform on real-world images, especially as the number of spectral bands increases. This paper presents a comprehensive unsupervised spectral demosaicing (USD) framework based on the characteristics of spectral mosaic images. This framework encompasses a training method, model structure, transformation strategy, and a well-fitted model selection strategy. To enable the network to dynamically model spectral correlation while maintaining a compact parameter space, we reduce the complexity and parameters of the spectral attention module. This is achieved by dividing the spectral attention tensor into spectral attention matrices in the spatial dimension and spectral attention vector in the channel dimension. This paper also presents Mosaic 25 , a real 25-band hyperspectral mosaic image dataset featuring various objects, illuminations, and materials for benchmarking purposes. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed method outperforms conventional unsupervised methods in terms of spatial distortion suppression, spectral fidelity, robustness, and computational cost. Our code and dataset are publicly available at https://github.com/polwork/Unsupervised-Spectral-Demosaicing.

Abstract:
Visual attention advances object detection by attending neural networks to object representations. While existing methods incorporate empirical modules to empower network attention, we rethink attentive object detection from the network learning perspective in this work. We propose a NEural Attention Learning approach (NEAL) which consists of two parts. During the back-propagation of each training iteration, we first calculate the partial derivatives (a.k.a. the accumulated gradients) of the classification output with respect to the input features. We refine these partial derivatives to obtain attention response maps whose elements reflect the contributions to the final network predictions. Then, we formulate the attention response maps as extra objective functions, which are combined together with the original detection loss to train detectors in an end-to-end manner. In this way, we succeed in learning an attentive CNN model without introducing additional network structures. We apply NEAL to the two-stage object detection frameworks, which are usually composed of a CNN feature backbone, a region proposal network (RPN), and a classifier. We show that the proposed NEAL not only helps the RPN attend to objects but also enables the classifier to pay more attention to the premier positive samples. To this end, the localization (proposal generation) and classification mutually benefit from each other in our proposed method. Extensive experiments on large-scale benchmark datasets, including MS COCO 2017 and Pascal VOC 2012, demonstrate that the proposed NEAL algorithm advances the two-stage object detector over state-of-the-art approaches.

Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, the Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, and Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province and Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei, China; School of Internet, Anhui University, Hefei, China; Anhui Province Key Laboratory of Electric Fire and Safety Protection, State Grid Anhui Electric Power Research Institute, Hefei, China

Abstract:
RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role in representing the target appearance in RGBT tracking. In this paper, we propose a novel approach, which performs target appearance representation disentanglement and interaction via both modality-shared and modality-specific challenge attributes, for robust RGBT tracking. In particular, we disentangle the target appearance representations via five challenge-based branches with different structures according to their properties, including three parameter-shared branches to model modality-shared challenges and two parameter-independent branches to model modality-specific challenges. Considering the complementary advantages between modality-specific cues, we propose a guidance interaction module to transfer discriminative features from one modality to another one to enhance the discriminative ability of weak modality. Moreover, we design an aggregation interaction module to combine all challenge-based target representations, which could form more discriminative target representations and fit the challenge-agnostic tracking process. These challenge-based branches are able to model the target appearance under certain challenges so that the target representations can be learned by a few parameters even in the situation of insufficient training data. In addition, to relieve labor costs and avoid label ambiguity, we design a generation strategy to generate training data with different challenge attributes. Comprehensive experiments demonstrate the superiority of the proposed tracker against the state-of-the-art methods on four benchmark datasets.

Abstract:
Generalized Zero-Shot Learning (GZSL) aims at recognizing images from both seen and unseen classes by constructing correspondences between visual images and semantic embedding. However, existing methods suffer from a strong bias problem, where unseen images in the target domain tend to be recognized as seen classes in the source domain. To address this issue, we propose a Prototype-augmented Self-supervised Generative Network by integrating self-supervised learning and prototype learning into a feature generating model for GZSL. The proposed model enjoys several advantages. First, we propose a Self-supervised Learning Module to exploit inter-domain relationships, where we introduce anchors as a bridge between seen and unseen categories. In the shared space, we pull the distribution of the target domain away from the source domain and obtain domain-aware features. To our best knowledge, this is the first work to introduce self-supervised learning into GZSL as learning guidance. Second, a Prototype Enhancing Module is proposed to utilize class prototypes to model reliable target domain distribution in finer granularity. In this module, a Prototype Alignment mechanism and a Prototype Dispersion mechanism are combined to guide the generation of better target class features with intra-class compactness and inter-class separability. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.

Abstract:
Recently, class incremental semantic segmentation (CISS) towards the practical open-world setting has attracted increasing research interest, which is mainly challenged by the well-known issue of catastrophic forgetting. Particularly, knowledge distillation (KD) techniques have been widely studied to alleviate catastrophic forgetting. Despite the promising performance, existing KD-based methods generally use the same distillation schemes for different intermediate layers to transfer old knowledge, while employing manually tuned and fixed trade-off weights to control the effect of KD. These KD-based methods take no consideration of feature characteristics from different intermediate layers, limiting the effectiveness of KD for CISS. In this paper, we propose a layer-specific knowledge distillation (LSKD) method to assign appropriate knowledge schemes and weights for various intermediate layers by considering feature characteristics, aiming to further explore the potential of KD in improving the performance of CISS. Specifically, we present a mask-guided distillation (MD) to alleviate the background shift on semantic features, which performs distillation by masking the features affected by the background. Furthermore, a mask-guided context distillation (MCD) is presented to explore global context information lying in high-level semantic features. Based on them, our LSKD assigns different distillation schemes according to feature characteristics. To adjust the effect of layer-specific distillation adaptively, LSKD introduces a regularized gradient equilibrium method to learn dynamic trade-off weights. Additionally, our LSKD makes an attempt to simultaneously learn distillation schemes and trade-off weights of different layers by developing a bi-level optimization method. Extensive experiments on widely used Pascal VOC 12 and ADE20K show our LSKD clearly outperforms its counterparts while achieving state-of-the-art results.

Abstract:
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably. Code is available at: https://github.com/zhengxyun/Surface-SOS.

Abstract:
3D point cloud registration is a crucial task in a variety of fields, including remote sensing mapping, computer vision, virtual reality, and autonomous driving. However, this task is still challenging due to the challenges of noise, non-uniformity, partial overlap, and repeated local features in large scene point clouds. In this paper, we propose an efficient single correspondence voting method for large scene point cloud registration. Specifically, we first propose an efficient hypothetical transformation prediction method called SCVC, which determines the 5 degrees of freedom of the transformation through one correspondence, and then uses Hough voting to determine the last degree of freedom. This algorithm can significantly improve the accuracy of registration in both indoor and outdoor scenes. On the other hand, we propose a more robust transformation verification function called VDIR, which can obtain the optimal registration result of two raw point clouds. Finally, we conduct a series of experiments that demonstrate that our method achieves state-of-the-art performance on four real-world datasets: 3DMatch, 3DLoMatch, KITTI, and WHU-TLS. Our code is available at https://github.com/xingxuejun1989/SCVC.

Abstract:
Scene Graph Generation (SGG) aims to detect all objects and identify their pairwise relationships in the scene. Recently, tremendous progress has been made in exploring better context relationship representations. Previous work mainly focuses on contextual information aggregation and uses de-biasing strategies on samples to eliminate the preference for head predicates. However, there remain challenges caused by indeterminate feature training. Overlooking the label confusion problem in feature training easily results in a messy feature distribution among the confused categories, thereby affecting the prediction of predicates. To alleviate the aforementioned problem, in this paper, we focus on enhancing predicate representation learning. Firstly, we propose a novel Adaptive Message Passing (AMP) network to dynamically conduct information propagation among neighbors. AMP provides discriminating representations for neighbor nodes under the view of de-noising and adaptive aggregation. Furthermore, we construct a feature-assisted training paradigm alongside the predicate classification branch, guiding predicate feature learning to the corresponding feature space. Moreover, to alleviate biased prediction caused by the long-tailed class distribution and the interference of confused labels, we design a Bi-level Curriculum learning scheme (BiC). The BiC separately considers the training from the feature learning and de-biasing levels, preserving discriminating representations of different predicates while resisting biased predictions. Results on multiple SGG datasets show that our proposed method AMP-BiC has superior comprehensive performance, demonstrating its effectiveness.

Abstract:
Human emotions contain both basic and compound facial expressions. In many practical scenarios, it is difficult to access all the compound expression categories at one time. In this paper, we investigate comprehensive facial expression recognition (FER) in the class-incremental learning paradigm, where we define well-studied and easily-accessible basic expressions as initial classes and learn new compound expressions incrementally. To alleviate the stability-plasticity dilemma in our incremental task, we propose a novel Relationship-Guided Knowledge Transfer (RGKT) method for class-incremental FER. Specifically, we develop a multi-region feature learning (MFL) module to extract fine-grained features for capturing subtle differences in expressions. Based on the MFL module, we further design a basic expression-oriented knowledge transfer (BET) module and a compound expression-oriented knowledge transfer (CET) module, by effectively exploiting the relationship across expressions. The BET module initializes the new compound expression classifiers based on expression relevance between basic and compound expressions, improving the plasticity of our model to learn new classes. The CET module transfers expression-generic knowledge learned from new compound expressions to enrich the feature set of old expressions, facilitating the stability of our model against forgetting old classes. Extensive experiments on three facial expression databases show that our method achieves superior performance in comparison with several state-of-the-art methods.

Abstract:
Online video streaming has fundamental limitations on the transmission bandwidth and computational capacity and super-resolution is a promising potential solution. However, applying existing video super-resolution methods to online streaming is non-trivial. Existing video codecs and streaming protocols (e.g., WebRTC) dynamically change the video quality both spatially and temporally, which leads to diverse and dynamic degradations. Furthermore, online streaming has a strict requirement for latency that most existing methods are less applicable. As a result, this paper focuses on the rarely exploited problem setting of online streaming video super resolution. To facilitate the research on this problem, a new benchmark dataset named LDV-WebRTC is constructed based on a real-world online streaming system. Leveraging the new benchmark dataset, we propose a novel method specifically for online video streaming, which contains a convolution and Look-Up Table (LUT) hybrid model to achieve better performance-latency trade-off. To tackle the changing degradations, we propose a mixture-of-expert-LUT module, where a set of LUT specialized in different degradations are built and adaptively combined to handle different degradations. Experiments show our method achieves 720P video SR around 100 FPS, while significantly outperforms existing LUT-based methods and offers competitive performance compared to efficient CNN-based methods. Code is available at https://github.com/quzefan/ConvLUT.

Affiliations: Center for Research on Intelligent Perception and Computing (CRIPAC), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Electrical and Data Engineering, University of Technology Sydney, Ultimo, NSW, Australia; College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, Haidian, China

Abstract:
Recent studies have seen significant advancements in the field of long-term person re-identification (LT-reID) through the use of clothing-irrelevant or insensitive features. This work takes the field a step further by addressing a previously unexplored issue, the Clothing Status Distribution Shift (CSDS). CSDS refers to the differing ratios of samples with clothing changes to those without clothing changes between the training and test sets, leading to a decline in LT-reID performance. We establish a connection between the performance of LT-reID and CSDS, and argue that addressing CSDS can improve LT-reID performance. To that end, we propose a novel framework called Meta Clothing Status Calibration (MCSC), which uses meta-learning to optimize the LT-reID model. Specifically, MCSC simulates CSDS between meta-train and meta-test with meta-optimization objectives, optimizing the LT-reID model and making it robust to CSDS. This framework is designed to prevent overfitting and improve the generalization ability of the LT-reID model in the presence of CSDS. Comprehensive evaluations on seven datasets demonstrate that the proposed MCSC framework effectively handles CSDS and improves current state-of-the-art LT-reID methods on several LT-reID benchmarks.

Abstract:
Existing human parsing frameworks commonly employ joint learning of semantic edge detection and human parsing to facilitate the localization around boundary regions. Nevertheless, the parsing prediction within the interior of the part contour may still exhibit inconsistencies due to the inherent ambiguity of fine-grained semantics. In contrast, binary edge detection does not suffer from such fine-grained semantic ambiguity, leading to a typical failure case where misclassification occurs inner the part contour while the semantic edge is accurately detected. To address these challenges, we develop a novel diffusion scheme that incorporates guidance from the detected semantic edge to mitigate this problem by propagating corrected classified semantics into the misclassified regions. Building upon this diffusion scheme, we present an Edge Guided Diffusion Network (EGDNet) for human parsing, which can progressively refine the parsing predictions to enhance the accuracy and coherence of human parsing results. Moreover, we design a horizontal-vertical aggregation to exploit inherent correlations among body parts along both the horizontal and vertical axes, which aims at enhancing the initial parsing results. Extensive experimental evaluations on various challenging datasets demonstrate the effectiveness of the proposed EGDNet. Remarkably, our EGDNet shows impressive performances on six benchmark datasets, including four human body parsing datasets (LIP, CIHP, ATR, and PASCAL-Person-Part), and two human face parsing datasets (CelebAMask-HQ and LaPa).

Abstract:
Existing deep learning methods for fine-grained visual recognition often rely on large-scale, well-annotated training data. Obtaining fine-grained annotations in the wild typically requires concentration and expertise, such as fine category annotation for species recognition, instance annotation for person re-identification (re-id) and dense annotation for segmentation, which inevitably leads to label noise. This paper aims to tackle label noise in deep model training for fine-grained visual recognition. We propose a Neighbor-Attention Label Correction (NALC) model to correct labels during the training stage. NALC samples a training batch and a validation batch from the training set. It hence leverages a meta-learning framework to correct labels in the training batch based on the validation batch. To enhance the optimization efficiency, we introduce a novel nested optimization algorithm for the meta-learning framework. The proposed training procedure consistently improves label accuracy in the training batch, consequently enhancing the learned image representation. Experimental results demonstrate that our method significantly increases label accuracy from 70% to over 98% and outperforms recent approaches by up to 13.4% in mean Average Precision (mAP) on various fine-grained image retrieval (FGIR) tasks, including instance retrieval on CUB200 and person re-id on Market1501. We also demonstrate the efficacy of NALC on noisy semantic segmentation datasets generated from Cityscapes, where it achieves a significant 7.8% improvement in mIOU score. NALC also exhibits robustness to different types of noise, including simulated noise such as Asymmetric, Pair-Flip, and Pattern noise, as well as practical noisy labels generated by tracklets and clustering.

Abstract:
Visual intention understanding is a challenging task that explores the hidden intention behind the images of publishers in social media. Visual intention represents implicit semantics, whose ambiguous definition inevitably leads to label shifting and label blemish. The former indicates that the same image delivers intention discrepancies under different data augmentations, while the latter represents that the label of intention data is susceptible to errors or omissions during the annotation process. This paper proposes a novel method, called Label-aware Calibration and Relation-preserving (LabCR) to alleviate the above two problems from both intra-sample and inter-sample views. First, we disentangle the multiple intentions into a single intention for explicit distribution calibration in terms of the overall and the individual. Calibrating the class probability distributions in augmented instance pairs provides consistent inferred intention to address label shifting. Second, we utilize the intention similarity to establish correlations among samples, which offers additional supervision signals to form correlation alignments in instance pairs. This strategy alleviates the effect of label blemish. Extensive experiments have validated the superiority of the proposed method LabCR in visual intention understanding and pedestrian attribute recognition. Code is available at https://github.com/ShiQingHongYa/LabCR.

Abstract:
Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires models to not only generate accurate answers but also to provide explanations that justify the relevant decision-making processes. This task is accomplished by generating natural language sentences based on the given question-image pair. However, existing methods often struggle to ensure consistency between the answers and explanations due to their disregard of the crucial interactions between these factors. Moreover, existing methods overlook the potential benefits of incorporating additional knowledge, which hinders their ability to effectively bridge the semantic gap between questions and images, leading to less accurate explanations. In this paper, we present a novel approach denoted the knowledge-based iterative consensus VQA-NLE (KICNLE) model to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, enabling multiple iterations of the answer and explanation in each generation. In each iteration, the current answer is utilized to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module is introduced to provide potentially valid candidate knowledge, guide the generation process, effectively bridge the gap between questions and images, and enable the production of high-quality answer-explanation pairs. Extensive experiments conducted on three different datasets demonstrate the superiority of our proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.

Abstract:
Anomaly detection is an important task for medical image analysis, which can alleviate the reliance of supervised methods on large labelled datasets. Most existing methods use a pixel-wise self-reconstruction framework for anomaly detection. However, there are two challenges of these studies: 1) they tend to overfit learning an identity mapping between the input and output, which leads to failure in detecting abnormal samples; 2) the reconstruction considers the pixel-wise differences which may lead to an undesirable result. To mitigate the above problems, we propose a novel heterogeneous Auto-Encoder (Hetero-AE) for medical anomaly detection. Our model utilizes a convolutional neural network (CNN) as the encoder and a hybrid CNN-Transformer network as the decoder. The heterogeneous structure enables the model to learn the intrinsic information of normal data and enlarge the difference on abnormal samples. To fully exploit the effectiveness of Transformer in the hybrid network, a multi-scale sparse Transformer block is proposed to trade off modelling long-range feature dependencies and high computational costs. Moreover, the multi-stage feature comparison is introduced to reduce the noise of pixel-wise comparison. Extensive experiments on four public datasets (i.e., retinal OCT, chest X-ray, brain MRI, and COVID-19) verify the effectiveness of our method on different imaging modalities for anomaly detection. Additionally, our method can accurately detect tumors in brain MRI and lesions in retinal OCT with interpretable heatmaps to locate lesion areas, assisting clinicians in diagnosing abnormalities efficiently.

Abstract:
In this work, we utilize the high-fidelity generation abilities of diffusion models to solve blind JPEG restoration at high compression levels. We propose an elegant modification of the forward stochastic differential equation of diffusion models to adapt them to this restoration task and name our method DriftRec. Comparing DriftRec against an L_2 regression baseline with the same network architecture and state-of-the-art techniques for JPEG restoration, we show that our approach can escape the tendency of other methods to generate blurry images, and recovers the distribution of clean images significantly more faithfully. For this, only a dataset of clean/corrupted image pairs and no knowledge about the corruption operation is required, enabling wider applicability to other restoration tasks. In contrast to other conditional and unconditional diffusion models, we utilize the idea that the distributions of clean and corrupted images are much closer to each other than each is to the usual Gaussian prior of the reverse process in diffusion models. Our approach therefore requires only low levels of added noise and needs comparatively few sampling steps even without further optimizations. We show that DriftRec naturally generalizes to realistic and difficult scenarios such as unaligned double JPEG compression and blind restoration of JPEGs found online, without having encountered such examples during training.

Affiliations: Fast School of Computing, National University of Computer and Emerging Sciences, Karachi, Pakistan; Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, U.K.; Institute for Data Science in Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Faculty of Computing and Information Technology, Sohar University, Sohar, Oman; Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria; School of Professional Education and Executive Development, The Hong Kong Polytechnic University, Hong Kong, Hong Kong; Center for Research in Computer Vision (CRCV), University of Central Florida, Orlando, FL, USA

Abstract:
Recently attention-based networks have been successful for image restoration tasks. However, existing methods are either computationally expensive or have limited receptive fields, adding constraints to the model. They are also less resilient in spatial and contextual aspects and lack pixel-to-pixel correspondence, which may degrade feature representations. In this paper, we propose a novel and computationally efficient architecture Single Stage Adaptive Multi-Attention Network (SSAMAN) for image restoration tasks, particularly for image denoising and image deblurring. SSAMAN efficiently addresses computational challenges and expands receptive fields, enhancing robustness in spatial and contextual feature representation. Its Adaptive Multi-Attention Module (AMAM), which consists of Adaptive Pixel Attention Branch (APAB) and an Adaptive Channel Attention Branch (ACAB), uniquely integrates channel and pixel-wise dimensions, significantly improving sensitivity to edges, shapes, and textures. We perform extensive experiments and ablation studies to validate the performance of SSAMAN. Our model shows state-of-the-art results on various benchmarks, for example, on image denoising tasks, SSAMAN achieves a notable 40.08 dB PSNR on SIDD dataset, outperforming Restormer by 0.06 dB PSNR, with 41.02% less computational cost, and achieves a 40.05 dB PSNR on the DND dataset. For image deblurring, SSAMAN achieves 33.53 dB PSNR on GoPro dataset. Code and models are available at Github.

Abstract:
In modern neuroscience, observing the dynamics of large populations of neurons is a critical step of understanding how networks of neurons process information. Light-field microscopy (LFM) has emerged as a type of scanless, high-speed, three-dimensional (3D) imaging tool, particularly attractive for this purpose. Imaging neuronal activity using LFM calls for the development of novel computational approaches that fully exploit domain knowledge embedded in physics and optics models, as well as enabling high interpretability and transparency. To this end, we propose a model-based explainable deep learning approach for LFM. Different from purely data-driven methods, the proposed approach integrates wave-optics theory, sparse representation and non-linear optimization with the artificial neural network. In particular, the architecture of the proposed neural network is designed following precise signal and optimization models. Moreover, the network’s parameters are learned from a training dataset using a novel training strategy that integrates layer-wise training with tailored knowledge distillation. Such design allows the network to take advantage of domain knowledge and learned new features. It combines the benefit of both model-based and learning-based methods, thereby contributing to superior interpretability, transparency and performance. By evaluating on both structural and functional LFM data obtained from scattering mammalian brain tissues, we demonstrate the capabilities of the proposed approach to achieve fast, robust 3D localization of neuron sources and accurate neural activity identification.

Abstract:
This paper tackles spectral reflectance recovery (SRR) from RGB images. Since capturing ground-truth spectral reflectance and camera spectral sensitivity are challenging and costly, most existing approaches are trained on synthetic images and utilize the same parameters for all unseen testing images, which are suboptimal especially when the trained models are tested on real images because they never exploit the internal information of the testing images. To address this issue, we adopt a self-supervised meta-auxiliary learning (MAXL) strategy that fine-tunes the well-trained network parameters with each testing image to combine external with internal information. To the best of our knowledge, this is the first work that successfully adapts the MAXL strategy to this problem. Instead of relying on naive end-to-end training, we also propose a novel architecture that integrates the physical relationship between the spectral reflectance and the corresponding RGB images into the network based on our mathematical analysis. Besides, since the spectral reflectance of a scene is independent to its illumination while the corresponding RGB images are not, we recover the spectral reflectance of a scene from its RGB images captured under multiple illuminations to further reduce the unknown. Qualitative and quantitative evaluations demonstrate the effectiveness of our proposed network and of the MAXL. Our code and data are available at https://github.com/Dong-Huo/SRR-MAXL.

Abstract:
Existing RGB-Thermal trackers usually treat intra-modal feature extraction and inter-modal feature fusion as two separate processes, therefore the mutual promotion of extraction and fusion is neglected. Then, the complementary advantages of RGB-T fusion are not fully exploited, and the independent feature extraction is not adaptive to modal quality fluctuation during tracking. To address the limitations, we design a joint-modality query fusion network, in which the intra-modal feature extraction and the inter-modal fusion are coupled together and promote each other via joint-modality queries. The queries are initialized based on the multimodal features of the current frame, making the subsequent fusion adaptive to modal quality fluctuation during tracking. Then the joint-modality query fusion (JQF) utilizes the queries to interact with RGB-T features, allowing the intra-modal enhancement and the inter-modal interactions to be unified for mutual promotion. In this way, JQF can distinguish and enhance the complementary modality features, while filtering out redundant information. For real-time tracking, we propose regional cross-attention for cross-modal interactions to reduce computational cost. Our end-to-end tracker sets a new state-of-the-art performance on multiple RGBT tracking benchmarks including LasHeR, VTUAV, RGBT234 and GTOT, while running at a real-time speed.

Abstract:
Person re-identification (ReID) typically encounters varying degrees of occlusion in real-world scenarios. While previous methods have addressed this using handcrafted partitions or external cues, they often compromise semantic information or increase network complexity. In this paper, we propose a new method from a novel perspective, termed as OAT. Specifically, we first use a Transformer backbone with multiple class tokens for diverse pedestrian feature learning. Given that the self-attention mechanism in the Transformer solely focuses on low-level feature correlations, neglecting higher-order relations among different body parts or regions. Thus, we propose the Second-Order Attention (SOA) module to capture more comprehensive features. To address computational efficiency, we further derive approximation formulations for implementing second-order attention. Observing that the importance of semantics associated with different class tokens varies due to the uncertainty of the location and size of occlusion, we propose the Entropy Guided Fusion (EGF) module for multiple class tokens. By conducting uncertainty analysis on each class token, higher weights are assigned to those with lower information entropy, while lower weights are assigned to class tokens with higher entropy. The dynamic weight adjustment can mitigate the impact of occlusion-induced uncertainty on feature learning, thereby facilitating the acquisition of discriminative class token representations. Extensive experiments have been conducted on occluded and holistic person re-identification datasets, which demonstrate the effectiveness of our proposed method.

Abstract:
Recently, action recognition has attracted considerable attention in the field of computer vision. In dynamic circumstances and complicated backgrounds, there are some problems, such as object occlusion, insufficient light, and weak correlation of human body joints, resulting in skeleton-based human action recognition accuracy being very low. To address this issue, we propose a Multi-View Time-Series Hypergraph Neural Network (MV-TSHGNN) method. The framework is composed of two main parts: the construction of a multi-view time-series hypergraph structure and the learning process of multi-view time-series hypergraph convolutions. Specifically, given the multi-view video sequence frames, we first extract the joint features of actions from different views. Then, limb components and adjacent joints spatial hypergraphs based on the joints of different views at the same time are constructed respectively, temporal hypergraphs are constructed joints of the same view at continuous times, which are established high-order semantic relationships and cooperatively generate complementary action features. After that, we design a multi-view time-series hypergraph neural network to efficiently learn the features of spatial and temporal hypergraphs, and effectively improve the accuracy of skeleton-based action recognition. To evaluate the effectiveness and efficiency of MV-TSHGNN, we conduct experiments on NTU RGB+D, NTU RGB+D 120 and imitating traffic police gestures datasets. The experimental results indicate that our proposed method model achieves the new state-of-the-art performance.

Abstract:
Traditional block-based spatially scalable video coding has been studied for over twenty years. While significant advancements have been made, the scope for further improvement in compression performance is limited. Inspired by the success of learned video coding, we propose an end-to-end learned spatially scalable video coding scheme, LSSVC, which provides a new solution for scalable video coding. In LSSVC, we propose to use the motion, texture, and latent information of the base layer (BL) as interlayer information for compressing the enhancement layer (EL). To reduce interlayer redundancy, we design three modules to leverage the upsampled interlayer information. Firstly, we design a contextual motion vector (MV) encoder-decoder, which utilizes the upsampled BL motion information to help compress high-resolution MV. Secondly, we design a hybrid temporal-layer context mining module to learn more accurate contexts from the EL temporal features and the upsampled BL texture information. Thirdly, we use the upsampled BL latent information as an interlayer prior for the entropy model to estimate more accurate probability distribution parameters for the high-resolution latents. Experimental results show that our scheme surpasses H.265/SHVC reference software by a large margin. Our code is available at https://github.com/EsakaK/LSSVC.

Abstract:
Since hand-print recognition, i.e., palmprint, finger-knuckle-print (FKP), and hand-vein, have significant superiority in user convenience and hygiene, it has attracted greater enthusiasm from researchers. Seeking to handle the long-standing interference factors, i.e., noise, rotation, shadow, in hand-print images, multi-view hand-print representation has been proposed to enhance the feature expression by exploiting multiple characteristics from diverse views. However, the existing methods usually ignore the high-order correlations between different views or fuse very limited types of features. To tackle these issues, in this paper, we present a novel tensorized multi-view low-rank approximation based robust hand-print recognition method (TMLA_RHR), which can dexterously manipulate the multi-view hand-print features to produce a high-compact feature representation. To achieve this goal, we formulate TMLA_RHR by two key components, i.e., aligned structure regression loss and tensorized low-rank approximation, in a joint learning model. Specifically, we treat the low-rank representation matrices of different views as a tensor, which is regularized with a low-rank constraint. It models the across information between different views and reduces the redundancy of the learned sub-space representations. Experimental results on eight real-world hand-print databases prove the superiority of the proposed method in comparison with other state-of-the-art related works.

Abstract:
With the benefit of deep learning techniques, recent researches have made significant progress in image compression artifacts reduction. Despite their improved performances, prevailing methods only focus on learning a mapping from the compressed image to the original one but ignore the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Different from these methods, we propose to decouple the intrinsic attributes into two complementary features for artifacts reduction, i.e., the compression-insensitive features to regularize the high-level semantic representations during training and the compression-sensitive features to be aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features for retaining high-level semantics, and we then develop the compression quality-aware feature encoder for compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) to utilize these awareness features as transformation guidance during the decoding phase. In our proposed DAGN, we develop a cross-feature fusion module to maintain the consistency of compression-insensitive features by fusing compression-insensitive features into the artifacts reduction baseline. Our method achieves an average 2.06 dB PSNR gains on BSD500, outperforming state-of-the-art methods, and only requires 29.7 ms to process one image on BSD500. Besides, the experimental results on LIVE1 and LIU4K also demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.

Abstract:
Hashing has received significant interest in large-scale data retrieval due to its outstanding computational efficiency. Of late, numerous deep hashing approaches have emerged, which have obtained impressive performance. However, these approaches can contain ethical risks during image retrieval. To address this, we are the first to study the problem of group fairness within learning to hash and introduce a novel method termed Fairness-aware Hashing with Mixture of Experts (FATE). Specifically, FATE leverages the mixture-of-experts framework as the hashing network, where each expert contributes knowledge from an individual viewpoint, followed by aggregation using the gating mechanism. This strongly enhances the model capability, facilitating the generation of both discriminative and unbiased binary descriptors. We also incorporate fairness-aware contrastive learning, combining sensitive labels with feature similarities to ensure unbiased hash code learning. Furthermore, an adversarial learning objective condition on both deep features and hash codes is employed to further eliminate group biases. Extensive experiments on several benchmark datasets validate the superiority of the proposed FATE compared with various state-of-the-art approaches.

Abstract:
Unconstrained palmprint images have shown great potential for recognition applications due to their lower restrictions regarding hand poses and backgrounds during contactless image acquisition. However, they face two challenges: 1) unclear palm contours and finger-valley points of unconstrained palmprint images make it difficult to locate landmarks to crop the palmprint region of interest (ROI); and 2) large intra-class diversities of unconstrained palmprint images hinder the learning of intra-class-invariant palmprint features. In this paper, we propose to directly extract the complete palmprint region as the ROI (CROI) using the detection-style CenterNet without requiring the detection of any landmarks, and large intra-class diversities may occur. To address this, we further propose a palmprint feature alignment and learning hybrid network (PalmALNet) for unconstrained palmprint recognition. Specifically, we first exploit and align the multi-scale shallow representation of unconstrained palmprint images via deformable convolution and alignment-aware supervision, such that the pixel gaps of the intra-class palmprint CROIs can be minimized in shallow feature space. Then, we develop multiple triple-attention learning modules by integrating spatial, channel, and self-attention operations into convolution to adaptively learn and highlight the latent identity-invariant palmprint information, enhancing the overall discriminative power of the palmprint features. Extensive experimental results on four challenging palmprint databases demonstrate the promising effectiveness of both the proposed PalmALNet and CROI for unconstrained palmprint recognition.

Abstract:
Reshaping, a point operation that alters the characteristics of signals, has been shown capable of improving the compression ratio in video coding practices. Out-of-loop reshaping that directly modifies the input video signal was first adopted as the supplemental enhancement information (SEI) for the HEVC/H.265 without the need to alter the core design of the video codec. VVC/H.266 further improves the coding efficiency by adopting in-loop reshaping that modifies the residual signal being processed in the hybrid coding loop. In this paper, we theoretically analyze the rate-distortion performance of the in-loop reshaping and use experiments to verify the theoretical result. We prove that the in-loop reshaping can improve coding efficiency when the entropy coder adopted in the coding pipeline is suboptimal, which is in line with the practical scenarios that video codecs operate in. We derive the PSNR gain in a closed form and show that the theoretically predicted gain is consistent with that measured from experiments using standard testing video sequences.

Abstract:
Night-time scene parsing aims to extract pixel-level semantic information in night images, aiding downstream tasks in understanding scene object distribution. Due to limited labeled night image datasets, unsupervised domain adaptation (UDA) has become the predominant method for studying night scenes. UDA typically relies on paired day-night image pairs to guide adaptation, but this approach hampers dataset construction and restricts generalization across night scenes in different datasets. Moreover, UDA, focusing on network architecture and training strategies, faces difficulties in handling classes with few domain similarities. In this paper, we leverage Prompt Images Guidance (PIG) to enhance UDA with supplementary night knowledge. We propose a Night-Focused Network (NFNet) to learn night-specific features from both target domain images and prompt images. To generate high-quality pseudo-labels, we propose Pseudo-label Fusion via Domain Similarity Guidance (FDSG). Classes with fewer domain similarities are predicted by NFNet, which excels in parsing night features, while classes with more domain similarities are predicted by UDA, which has rich labeled semantics. Additionally, we propose two data augmentation strategies: the Prompt Mixture Strategy (PMS) and the Alternate Mask Strategy (AMS), aimed at mitigating the overfitting of the NFNet to a few prompt images. We conduct extensive experiments on four night-time datasets: NightCity, NightCity+, Dark Zurich, and ACDC. The results indicate that utilizing PIG can enhance the parsing accuracy of UDA. The code is available at https://github.com/qiurui4shu/PIG.

Abstract:
Monocular depth estimation (MDE) is a fundamental task in computer vision and has drawn increasing attention. Recently, some methods reformulate it as a classification-regression task to boost the model performance, where continuous depth is estimated via a linear combination of predicted probability distributions and discrete bins. In this paper, we present a novel framework called BinsFormer, tailored for the classification-regression-based depth estimation. It mainly focuses on two crucial components in the specific task: 1) proper generation of adaptive bins; and 2) sufficient interaction between probability distribution and bins predictions. To specify, we employ a Transformer decoder to generate bins, novelly viewing it as a direct set-to-set prediction problem. We further integrate a multi-scale decoder structure to achieve a comprehensive understanding of spatial geometry information and estimate depth maps in a coarse-to-fine manner. Moreover, an extra scene understanding query is proposed to improve the estimation accuracy, which turns out that models can implicitly learn useful information from the auxiliary environment classification task. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that BinsFormer surpasses state-of-the-art MDE methods with prominent margins. Code and pretrained models are made publicly available at https://github.com/zhyever/ Monocular-Depth-Estimation-Toolbox/tree/main/configs/ binsformer.

Abstract:
Temporal action localization (TAL) has drawn much attention in recent years, however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of video features to consider different views of video features. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of original video features, 2) it generates masked features with indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features into shared action detectors respectively, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning more and better views of videos. In the testing stage, the MFAM can be removed, which does not bring extra computational costs. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves the state-of-the-art performances.

Abstract:
In this study, we propose a modeling-based compression approach for dense/lenslet light field images captured by Plenoptic 2.0 with square microlenses. This method employs the 5-D Epanechnikov Kernel (5-D EK) and its associated theories. Owing to the limitations of modeling larger image block using the Epanechnikov Mixture Regression (EMR), a 5-D Epanechnikov Mixture-of-Experts using Gaussian Initialization (5-D EMoE-GI) is proposed. This approach outperforms 5-D Gaussian Mixture Regression (5-D GMR). The modeling aspect of our coding framework utilizes the entire EI and the 5D Adaptive Model Selection (5-D AMLS) algorithm. The experimental results demonstrate that the decoded rendered images produced by our method are perceptually superior, outperforming High Efficiency Video Coding (HEVC) and JPEG 2000 at a bit depth below 0.06bpp.

Abstract:
Few-shot learning (FSL) aims at recognizing a novel object under limited training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of the FSL model. However, training an effective backbone is a challenging issue since 1) designing and validating structures of backbones are time-consuming and expensive processes, and 2) a backbone trained on the known (base) categories is more inclined to focus on the textures of the objects it learns, which is hard to describe the novel samples. To solve these problems, we propose a feature mixture operation on the pre-trained (fixed) features: 1) We replace a part of the values of the feature map from a novel category with the content of other feature maps to increase the generalizability and diversity of training samples, which avoids retraining a complex backbone with high computational costs. 2) We use the similarities between the features to constrain the mixture operation, which helps the classifier focus on the representations of the novel object where these representations are hidden in the features from the pre-trained backbone with biased training. Experimental studies on five benchmark datasets in both inductive and transductive settings demonstrate the effectiveness of our feature mixture (FM). Specifically, compared with the baseline on the Mini-ImageNet dataset, it achieves 3.8% and 4.2% accuracy improvements for 1 and 5 training samples, respectively. Additionally, the proposed mixture operation can be used to improve other existing FSL methods based on backbone training.

Abstract:
Image stitching is a critical task in panorama perception that involves combining images captured from different viewing positions to reconstruct a wider field-of-view (FOV) image. Existing visible image stitching methods suffer from performance drops under severe conditions since environmental factors can easily impair visible images. In contrast, infrared images possess greater penetrating ability and are less affected by environmental factors. Therefore, we propose an infrared and visible image-based multispectral image stitching method to achieve all-weather, broad FOV scene perception. Specifically, based on two pairs of infrared and visible images, we employ the salient structural information from the infrared images and the textual details from the visible images to infer the correspondences within different modality-specific features. For this purpose, a multiscale progressive mechanism coupled with quadrature correlation is exploited to improve regression in different modalities. Exploiting the complementary properties, accurate and credible homography can be obtained by integrating the deformation parameters of the two modalities to compensate for the missing modality-specific information. A global-aware guided reconstruction module is established to generate an informative and broad scene, wherein the attentive features of different viewpoints are introduced to fuse the source images with a more seamless and comprehensive appearance. We construct a high-quality infrared and visible stitching dataset for evaluation, including real-world and synthetic sets. The qualitative and quantitative results demonstrate that the proposed method outperforms the intuitive cascaded fusion-stitching procedure, achieving more robust and credible panorama generation. Code and dataset are available at https://github.com/Jzy2017/MSGA.

Abstract:
In RGB-T tracking, there exist rich spatial relationships between the target and backgrounds within multi-modal data as well as sound consistencies of spatial relationships among successive frames, which are crucial for boosting the tracking performance. However, most existing RGB-T trackers overlook such multi-modal spatial relationships and temporal consistencies within RGB-T videos, hindering them from robust tracking and practical applications in complex scenarios. In this paper, we propose a novel Multi-modal Spatial-Temporal Context (MMSTC) network for RGB-T tracking, which employs a Transformer architecture for the construction of reliable multi-modal spatial context information and the effective propagation of temporal context information. Specifically, a Multi-modal Transformer Encoder (MMTE) is designed to achieve the encoding of reliable multi-modal spatial contexts as well as the fusion of multi-modal features. Furthermore, a Quality-aware Transformer Decoder (QATD) is proposed to effectively propagate the tracking cues from historical frames to the current frame, which facilitates the object searching process. Moreover, the proposed MMSTC network can be easily extended to various tracking frameworks. New state-of-the-art results on five prevalent RGB-T tracking benchmarks demonstrate the superiorities of our proposed trackers over existing ones.

Abstract:
Learning invariant representations via contrastive learning has seen state-of-the-art performance in domain generalization (DG). Despite such success, in this paper, we find that its core learning strategy – feature alignment – could heavily hinder model generalization. Drawing insights in neuron interpretability, we characterize this problem from a neuron activation view. Specifically, by treating feature elements as neuron activation states, we show that conventional alignment methods tend to deteriorate the diversity of learned invariant features, as they indiscriminately minimize all neuron activation differences. This instead ignores rich relations among neurons – many of them often identify the same visual concepts despite differing activation patterns. With this finding, we present a simple yet effective approach, Concept Contrast (CoCo), which relaxes element-wise feature alignments by contrasting high-level concepts encoded in neurons. Our CoCo performs in a plug-and-play fashion, thus it can be integrated into any contrastive method in DG. We evaluate CoCo over four canonical contrastive methods, showing that CoCo promotes the diversity of feature representations and consistently improves model generalization capability. By decoupling this success through neuron coverage analysis, we further find that CoCo potentially invokes more meaningful neurons during training, thereby improving model learning.

Abstract:
Fine-grained visual classification aims to classify similar sub-categories with the challenges of large variations within the same sub-category and high visual similarities between different sub-categories. Recently, methods that extract semantic parts of the discriminative regions have attracted increasing attention. However, most existing methods extract the part features via rectangular bounding boxes by object detection module or attention mechanism, which makes it difficult to capture the rich shape information of objects. In this paper, we propose a novel Multi-Granularity Part Sampling Attention (MPSA) network for fine-grained visual classification. First, a novel multi-granularity part retrospect block is designed to extract the part information of different scales and enhance the high-level feature representation with discriminative part features of different granularities. Then, to extract part features of various shapes at each granularity, we propose part sampling attention, which can sample the implicit semantic parts on the feature maps comprehensively. The proposed part sampling attention not only considers the importance of sampled parts but also adopts the part dropout to reduce the overfitting issue. In addition, we propose a novel multi-granularity fusion method to highlight the foreground features and suppress the background noises with the assistance of the gradient class activation map. Experimental results demonstrate that the proposed MPSA achieves state-of-the-art performance on four commonly used fine-grained visual classification benchmarks. The source code is publicly available at https://github.com/mobulan/MPSA.

Abstract:
High-quality panoramic images with a Field of View (FoV) of 360° are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components. This disqualifies their usage in many mobile and wearable applications where thin and portable, minimalist imaging systems are desired. In this paper, we propose a Panoramic Computational Imaging Engine (PCIE) to achieve minimalist and high-quality panoramic imaging. With less than three spherical lenses, a Minimalist Panoramic Imaging Prototype (MPIP) is constructed based on the design of the Panoramic Annular Lens (PAL), but with low-quality imaging results due to aberrations and small image plane size. We propose two pipelines, i.e. Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC), to solve the image quality problems of MPIP, with imaging sensors of small and large pixel size, respectively. To leverage the prior information of the optical system, we propose a Point Spread Function (PSF) representation method to produce a PSF map as an additional modality. A PSF-aware Aberration-image Recovery Transformer (PART) is designed as a universal network for the two pipelines, in which the self-attention calculation and feature extraction are guided by the PSF map. We train PART on synthetic image pairs from simulation and put forward the PALHQ dataset to fill the gap of real-world high-quality PAL images for low-level vision. A comprehensive variety of experiments on synthetic and real-world benchmarks demonstrates the impressive imaging results of PCIE and the effectiveness of the PSF representation. We further deliver heuristic experimental findings for minimalist and high-quality panoramic imaging, in terms of the choices of prototype and pipeline, network architecture, training strategies, and dataset construction. Our dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART.

Abstract:
The spiking neural networks (SNNs) that efficiently encode temporal sequences have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with transformers (float-point sequences) to jointly explore the temporal-semantic information still facing challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverage the temporal and semantic information from different time steps to generate robust representations. The time-step factor (TSF) is introduced to dynamically synthesis the subsequent inference information. To guide the formation of input membrane potentials and reduce the spike noise, we propose a global-local pooling (GLP) which combines the max and average pooling operations. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers are difficult due to the increased number of parameters in a straightforward bilinear model. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach in achieving state-of-the-art performance in three benchmark datasets. The harmonic mean (HM) improvement of VGGSound, UCF101 and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.

Abstract:
Unsupervised domain adaptation medical image segmentation is aimed to segment unlabeled target domain images with labeled source domain images. However, different medical imaging modalities lead to large domain shift between their images, in which well-trained models from one imaging modality often fail to segment images from anothor imaging modality. In this paper, to mitigate domain shift between source domain and target domain, a style consistency unsupervised domain adaptation image segmentation method is proposed. First, a local phase-enhanced style fusion method is designed to mitigate domain shift and produce locally enhanced organs of interest. Second, a phase consistency discriminator is constructed to distinguish the phase consistency of domain-invariant features between source domain and target domain, so as to enhance the disentanglement of the domain-invariant and style encoders and removal of domain-specific features from the domain-invariant encoder. Third, a style consistency estimation method is proposed to obtain inconsistency maps from intermediate synthesized target domain images with different styles to measure the difficult regions, mitigate domain shift between synthesized target domain images and real target domain images, and improve the integrity of interested organs. Fourth, style consistency entropy is defined for target domain images to further improve the integrity of the interested organ by the concentration on the inconsistent regions. Comprehensive experiments have been performed with an in-house dataset and a publicly available dataset. The experimental results have demonstrated the superiority of our framework over state-of-the-art methods.

Abstract:
Class incremental learning (CIL) aims to incrementally update a trained model with the new classes of samples (plasticity) while retaining previously learned ability (stability). To address the most challenging issue in this goal, i.e., catastrophic forgetting, the mainstream paradigm is memory-replay CIL, which consolidates old knowledge by replaying a small number of old classes of samples saved in the memory. Despite effectiveness, the inherent destruction-reconstruction dynamics in memory-replay CIL are an intrinsic limitation: if the old knowledge is severely destructed, it will be quite hard to reconstruct the lossless counterpart. Our theoretical analysis shows that the destruction of old knowledge can be effectively alleviated by balancing the contribution of samples from the current phase and those saved in the memory. Motivated by this theoretical finding, we propose a novel Balanced Destruction-Reconstruction module (BDR) for memory-replay CIL, which can achieve better knowledge reconstruction by reducing the degree of maximal destruction of old knowledge. Specifically, to achieve a better balance between old knowledge and new classes, the proposed BDR module takes into account two factors: the variance in training status across different classes and the quantity imbalance of samples from the current phase and memory. By dynamically manipulating the gradient during training based on these factors, BDR can effectively alleviate knowledge destruction and improve knowledge reconstruction. Extensive experiments on a range of CIL benchmarks have shown that as a lightweight plug-and-play module, BDR can significantly improve the performance of existing state-of-the-art methods with good generalization. Our code is publicly available here.

Abstract:
Online video super-resolution (online-VSR) highly relies on an effective alignment module to aggregate temporal information, while the strict latency requirement makes accurate and efficient alignment very challenging. Though much progress has been achieved, most of the existing online-VSR methods estimate the motion fields of each frame separately to perform alignment, which is computationally redundant and ignores the fact that the motion fields of adjacent frames are correlated. In this work, we propose an efficient Temporal Motion Propagation (TMP) method, which leverages the continuity of motion field to achieve fast pixel-level alignment among consecutive frames. Specifically, we first propagate the offsets from previous frames to the current frame, and then refine them in the neighborhood, significantly reducing the matching space and speeding up the offset estimation process. Furthermore, to enhance the robustness of alignment, we perform spatial-wise weighting on the warped features, where the positions with more precise offsets are assigned higher importance. Experiments on benchmark datasets demonstrate that the proposed TMP method achieves leading online-VSR accuracy as well as inference speed. The source code of TMP can be found at https://github.com/xtudbxk/TMP.

Abstract:
Unsupervised person re-identification (Re-ID) is challenging due to the lack of ground truth labels. Most existing methods employ iterative clustering to generate pseudo labels for unlabeled training data to guide the learning process. However, how to select samples that are both associated with high-confidence pseudo labels and hard (discriminative) enough remains a critical problem. To address this issue, a disentangled sample guidance learning (DSGL) method is proposed for unsupervised Re-ID. The method consists of disentangled sample mining (DSM) and discriminative feature learning (DFL). DSM disentangles (unlabeled) person images into identity-relevant and identity-irrelevant factors, which are used to construct disentangled positive/negative groups that contain discriminative enough information. DFL incorporates the mined disentangled sample groups into model training by a surrogate disentangled learning loss and a disentangled second-order similarity regularization, to help the model better distinguish the characteristics of different persons. By using the DSGL training strategy, the mAP on Market-1501 and MSMT17 increases by 6.6% and 10.1% when applying the ResNet50 framework, and by 0.6% and 6.9% with the vision transformer (VIT) framework, respectively, validating the effectiveness of the DSGL method. Moreover, DSGL surpasses previous state-of-the-art methods by achieving higher Top-1 accuracy and mAP on the Market-1501, MSMT17, PersonX, and VeRi-776 datasets. The source code for this paper is available at https://github.com/jihaoxuanye/DiseSGL.

Abstract:
Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through distinct branch types and output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each modules as pixel-level results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We have released our source code, pretrained models, and demos to facilitate future studies on https://github.com/cjm-sfw/Uniparser.

Abstract:
In recent years, many single hyperspectral image super-resolution methods have emerged to enhance the spatial resolution of hyperspectral images without hardware modification. However, existing methods typically face two significant challenges. First, they struggle to handle the high-dimensional nature of hyperspectral data, which often results in high computational complexity and inefficient information utilization. Second, they have not fully leveraged the abundant spectral information in hyperspectral images. To address these challenges, we propose a novel hyperspectral super-resolution network named SNLSR, which transfers the super-resolution problem into the abundance domain. Our SNLSR leverages a spatial preserve decomposition network to estimate the abundance representations of the input hyperspectral image. Notably, the network acknowledges and utilizes the commonly overlooked spatial correlations of hyperspectral images, leading to better reconstruction performance. Then, the estimated low-resolution abundance is super-resolved through a spatial spectral attention network, where the informative features from both spatial and spectral domains are fully exploited. Considering that the hyperspectral image is highly spectrally correlated, we customize a spectral-wise non-local attention module to mine similar pixels along spectral dimension for high-frequency detail recovery. Extensive experiments demonstrate the superiority of our method over other state-of-the-art methods both visually and metrically. Our code is publicly available at https://github.com/HuQ1an/SNLSR.

Abstract:
Color quantization reduces the number of colors used in an image while preserving its content, which is essential in pixel art and knitting art creation. Traditional methods primarily focus on visual fidelity and treat it as a clustering problem in the RGB space. While effective in large (5-6 bits) color spaces, these approaches cannot guarantee semantics in small (1-2 bits) color spaces. On the other hand, deep color quantization methods use network viewers such as AlexNet and ResNet for supervision, effectively preserving semantics in small color spaces. However, in large color spaces, they lag behind traditional methods in terms of visual fidelity. In this work, we propose ColorCNN+, a novel approach that combines the strengths of both. It uses network viewer signals for supervision in small color spaces and learns to cluster the colors in large color spaces. Noteworthily, it is non-trivial for neural networks to do clustering, where existing deep clustering methods often need K-means to cluster the features. In this work, through a newly introduced cluster imitation loss, ColorCNN+ learns to directly output the cluster assignment without any additional steps. Furthermore, ColorCNN+ supports multiple color space sizes and network viewers, offering scalability and easy deployment. Experimental results demonstrate competitive performance of ColorCNN+ across various settings. Code is available at link.

Abstract:
Multi-view clustering usually attempts to improve the final performance by integrating graph structure information from different views and methods based on anchor are presented to reduce the computation cost for datasets with large scales. Despite significant progress, these methods pay few attentions to ensuring that the cluster structure correspondence between anchor graph and partition is built on multi-view datasets. Besides, they ignore to discover the anchor graph depicting the shared cluster assignment across views under the orthogonal constraint on actual bases in factorization. In this paper, we propose a novel Dual consensus Anchor Learning for Fast multi-view clustering (DALF) method, where the cluster structure correspondence between anchor graph and partition is guaranteed on multi-view datasets with large scales. It jointly learns anchors, constructs anchor graph and performs partition under a unified framework with the rank constraint imposed on the built Laplacian graph and the orthogonal constraint on the centroid representation. DALF simultaneously focuses on the cluster structure in the anchor graph and partition. The final cluster structure is simultaneously shown in the anchor graph and partition. We introduce the orthogonal constraint on the centroid representation in anchor graph factorization and the cluster assignment is directly constructed, where the cluster structure is shown in the partition. We present an iterative algorithm for solving the formulated problem. Extensive experiments demonstrate the effectiveness and efficiency of DALF on different multi-view datasets compared with other methods.

Abstract:
Event cameras respond to temporal dynamics, helping to resolve ambiguities in spatio-temporal changes for optical flow estimation. However, the unique spatio-temporal event distribution challenges the feature extraction, and the direct construction of motion representation through the orthogonal view is less than ideal due to the entanglement of appearance and motion. This paper proposes to transform the orthogonal view into a motion-dependent one for enhancing event-based motion representation and presents a Motion View-based Network (MV-Net) for practical optical flow estimation. Specifically, this motion-dependent view transformation is achieved through the Event View Transformation Module, which captures the relationship between the steepest temporal changes and motion direction, incorporating these temporal cues into the view transformation process for feature gathering. This module includes two phases: extracting the temporal evolution clues by central difference operation in the extraction phase and capturing the motion pattern by evolution-guided deformable convolution in the perception phase. Besides, the MV-Net constructs an eccentric downsampling process to avoid response weakening from the sparsity of events in the downsampling stage. The whole network is trained end-to-end in a self-supervised manner, and the evaluations conducted on four challenging datasets reveal the superior performance of the proposed model compared to state-of-the-art (SOTA) methods.

Abstract:
Vision transformer has demonstrated great potential in abundant vision tasks. However, it also inevitably suffers from poor generalization capability when the distribution shift occurs in testing (i.e., out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). Particularly, we study the attention module in the vision transformer and notice that the alignment space using one global class token lacks enough flexibility, where it interacts information with all image tokens in the same manner but ignores the rich semantics of different regions. In this paper, we aim to improve the richness of the alignment features by enabling semantic-aware adaptive message broadcasting. Particularly, we introduce a group of learned group tokens as nodes to aggregate the global information from all image tokens, but encourage different group tokens to adaptively focus on the message broadcasting to different semantic regions. In this way, our message broadcasting encourages the group tokens to learn more informative and diverse information for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label based self-training (PST) on UDA. We find that one simple two-stage training strategy with the cooperation of ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our methods for UDA.

Abstract:
Existing methods have demonstrated effective performance on a single degradation type. In practical applications, however, the degradation is often unknown, and the mismatch between the model and the degradation will result in a severe performance drop. In this paper, we propose an all-in-one image restoration network that tackles multiple degradations. Due to the heterogeneous nature of different types of degradations, it is difficult to process multiple degradations in a single network. To this end, we propose to learn a neural degradation representation (NDR) that captures the underlying characteristics of various degradations. The learned NDR adaptively decomposes different types of degradations, similar to a neural dictionary that represents basic degradation components. Subsequently, we develop a degradation query module and a degradation injection module to effectively approximate and utilize the specific degradation based on NDR, enabling the all-in-one restoration ability for multiple degradations. Moreover, we propose a bidirectional optimization strategy to effectively drive NDR to learn the degradation representation by optimizing the degradation and restoration processes alternately. Comprehensive experiments on representative types of degradations (including noise, haze, rain, and downsampling) demonstrate the effectiveness and generalizability of our method. Code is available at https://github.com/mdyao/NDR-Restore.

Abstract:
Few-shot object detection (FSOD) identifies objects from extremely few annotated samples. Most existing FSOD methods, recently, apply the two-stage learning paradigm, which transfers the knowledge learned from abundant base classes to assist the few-shot detectors by learning the global features. However, such existing FSOD approaches seldom consider the localization of objects from local to global. Limited by the scarce training data in FSOD, the training samples of novel classes typically capture part of objects, resulting in such FSOD methods being unable to detect the completely unseen object during testing. To tackle this problem, we propose an Extensible Co-Existing Attention (ECEA) module to enable the model to infer the global object according to the local parts. Specifically, we first devise an extensible attention mechanism that starts with a local region and extends attention to co-existing regions that are similar and adjacent to the given local region. We then implement the extensible attention mechanism in different feature scales to progressively discover the full object in various receptive fields. In the training process, the model learns the extensible ability on the base stage with abundant samples and transfers it to the novel stage of continuous extensible learning, which can assist the few-shot model to quickly adapt in extending local regions to co-existing regions. Extensive experiments on the PASCAL VOC and COCO datasets show that our ECEA module can assist the few-shot detector to completely predict the object despite some regions failing to appear in the training samples and achieve the new state-of-the-art compared with existing FSOD methods. Code is released at https://github.com/zhimengXin/ECEA.

Abstract:
Domain generalization (DG) aims to enhance the model robustness against domain shifts without accessing target domains. A prevalent category of methods for DG is data augmentation, which focuses on generating virtual samples to simulate domain shifts. However, existing augmentation techniques in DG are mainly tailored for convolutional neural networks (CNNs), with limited exploration in token-based architectures, i.e., vision transformer (ViT) and multi-layer perceptrons (MLP) models. In this paper, we study the impact of prior CNN-based augmentation methods on token-based models, revealing their performance is suboptimal due to the lack of incentivizing the model to learn holistic shape information. To tackle the issue, we propose the Semantic-aware Edge-guided Token Augmentation (SETA) method. SETA transforms token features by perturbing local edge cues while preserving global shape features, thereby enhancing the model learning of shape information. To further enhance the generalization ability of the model, we introduce two stylized variants of our method combined with two state-of-the-art (SOTA) style augmentation methods in DG. We provide a theoretical insight into our method, demonstrating its effectiveness in reducing the generalization risk bound. Comprehensive experiments on five benchmarks prove that our method achieves SOTA performances across various ViT and MLP architectures. Our code is available at https://github.com/lingeringlight/SETA.

Abstract:
Blood flow observation is of high interest in cardiovascular disease diagnosis and assessment. For this purpose, 2D Phase-Contrast MRI is widely used in the clinical routine. 4D flow MRI sequences, which dynamically image the anatomic shape and velocity vectors within a region of interest, are promising but rarely used due to their low resolution and signal-to-noise ratio (SNR). Computational fluid dynamics (CFD) simulation is considered as a reference solution for resolution enhancement. However, its precision relies on image segmentation and a clinical expertise for the definition of the vessel borders. The main contribution of this paper is a Segmentation-Free Super-Resolution (SFSR) algorithm. Based on inverse problem methodology, SFSR relies on minimizing a compound criterion involving: a data fidelity term, a fluid mechanics term, and a spatial velocity smoothing term. The proposed algorithm is evaluated with respect to state-of-the-art solutions, in terms of quantification error and computation time, on a synthetic 3D dataset with several noise levels, resulting in a 59% RMSE improvement and factor 2 super-resolution with a noise standard deviation of 5% of the Venc. Finally, its performance is demonstrated, with a scale factor of 2 and 3, on a pulsed flow phantom dataset with more complex patterns. The application on in-vivo were achievable within the 10 min. computation time.

Abstract:
Matching is an important prerequisite for point clouds registration, which is to establish a reliable correspondence between two point clouds. This paper aims to improve recent theoretical and algorithmic results on discrete optimal transport (DOT), since it lacks robustness for the point clouds matching problems with large-scale affine or even nonlinear transformation. We first consider the importance of the used prior probability for accurate matching and give some theoretical analysis. Then, to solve the point clouds matching problems with complex deformation and noise, we propose an improved DOT model, which introduces an orthogonal matrix and a diagonal matrix into the classical DOT model. To enhance its capability of dealing with cases with outliers, we further bring forward a relaxed and regularized DOT model. Meantime, we propose two algorithms to solve the brought forward two models. Finally, extensive experiments on some real datasets are designed in the presence of reflection, large-scale rotation, stretch, noise, and outliers. Some state-of-the-art methods, including CPD, APM, RANSAC, TPS-ICP, TPS-RPM, RPMNet, and classical DOT methods, are to be discussed and compared. For different levels of degradation, the numerical results demonstrate that the proposed methods perform more favorably and robustly than the other methods.

Abstract:
Most existing few-shot learning (FSL) methods require a large amount of labeled data in meta-training, which is a major limit. To reduce the requirement of labels, a semi-supervised meta-training (SSMT) setting has been proposed for FSL, which includes only a few labeled samples and numbers of unlabeled samples in base classes. However, existing methods under this setting require class-aware sample selection from the unlabeled set, which violates the assumption of unlabeled set. In this paper, we propose a practical semi-supervised meta-training setting with truly unlabeled data to facilitate the applications of FSL in realistic scenarios. To better utilize both the labeled and truly unlabeled data, we propose a simple and effective meta-training framework, called pseudo-labeling based meta-learning (PLML). Firstly, we train a classifier via common semi-supervised learning (SSL) and use it to obtain the pseudo-labels of unlabeled data. Then we build few-shot tasks from labeled and pseudo-labeled data and design a novel finetuning method with feature smoothing and noise suppression to better learn the FSL model from noise labels. Surprisingly, through extensive experiments across two FSL datasets, we find that this simple meta-training framework effectively prevents the performance degradation of various FSL models under limited labeled data, and also significantly outperforms the representative SSMT models. Besides, benefiting from meta-training, our method also improves several representative SSL algorithms as well. We provide the training code and usage examples at https://github.com/ouyangtianran/PLML.

Abstract:
Existing methods of multiple human parsing (MHP) apply deep models to learn instance-level representations for segmenting each person into non-overlapped body parts. However, learned representations often contain many spurious correlations that degrade model generalization, leading learned models to be vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causal property integrated parsing model termed CPI-Parser, which is driven by fundamental causal principles involving two causal properties for human parsing (i.e., the causal diversity and the causal invariance). Specifically, we assume that an image is constructed by a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former ones decide the essence of human parsing. Since causal/non-causal factors are unobservable, the proposed CPI-Parser is required to separate key factors that satisfy the causal properties from an image. In this way, the parser is able to rely on causal factors w.r.t relevant evidence rather than non-causal factors w.r.t spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, the CPI-Parser is designed in a flexible way and can be integrated into any existing MHP frameworks. Extensive experiments conducted on three widely used benchmarks demonstrate the effectiveness and generalizability of our method. Code and models are released (https://github.com/HAG-uestc/CPI-Parser) for research purpose.

Abstract:
Pre-trained vision-language models (VLM) such as CLIP, have demonstrated impressive zero-shot performance on various vision tasks. Trained on millions or even billions of image-text pairs, the text encoder has memorized a substantial amount of appearance knowledge. Such knowledge in VLM is usually leveraged by learning specific task-oriented prompts, which may limit its performance in unseen tasks. This paper proposes a new knowledge injection framework to pursue a generalizable adaption of VLM to downstream vision tasks. Instead of learning task-specific prompts, we extract task-agnostic knowledge features, and insert them into features of input images or texts. The fused features hence gain better discriminative capability and robustness to intra-category variances. Those knowledge features are generated by inputting learnable prompt sentences into text encoder of VLM, and extracting its multi-layer features. A new knowledge injection module (KIM) is proposed to refine text features or visual features using knowledge features. This knowledge injection framework enables both modalities to benefit from the rich knowledge memorized in the text encoder. Experiments show that our method outperforms recently proposed methods under few-shot learning, base-to-new classes generalization, cross-dataset transfer, and domain generalization settings. For instance, it outperforms CoOp by 4.5% under the few-shot learning setting, and CoCoOp by 4.4% under the base-to-new classes generalization setting. Our code will be released.

Abstract:
Deep learning-based video denoising methods have achieved great performance improvements in recent years. However, the expensive computational cost arising from sophisticated network design has severely limited their applications in real-world scenarios. To address this practical weakness, we propose a multiscale spatio-temporal memory network for fast video denoising, named MSTMN, aiming at striking an improved trade-off between cost and performance. To develop an efficient and effective algorithm for video denoising, we exploit a multiscale representation based on the Gaussian-Laplacian pyramid decomposition so that the reference frame can be restored in a coarse-to-fine manner. Guided by a model-based optimization approach, we design an effective variance estimation module, an alignment error estimation module and an adaptive fusion module for each scale of the pyramid representation. For the fusion module, we employ a reconstruction recurrence strategy to incorporate local temporal information. Moreover, we propose a memory enhancement module to exploit the global spatio-temporal information. Meanwhile, the similarity computation of the spatio-temporal memory network enables the proposed network to adaptively search the valuable information at the patch level, which avoids computationally expensive motion estimation and compensation operations. Experimental results on real-world raw video datasets have demonstrated that the proposed lightweight network outperforms current state-of-the-art fast video denoising algorithms such as FastDVDnet, EMVD, and ReMoNet with fewer computational costs.

Affiliations: School of Information and Communication Engineering, Dalian University of Technology, Dalian, China; School of Electronic and Information, Wuhan University, Wuhan, China; Department of Precision Instrument, Tsinghua University, Beijing, China; Advanced Computing and Storage Laboratory, Huawei Technologies Company Ltd., Beijing, China; School of Computer Science and Technology and the School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, China; School of Information and Communication Engineering, the School of Future Technology, and the School of Artificial Intelligence, Dalian, China; School of Future Technology and the School of Artificial Intelligence, Dalian University of Technology, Dalian, China

Abstract:
The goal of blurry image deblurring and unfolding task is to recover a single sharp frame or a sequence from a blurry one. Recently, its performance is greatly improved with introduction of a bio-inspired visual sensor, event camera. Most existing event-assisted deblurring methods focus on the design of powerful network architectures and effective training strategy, while ignoring the role of blur modeling in removing various blur in dynamic scenes. In this work, we propose to implicitly model blur in an image by computing blurriness representation with an event-assisted blurriness encoder. The learning of blurriness representation is formulated as a ranking problem based on specially synthesized pairs. Blurriness-aware image unfolding is achieved by integrating blur relevant information contained in the representation into a base unfolding network. The integration is mainly realized by the proposed blurriness-guided modulation and multi-scale aggregation modules. Experiments on GOPRO and HQF datasets show favorable performance of the proposed method against state-of-the-art approaches. More results on real-world data validate its effectiveness in recovering a sequence of latent sharp frames from a blurry image.

Abstract:
Hyperspectral image super-resolution has attained widespread prominence to enhance the spatial resolution of hyperspectral images. However, convolution-based methods have encountered challenges in harnessing the global spatial-spectral information. The prevailing transformer-based methods have not adequately captured the long-range dependencies in both spectral and spatial dimensions. To alleviate this issue, we propose a novel cross-scope spatial-spectral Transformer (CST) to efficiently investigate long-range spatial and spectral similarities for single hyperspectral image super-resolution. Specifically, we devise cross-attention mechanisms in spatial and spectral dimensions to comprehensively model the long-range spatial-spectral characteristics. By integrating global information into the rectangle-window self-attention, we first design a cross-scope spatial self-attention to facilitate long-range spatial interactions. Then, by leveraging appropriately characteristic spatial-spectral features, we construct a cross-scope spectral self-attention to effectively capture the intrinsic correlations among global spectral bands. Finally, we elaborate a concise feed-forward neural network to enhance the feature representation capacity in the Transformer structure. Extensive experiments over three hyperspectral datasets demonstrate that the proposed CST is superior to other state-of-the-art methods both quantitatively and visually. The code is available at https://github.com/Tomchenshi/CST.git.

Abstract:
Multimodal image matching is essential in image stitching, image fusion, change detection, and land cover mapping. However, the severe nonlinear radiometric distortion (NRD) and geometric distortions in multimodal images severely limit the accuracy of multimodal image matching, posing significant challenges to existing methods. Additionally, detector-based methods are prone to feature point offset issues in regions with substantial modal differences, which also hinder the subsequent fine registration and fusion of images. To address these challenges, we propose a guided refinement for detector-free multimodal image matching (GRiD) method, which weakens feature point offset issues by establishing pixel-level correspondences and utilizes reference points to guide and correct matches affected by NRD and geometric distortions. Specifically, we first introduce a detector-free framework to alleviate the feature point offset problem by directly finding corresponding pixels between images. Subsequently, to tackle NRD and geometric distortion in multimodal images, we design a guided correction module that establishes robust reference points (RPs) to guide the search for corresponding pixels in regions with significant modality differences. Moreover, to enhance RPs reliability, we incorporate a phase congruency module during the RPs confirmation stage to concentrate RPs around image edge structures. Finally, we perform finer localization on highly correlated corresponding pixels to obtain the optimized matches. We conduct extensive experiments on four multimodal image datasets to validate the effectiveness of the proposed approach. Experimental results demonstrate that our method can achieve sufficient and robust matches across various modality images and effectively suppress the feature point offset problem.

Abstract:
Despite efforts to construct super-resolution (SR) training datasets with a wide range of degradation scenarios, existing supervised methods based on these datasets still struggle to consistently offer promising results due to the diversity of real-world degradation scenarios and the inherent complexity of model learning. Our work explores a new route: integrating the sample-adaptive property learned through image intrinsic self-similarity and the universal knowledge acquired from large-scale data. We achieve this by uniting internal learning and external learning by an unrolled optimization process. With the merits of both, the tuned fully-supervised SR models can be augmented to broadly handle the real-world degradation in a plug-and-play style. Furthermore, to promote the efficiency of combining internal/external learning, we apply an attention-based weight-updating method to guide the mining of self-similarity, and various data augmentations are adopted while applying the exponential moving average strategy. We conduct extensive experiments on real-world degraded images and our approach outperforms other methods in both qualitative and quantitative comparisons. Our project is available at: https://github.com/ZahraFan/AdaSSR/.

Abstract:
Learned 360° image compression methods using equirectangular projection (ERP) often confront a non-uniform sampling issue, inherent to sphere-to-rectangle projection. While uniformly or nearly uniformly sampling representations, along with their corresponding convolution operations, have been proposed to mitigate this issue, these methods often concentrate solely on uniform sampling rates, thus neglecting the content of the image. In this paper, we urge that different contents within 360° images have varying significance and advocate for the adoption of a content-adaptive parametric representation in 360° image compression, which takes into account both the content and sampling rate. We first introduce the parametric pseudocylindrical representation and corresponding convolution operation, upon which we build a learned 360° image codec. Then, we model the hyperparameter of the representation as the output of a network, derived from the image’s content and its spherical coordinates. We treat the optimization of hyperparameters for different 360° images as distinct compression tasks and propose a meta-learning algorithm to jointly optimize the codec and the metaknowledge, i.e., the hyperparameter estimation network. A significant challenge is the lack of a direct derivative from the compression loss to the hyperparameter network. To address this, we present a novel method to relax the rate-distortion loss as a function of the hyperparameters, enabling gradient-based optimization of the metaknowledge. Experimental results on omnidirectional images demonstrate that our method achieves state-of-the-art performance and superior visual quality.

Abstract:
Highly undersampled schemes in magnetic resonance fingerprinting (MRF) typically lead to aliasing artifacts in reconstructed images, thereby reducing quantitative imaging accuracy. Existing studies mainly focus on improving the reconstruction quality by incorporating temporal or spatial data priors. However, these methods seldom exploit the underlying MRF data structure driven by imaging physics and usually suffer from high computational complexity due to the high-dimensional nature of MRF data. In addition, data priors constructed in a pixel-wise manner struggle to incorporate non-local and non-linear correlations. To address these issues, we introduce a novel MRF reconstruction framework based on the graph embedding framework, exploiting non-linear and non-local redundancies in MRF data. Our work remodels MRF data and parameter maps as graph nodes, redefining the MRF reconstruction problem as a structure-preserved graph embedding problem. Furthermore, we propose a novel scheme for accurately estimating the underlying graph structure, demonstrating that the parameter nodes inherently form a low-dimensional representation of the high-dimensional MRF data nodes. The reconstruction framework is then built by preserving the intrinsic graph structure between MRF data nodes and parameter nodes and extended to exploiting the globality of graph structure. Our approach integrates the MRF data recovery and parameter map estimation into a single optimization problem, facilitating reconstructions geared toward quantitative accuracy. Moreover, by introducing graph representation, our methods substantially reduce the computational complexity, with the computational cost showing a minimal increase as the data acquisition length grows. Experiments show that the proposed method can reconstruct high-quality MRF data and multiple parameter maps within reduced computational time.

Abstract:
We present a novel deep hypergraph modeling architecture (called DHM-Net) for feature matching in this paper. Our network focuses on learning reliable correspondences between two sets of initial feature points by establishing a dynamic hypergraph structure that models group-wise relationships and assigns weights to each node. Compared to existing feature matching methods that only consider pair-wise relationships via a simple graph, our dynamic hypergraph is capable of modeling nonlinear higher-order group-wise relationships among correspondences in an interaction capturing and attention representation learning fashion. Specifically, we propose a novel Deep Hypergraph Modeling block, which initializes an overall hypergraph by utilizing neighbor information, and then adopts node-to-hyperedge and hyperedge-to-node strategies to propagate interaction information among correspondences while assigning weights based on hypergraph attention. In addition, we propose a Differentiation Correspondence-Aware Attention mechanism to optimize the hypergraph for promoting representation learning. The proposed mechanism is able to effectively locate the exact position of the object of importance via the correspondence aware encoding and simple feature gating mechanism to distinguish candidates of inliers. In short, we learn such a dynamic hypergraph format that embeds deep group-wise interactions to explicitly infer categories of correspondences. To demonstrate the effectiveness of DHM-Net, we perform extensive experiments on both real-world outdoor and indoor datasets. Particularly, experimental results show that DHM-Net surpasses the state-of-the-art method by a sizable margin. Our approach obtains an 11.65% improvement under error threshold of 5° for relative pose estimation task on YFCC100M dataset. Code will be released at https://github.com/CSX777/DHM-Net.

Abstract:
Inertial measurement units (IMU) in the capturing device can record the motion information of the device, with gyroscopes measuring angular velocity and accelerometers measuring acceleration. However, conventional deblurring methods seldom incorporate IMU data, and existing approaches that utilize IMU information often face challenges in fully leveraging this valuable data, resulting in noise issues from the sensors. To address these issues, in this paper, we propose a multi-stage deblurring network named INformer, which combines inertial information with the Transformer architecture. Specifically, we design an IMU-image Attention Fusion (IAF) block to merge motion information derived from inertial measurements with blurry image features at the attention level. Furthermore, we introduce an Inertial-Guided Deformable Attention (IGDA) block for utilizing the motion information features as guidance to adaptively adjust the receptive field, which can further refine the corresponding blur kernel for pixels. Extensive experiments on comprehensive benchmarks demonstrate that our proposed method performs favorably against state-of-the-art deblurring approaches.

Abstract:
Lens flare is a common phenomenon when strong light rays arrive at the camera sensor and a clean scene is consequently mixed up with various opaque and semi-transparent artifacts. Existing deep learning methods are always constrained with limited real image pairs for training. Though recent synthesis-based approaches are found effective, synthesized pairs still deviate from the real ones as the mixing mechanism of flare artifacts and scenes in the wild always depends on a line of undetermined factors, such as lens structure, scratches, etc. In this paper, we present a new perspective from the blind nature of the flare removal task in a knowledge-driven manner. Specifically, we present a simple yet effective flare-level estimator to predict the corruption level of a flare-corrupted image. The estimated flare-level can be interpreted as additive information of the gap between corrupted images and their flare-free correspondences to facilitate a network at both training and testing stages adaptively. Besides, we utilize a flare-level modulator to better integrate the estimations into networks. We also devise a flare-aware block for more accurate flare recognition and reconstruction. Additionally, we collect a new real-world flare dataset for benchmarking, namely WiderFlare. Extensive experiments on three benchmark datasets demonstrate that our method outperforms state-of-the-art methods quantitatively and qualitatively.

Abstract:
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.

Abstract:
In this paper, we propose a point-based cross-attention transformer named CrossPoints with parametric Global Porous Sampling (GPS) strategy. The attention module is crucial to capture the correlations between different tokens for transformers. Most existing point-based transformers design multi-scale self-attention operations with down-sampled point clouds by the widely-used Farthest Point Sampling (FPS) strategy. However, FPS only generates sub-clouds with holistic structures, which fails to fully exploit the flexibility of points to generate diversified tokens for the attention module. To address this, we design a cross-attention module with parametric GPS and Complementary GPS (C-GPS) strategies to generate series of diversified tokens through controllable parameters. We show that FPS is a degenerated case of GPS, and the network learns more abundant relational information of the structure and geometry when we perform consecutive cross-attention over the tokens generated by GPS as well as C-GPS sampled points. More specifically, we set evenly-sampled points as queries and design our cross-attention layers with GPS and C-GPS sampled points as keys and values. In order to further improve the diversity of tokens, we design a deformable operation over points to adaptively adjust the points according to the input. Extensive experimental results on both shape classification and indoor scene segmentation tasks indicate promising boosts over the recent point cloud transformers. We also conduct ablation studies to show the effectiveness of our proposed cross-attention module with GPS strategy.

Abstract:
Ensuring the reliability of open-world intelligent systems heavily relies on effective out-of-distribution (OOD) detection. Despite notable successes in existing OOD detection methods, their performance in scenarios with limited training samples is still suboptimal. Therefore, we first construct a comprehensive few-shot OOD detection benchmark in this paper. Remarkably, our investigation reveals that Parameter-Efficient Fine-Tuning (PEFT) techniques, such as visual prompt tuning and visual adapter tuning, outperform traditional methods like fully fine-tuning and linear probing tuning in few-shot OOD detection. Considering that some valuable information from the pre-trained model, which is conducive to OOD detection, may be lost during the fine-tuning process, we reutilize features from the pre-trained models to mitigate this issue. Specifically, we first propose a training-free approach, termed uncertainty score ensemble (USE). This method integrates feature-matching scores to enhance existing OOD detection methods, significantly narrowing the gap between traditional fine-tuning and PEFT techniques. However, due to its training-free property, this method is unable to improve in-distribution accuracy. To this end, we further propose a method called Domain-Specific and General Knowledge Fusion (DSGF) to improve few-shot OOD detection performance and ID accuracy under different fine-tuning paradigms. Experiment results demonstrate that DSGF enhances few-shot OOD detection across different fine-tuning strategies, shot settings, and OOD detection methods. We believe our work can provide the research community with a novel path to leveraging large-scale visual pre-trained models for addressing FS-OOD detection. The code will be released.

Abstract:
Remote sensing super-resolution (SR) technique, which aims to generate high-resolution image with rich spatial details from its low-resolution counterpart, play a vital role in many applications. Recently, more and more studies attempt to explore the application of Transformer in remote sensing field. However, they suffer from the high computational burden and memory consumption for remote sensing super-resolution. In this paper, we propose an efficient Swin Transformer (ESTNet) via channel attention for SR of remote sensing images, which is composed of three components. First, a three-layer convolutional operation is utilized to extract shallow features of the input low-resolution image. Then, a residual group-wise attention module is proposed to extract the deep features, which contains an efficient channel attention block (ECAB) and a group-wise attention block (GAB). Finally, the extracted deep features are reconstructed to generate high-resolution remote sensing images. Extensive experimental results proclaim that the proposed ESTNet can obtain better super-resolution results with low computational burden. Compared to the recently proposed Transformer-based remote sensing super-resolution method, the number of parameters is reduced by 82.68% while the computational cost is reduced by 87.84%. The code of the proposed ESTNet will be available at https://github.com/PuhongDuan/ESTNet for reproducibility.

Abstract:
Explainability is a pivotal factor in determining whether a deep learning model can be authorized in critical applications. To enhance the explainability of models of end-to-end object DEtection with TRansformer (DETR), we introduce a disentanglement method that constrains the feature learning process, following a divide-and-conquer decoupling paradigm, similar to how people understand complex real-world problems. We first demonstrate the entangled property of the features between the extractor and detector and find that the regression function is a key factor contributing to the deterioration of disentangled feature activation. These highly entangled features always activate the local characteristics, making it difficult to cover the semantic information of an object, which also reduces the interpretability of single-backbone object detection models. Thus, an Explainability Enhanced object detection Transformer with feature Disentanglement (DETD) model is proposed, in which the Tensor Singular Value Decomposition (T-SVD) is used to produce feature bases and the Batch averaged Feature Spectral Penalization (BFSP) loss is introduced to constrain the disentanglement of the feature and balance the semantic activation. The proposed method is applied across three prominent backbones, two DETR variants, and a CNN based model. By combining two optimization techniques, extensive experiments on two datasets consistently demonstrate that the DETD model outperforms the counterpart in terms of object detection performance and feature disentanglement. The Grad-CAM visualizations demonstrate the enhancement of feature learning explainability in the disentanglement view.

Abstract:
The open-set text recognition task is a generalized form of the (close-set) text recognition task, where the model is further challenged to spot and incrementally recognize novel characters not covered by the training data. Novel characters also indicate that the language model of the training set is biased from the “real-world”. In this work, we alleviate the confounding effect of such biases by learning from individual character representations isolated from their context. Specifically, we propose a Character-First Open-Set Text Recognition framework that cotrains the feature extractor with two context-free learning tasks. First, a Context Isolation Learning task is proposed to wipe the context for each character from the input image, utilizing a character mask learned in a weak supervision manner. Second, the framework adopts an Individual Character Learning task, which is a single-character classification task with synthetic samples. After training on English and simplified Chinese data, our framework can adapt to recognize unseen characters in Japanese, Korean, Greek, and other scripts without retraining, and can reliably spot unseen characters in Japanese with an F1-score over 64%. The framework also shows 91.5% line accuracy on IIIT5k and a speed of over 69 FPS single-batched, making it a feasible universal lightweight OCR solution that works well for both open-set and close-set use cases.

Abstract:
Unsupervised Domain Adaptation (UDA) has shown promise in Scene Text Recognition (STR) by facilitating knowledge transfer from labeled synthetic text (source) to more challenging unlabeled real scene text (target). However, existing UDA-based STR methods fully rely on the pseudo-labels of target samples, which ignores the impact of domain gaps (inter-domain noise) and various natural environments (intra-domain noise), resulting in poor pseudo-label quality. In this paper, we propose a novel noisy-aware unsupervised domain adaptation framework tailored for STR, which aims to enhance model robustness against both inter- and intra-domain noise, thereby providing more precise pseudo-labels for target samples. Concretely, we propose a reweighting target pseudo-labels by estimating the entropy of refined probability distributions, which mitigates the impact of domain gaps on pseudo-labels. Additionally, a decoupled triple-P-N consistency matching module is proposed, which leverages data augmentation to increase data diversity, enhancing model robustness in diverse natural environments. Within this module, we design a low-confidence-based character negative learning, which is decoupled from high-confidence-based positive learning, thus improving sample utilization under scarce target samples. Furthermore, we extend our framework to the more challenging Source-Free UDA (SFUDA) setting, where only a pre-trained source model is available for adaptation, with no access to source data. Experimental results on benchmark datasets demonstrate the effectiveness of our framework. Under the SFUDA setting, our method exhibits faster convergence and superior performance with less training data than previous UDA-based STR methods. Our method surpasses representative STR methods, establishing new state-of-the-art results across multiple datasets.

Abstract:
Cross-component prediction is an important intra-prediction tool in the modern video coders. Existing prediction methods to exploit cross-component correlation include cross-component linear model and its extension of multi-model linear model. These models are designed for camera captured content. For screen content coding, where videos exhibit different signal characteristics, a cross-component prediction model tailored to their characteristics is desirable. As a pioneering work, we propose a discrete-mapping based cross-component prediction model for screen content coding. Our model relies on the core observation that, screen content videos typically comprise of regions with a few distinct colors and luma value (almost always) uniquely conveys chroma value. Based on this, the proposed method learns a discrete-mapping function from available reconstructed luma-chroma pairs and uses this function to derive chroma prediction from the co-located luma samples. To achieve higher accuracy, a multi-filter approach is employed to derive co-located luma values. The proposed method achieves 2.61%, 3.51% and 3.92% Y, U and V bit-rate savings respectively over Enhanced Compression Model (ECM) 4.0, with negligible complexity, for text and graphics media under all-intra configuration.

Abstract:
Convolutional neural networks (CNNs) and self-attention (SA) have demonstrated remarkable success in low-level vision tasks, such as image super-resolution, deraining, and dehazing. The former excels in acquiring local connections with translation equivariance, while the latter is better at capturing long-range dependencies. However, both CNNs and Transformers suffer from individual limitations, such as limited receptive field and weak diversity representation of CNNs during low efficiency and weak local relation learning of SA. To this end, we propose a multi-scale fusion and decomposition network (MFDNet) for rain perturbation removal, which unifies the merits of these two architectures while maintaining both effectiveness and efficiency. To achieve the decomposition and association of rain and rain-free features, we introduce an asymmetrical scheme designed as a dual-path mutual representation network that enables iterative refinement. Additionally, we incorporate high-efficiency convolutions throughout the network and use resolution rescaling to balance computational complexity with performance. Comprehensive evaluations show that the proposed approach outperforms most of the latest SOTA deraining methods and is versatile and robust in various image restoration tasks, including underwater image enhancement, image dehazing, and low-light image enhancement. The source codes and pretrained models are available at https://github.com/qwangg/MFDNet.

Abstract:
Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of “action-objects” pairs between the training and testing sets. Previous works for CAR usually introduce extra information (e.g. bounding box) to enhance the dynamic cues of video features. However, these approaches do not essentially eliminate the inherent inductive bias in the video, which can be regarded as the stumbling block for model generalization. Because the video features are usually extracted from the visually cluttered areas in which many objects cannot be removed or masked explicitly. To this end, this work attempts to implicitly accomplish semantic-level decoupling of “object-action” in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed as DeFormer, which contains two insightful sub-modules: Objects-Motion Decoupler (OMD) and Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens incorporating annotation priors to learn an instance-level representation and then decouple it into the appearance feature and motion feature in high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association to constrain the decoupled appearance feature and motion feature obtained in OMD. Extensive experiments verify the generalization ability of DeFormer. Specifically, compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Besides, DeFormer gains state-of-the-art results either on ground-truth or detected annotations.

Abstract:
Existing salient object detection methods are capable of predicting binary maps that highlight visually salient regions. However, these methods are limited in their ability to differentiate the relative importance of multiple objects and the relationships among them, which can lead to errors and reduced accuracy in downstream tasks that depend on the relative importance of multiple objects. To conquer, this paper proposes a new paradigm for saliency ranking, which aims to completely focus on ranking salient objects by their “importance order”. While previous works have shown promising performance, they still face ill-posed problems. First, the saliency ranking ground truth (GT) orders generation methods are unreasonable since determining the correct ranking order is not well-defined, resulting in false alarms. Second, training a ranking model remains challenging because most saliency ranking methods follow the multi-task paradigm, leading to conflicts and trade-offs among different tasks. Third, existing regression-based saliency ranking methods are complex for saliency ranking models due to their reliance on instance mask-based saliency ranking orders. These methods require a significant amount of data to perform accurately and can be challenging to implement effectively. To solve these problems, this paper conducts an in-depth analysis of the causes and proposes a whole-flow processing paradigm of saliency ranking task from the perspective of “GT data generation”, “network structure design” and “training protocol”. The proposed approach outperforms existing state-of-the-art methods on the widely-used SALICON set, as demonstrated by extensive experiments with fair and reasonable comparisons. The saliency ranking task is still in its infancy, and our proposed unified framework can serve as a fundamental strategy to guide future work. The code and data will be available at https://github.com/MengkeSong/Saliency-Ranking-Paradigm.

Abstract:
The sparse signals provided by external sources have been leveraged as guidance for improving dense disparity estimation. However, previous methods assume depth measurements to be randomly sampled, which restricts performance improvements due to under-sampling in challenging regions and over-sampling in well-estimated areas. In this work, we introduce an Active Disparity Sampling problem that selects suitable sampling patterns to enhance the utility of depth measurements given arbitrary sampling budgets. We achieve this goal by learning an Adjoint Network for a deep stereo model to measure its pixel-wise disparity quality. Specifically, we design a hard-soft prior supervision mechanism to provide hierarchical supervision for learning the quality map. A Bayesian optimized disparity sampling policy is further proposed to sample depth measurements with the guidance of the disparity quality. Extensive experiments on standard datasets with various stereo models demonstrate that our method is suited and effective in different stereo architectures and outperforms existing fixed and adaptive sampling methods under different sampling rates. Remarkably, the proposed method makes substantial improvements when generalized to heterogeneous unseen domains.

Abstract:
In this work, we propose a new Dual Min-Max Games (DMMG) based self-supervised skeleton action recognition method by augmenting unlabeled data in a contrastive learning framework. Our DMMG consists of a viewpoint variation min-max game and an edge perturbation min-max game. These two min-max games adopt an adversarial paradigm to perform data augmentation on the skeleton sequences and graph-structured body joints, respectively. Our viewpoint variation min-max game focuses on constructing various hard contrastive pairs by generating skeleton sequences from various viewpoints. These hard contrastive pairs help our model learn representative action features, thus facilitating model transfer to downstream tasks. Moreover, our edge perturbation min-max game specializes in building diverse hard contrastive samples through perturbing connectivity strength among graph-based body joints. The connectivity-strength varying contrastive pairs enable the model to capture minimal sufficient information of different actions, such as representative gestures for an action while preventing the model from overfitting. By fully exploiting the proposed DMMG, we can generate sufficient challenging contrastive pairs and thus achieve discriminative action feature representations from unlabeled skeleton data in a self-supervised manner. Extensive experiments demonstrate that our method achieves superior results under various evaluation protocols on widely-used NTU-RGB+D, NTU120-RGB+D and PKU-MMD datasets.

Abstract:
Single object tracking in point clouds has been attracting more and more attention owing to the presence of LiDAR sensors in 3D vision. However, existing methods based on deep neural networks mainly focus on training different models for different categories, which makes them unable to perform well in real-world applications when encountering classes unseen during the training phase. In this work, we investigate a more challenging task in LiDAR point clouds, namely class-agnostic tracking, where a general model is supposed to be learned to handle targets of both observed and unseen categories. In particular, we first investigate the class-agnostic performance of state-of-the-art trackers by exposing the unseen categories to them during testing. It is found that as the distribution shifts from observed to unseen classes, how to constrain the fused features between the template and the search region to maintain generalization is a key factor in class-agnostic tracking. Therefore, we propose a feature decorrelation method to address this problem, which eliminates the spurious correlations of the fused features through a set of learned weights, and further makes the search region consistent among foreground points and distinctive between foreground and background points. Experiments on KITTI and NuScenes demonstrate that the proposed method can achieve considerable improvements by benchmarking against the advanced trackers P2B and BAT, especially when tracking unseen objects.

Abstract:
Previous methods based on 3DCNN, convLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, these methods still suffer from high computational costs or poor quality of the generated saliency maps. To address this, we design a space-time memory (STM)-based network that employs a standard encoder–decoder architecture. During the encoding stage, we extract high-level temporal features from the current frame and its adjacent frames, which is more efficient and practical than methods reliant on optical flow. During the decoding stage, we introduce an effective fusion strategy for both spatial and temporal branches. The semantic information of the high-level features is used to improve the object details in the low-level features. Subsequently, spatiotemporal features are methodically derived step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision prevalent in image salient object detection (ISOD), we design a motion-aware loss that predicts object boundary motion, and simultaneously perform multitask learning for VSOD and object motion prediction. This can further enhance the model’s capability to accurately extract spatiotemporal features while maintaining object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method and can achieve state-of-the-art metrics on some datasets. Our proposed model does not require optical flow or additional preprocessing, and can reach an impressive inference speed of nearly 100 FPS.

Abstract:
Deep neural networks (DNNs) are shown to be vulnerable to universal adversarial perturbations (UAP), a single quasi-imperceptible perturbation that deceives the DNNs on most input images. The current UAP methods can be divided into data-dependent and data-independent methods. The former exhibits weak transferability in black-box models due to overly relying on model-specific features. The latter shows inferior attack performance in white-box models as it fails to exploit the model’s response information to benign images. To address the above issues, this paper proposes a novel universal adversarial attack to generate UAP with strong transferability by disrupting the model-agnostic features (e.g., edges or simple texture), which are invariant to the models. Specifically, we first devise an objective function to weaken the significant channel-wise features and strengthen the less significant channel-wise features, which are partitioned by the designed strategy. Furthermore, the proposed objective function eliminates the dependency on labeled samples, allowing us to utilize out-of-distribution (OOD) data to train UAP. To enhance the attack performance with limited training samples, we exploit the average gradient of the mini-batch input to update the UAP iteratively, which encourages the UAP to capture the local information inside the mini-batch input. In addition, we introduce the momentum term to accumulate the gradient information at each iterative step for the purpose of perceiving the global information over the training set. Finally, extensive experimental results demonstrate that the proposed methods outperform the existing UAP approaches. Additionally, we exhaustively investigate the transferability of the UAP across models, datasets, and tasks.

Abstract:
Unsupervised Domain adaptation (UDA) aims to transfer knowledge from the labeled source domain to the unlabeled target domain. Most existing domain adaptation methods are based on convolutional neural networks (CNNs) to learn cross-domain invariant features. Inspired by the success of transformer architectures and their superiority to CNNs, we propose to combine the transformer with UDA to improve their generalization properties. In this paper, we present a novel model named Trans ferable V ector Q uantization A lignment for Unsupervised Domain Adaptation (TransVQA), which integrates the Transferable transformer-based feature extractor (Trans), vector quantization domain alignment (VQA), and mutual information weighted maximization confusion matrix (MIMC) of intra-class discrimination into a unified domain adaptation framework. First, TransVQA uses the transformer to extract more accurate features in different domains for classification. Second, TransVQA, based on the vector quantization alignment module, uses a two-step alignment method to align the extracted cross-domain features and solve the domain shift problem. The two-step alignment includes global alignment via vector quantization and intra-class local alignment via pseudo-labels. Third, for intra-class feature discrimination problem caused by the fuzzy alignment of different domains, we use the MIMC module to constrain the target domain output and increase the accuracy of pseudo-labels. The experiments on several datasets of domain adaptation show that TransVQA can achieve excellent performance and outperform existing state-of-the-art methods.

Abstract:
Composed query image retrieval task aims to retrieve the target image in the database by a query that composes two different modalities: a reference image and a sentence declaring that some details of the reference image need to be modified and replaced by new elements. Tackling this task needs to learn a multimodal embedding space, which can make semantically similar targets and queries close but dissimilar targets and queries as far away as possible. Most of the existing methods start from the perspective of model structure and design some clever interactive modules to promote the better fusion and embedding of different modalities. However, their learning objectives use conventional query-level examples as negatives while neglecting the composed query’s multimodal characteristics, leading to the inadequate utilization of the training data and suboptimal construction of metric space. To this end, in this paper, we propose to improve the learning objective by constructing and mining hard negative examples from the perspective of multimodal fusion. Specifically, we compose the reference image and its logically unpaired sentences rather than paired ones to create component-level negative examples to better use data and enhance the optimization of metric space. In addition, we further propose a new sentence augmentation method to generate more indistinguishable multimodal negative examples from the element level and help the model learn a better metric space. Massive comparison experiments on four real-world datasets confirm the effectiveness of the proposed method.

Abstract:
While the wisdom of training an image dehazing model on synthetic hazy data can alleviate the difficulty of collecting real-world hazy/clean image pairs, it brings the well-known domain shift problem. From a different yet new perspective, this paper explores contrastive learning with an adversarial training effort to leverage unpaired real-world hazy and clean images, thus alleviating the domain shift problem and enhancing the network’s generalization ability in real-world scenarios. We propose an effective unsupervised contrastive learning paradigm for image dehazing, dubbed UCL-Dehaze. Unpaired real-world clean and hazy images are easily captured, and will serve as the important positive and negative samples respectively when training our UCL-Dehaze network. To train the network more effectively, we formulate a new self-contrastive perceptual loss function, which encourages the restored images to approach the positive samples and keep away from the negative samples in the embedding space. Besides the overall network architecture of UCL-Dehaze, adversarial training is utilized to align the distributions between the positive samples and the dehazed images. Compared with recent image dehazing works, UCL-Dehaze does not require paired data during training and utilizes unpaired positive/negative data to better enhance the dehazing performance. We conduct comprehensive experiments to evaluate our UCL-Dehaze and demonstrate its superiority over the state-of-the-arts, even only 1,800 unpaired real-world images are used to train our network. Source code is publicly available at https://github.com/yz-wang/UCL-Dehaze.

Abstract:
As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires a large amount of multi-grained information of visual and linguistic modalities to realize accurate reasoning. In addition, due to the diversity of visual scenes and the variation of linguistic expressions, some hard examples have much more abundant multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task. To address aforementioned challenges, in this paper, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention, which effectively utilizes the inherent multi-grained information in visual and linguistic encoders. Furthermore, considering the large variance of samples, we propose a self-paced sample informativeness learning to adaptively enhance the network learning for samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on widely used datasets, such as RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets, demonstrating the effectiveness of our method.

Abstract:
The key to multi-object tracking is its stability and the retention of identity information. A common problem with most detection-based approaches is trusting and using all the detector outputs for the association. However, some settings of detectors can affect stable long-range tracking. Based on the principle of reducing the association noise in the detection processing step, we propose a new framework, the Box application Pattern Mining Tracker (BPMTrack), to address this issue. Specifically, we worked on three main aspects: output threshold, association strategy, and motion model. Due to the problem of inconsistency between classification scores and localization accuracy, we propose the Box Quality Estimation Network (BQENet) to predict the localization quality scores of all detections in the current frame, reserving high-quality boxes for the tracker. In addition, based on observations of intensive scenarios, we propose a simple and effective data association method, the Non-Maximum Suppression Integration (NMSI) matching strategy. It recovers the Non-Maximum Suppression (NMS) detection, inputs them into BQENet, and then performs hierarchical matching with reasonable control of box priority to alleviate the problem of absent objects caused by occlusion. Finally, we propose an improved Measurement Correct and Noise Scale (MCNS) Kalman algorithm to improve the prediction accuracy of object positions and, thus, the association quality. We performed an extensive ablation evaluation of the proposed framework to prove its effectiveness. Moreover, the three tracking benchmarks show our method’s accuracy and long-distance performance.

Abstract:
In this paper, we focus on the weakly supervised video object detection problem, where each training video is only tagged with object labels, without any bounding box annotations of objects. To effectively train object detectors from such weakly-annotated videos, we propose a Progressive Frame-Proposal Mining (PFPM) framework by exploiting discriminative proposals in a coarse-to-fine manner. First, we design a flexible Multi-Level Selection (MLS) scheme, with explicit guidance of video tags. By selecting object-relevant frames and mining important proposals from these frames, the proposed MLS can effectively reduce frame redundancy as well as improve proposal effectiveness to boost weakly-supervised detectors. Moreover, we develop a novel Holistic-View Refinement (HVR) scheme, which can globally evaluate importance of proposals among frames, and thus correctly refine pseudo ground truth boxes for training video detectors in a self-supervised manner. Finally, we evaluate the proposed PFPM on a large-scale benchmark for video object detection, on ImageNet VID, under the setting of weak annotations. The experimental results demonstrate that our PFPM significantly outperforms the state-of-the-art weakly-supervised detectors.

Abstract:
Video grounding, the process of identifying a specific moment in an untrimmed video based on a natural language query, has become a popular topic in video understanding. However, fully supervised learning approaches for video grounding that require large amounts of annotated data can be expensive and time-consuming. Recently, zero-shot video grounding (ZS-VG) methods that leverage pre-trained object detectors and language models to generate pseudo-supervision for training video grounding models have been developed. However, these approaches have limitations in recognizing diverse categories and capturing specific dynamics and interactions in the video context. To tackle these challenges, we introduce a novel two-stage ZS-VG framework called Lookup-and-Verification (LoVe), which treats the pseudo-query generation procedure as a video-to-concept retrieval problem. Our approach allows for the extraction of diverse concepts from an open-concept pool and employs a verification process to ensure the relevance of the retrieved concepts to the objects or events of interest in the video proposals. Comprehensive experimental results on the Charades-STA, ActivityNet-Captions, and DiDeMo datasets demonstrate the effectiveness of the LoVe framework.

Abstract:
The Geometry-based Point Cloud Compression (G-PCC) has been developed by the Moving Picture Experts Group to compress point clouds efficiently. Nevertheless, in its lossy mode, the reconstructed point cloud by G-PCC often suffers from noticeable distortions due to naïve geometry quantization (i.e., grid downsampling). This paper proposes a hierarchical prior-based super resolution method for point cloud geometry compression. The content-dependent hierarchical prior is constructed at the encoder side, which enables coarse-to-fine super resolution of the point cloud geometry at the decoder side. A more accurate prior generally yields improved reconstruction performance, albeit at the cost of increased bits required to encode this piece of side information. Our experiments on the MPEG Cat1A dataset demonstrate substantial Bjøntegaard-delta bitrate savings, surpassing the performance of the octree-based and trisoup-based G-PCC v14. We provide our implementations for reproducible research at https://github.com/lidq92/mpeg-pcc-tmc13.

Abstract:
Person search by language refers to searching for the interested pedestrian images given natural language sentences, which requires capturing fine-grained differences to accurately distinguish different pedestrians, while still far from being well addressed by most of the current solutions. In this paper, we propose the Comprehensive Attribute Prediction Learning (CAPL) method, which explicitly carries out attribute prediction learning, for improving the modeling capabilities of fine-grained semantic attributes and obtaining more discriminative visual and textual representations. First, we construct the semantic ATTribute Vocabulary (ATT-Vocab) based on sentence analysis. Second, the complementary context-wise and attribute-wise attribute predictions are simultaneously conducted to better model the high-frequency in-vocab attributes in our In-vocab Attribute Prediction (IAP) module. Third, to additionally consider the out-of-vocab semantics, we present the Attribute Completeness Learning (ACL) module for better capturing the low-frequency attributes outside the ATT-Vocab, obtaining more comprehensive representations. Combining the IAP and ACL modules together, our CAPL method has obtained the currently state-of-the-art retrieval performance on two widely-used benchmarks, i.e., CUHK-PEDES and ICFG-PEDES datasets. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacities of our CAPL method.

Affiliations: Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China; Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology Research and Development, Xi’an Jiaotong University, Xi’an, Shaanxi, China; School of Data Science and Artificial Intelligence, Chang'an University, Xi’an, Shanxi, China; School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China; Lenovo Research, Beijing, China

Abstract:
As a knowledge carrier, the diagram is widely distributed in many aspects of human life, such as textbooks, architectural drawings, and documents. Different from natural images, representations of visual elements in the diagram are sparser, and similar visual representations can reflect dissimilar semantics. Thus, current methods fail to capture the visual elements with precise semantics. To address this issue, regarding the aligned visual and textual elements as pairs is the way to assign the precise semantics of textual elements to visual elements. We build the first diagram dataset named align diagram element (ADE), which includes annotations for alignment relations between visual and textual elements. And we propose a visual-textual alignment model (VTAM) including graph construction and optimal aligning phases. In the graph construction phase, the relational graphs are constructed between different elements with four relational operators. The relational operators are designed to measure the relations between different elements, according to distance, connection line, inclusion, and feature similarity. In the optimal aligning phase, the representation at each visual-textual pair is improved as a weighted sum of the representations on all relational graphs. Experimental results show that our VTAM achieves a significant improvement of 10.9% on mean test folds of the ADE dataset than the current best competitor. In order to explore the role of alignment relations in diagram parsing, we introduce VTAM to diagram-related tasks, such as diagram question answering (DQA). And we achieve 2.8% to 5.9% and 4.6% to 5.1% improvements on AI2D and Foodwebs after adding VTAM. Our dataset and code are released at: https://github.com/ADE-dataset/ADE-dataset.

Abstract:
Notwithstanding the prominent performance shown in various applications, point cloud recognition models have often suffered from natural corruptions and adversarial perturbations. In this paper, we delve into boosting the general robustness of point cloud recognition, proposing Point-Cloud Contrastive Adversarial Training (PointCAT). The main intuition of PointCAT is encouraging the target recognition model to narrow the decision gap between clean point clouds and corrupted point clouds by devising feature-level constraints rather than logit-level constraints. Specifically, we leverage a supervised contrastive loss to facilitate the alignment and the uniformity of hypersphere representations, and design a pair of centralizing losses with dynamic prototype guidance to prevent features from deviating outside their belonging category clusters. To generate more challenging corrupted point clouds, we adversarially train a noise generator concurrently with the recognition model from the scratch. This differs from previous adversarial training methods that utilized gradient-based attacks as the inner loop. Comprehensive experiments show that the proposed PointCAT outperforms the baseline methods, significantly enhancing the robustness of diverse point cloud recognition models under various corruptions, including isotropic point noises, the LiDAR simulated noises, random point dropping, and adversarial perturbations. Our code is available at: https://github.com/shikiw/PointCAT.

Abstract:
Neuromorphic imaging reacts to per-pixel brightness changes of a dynamic scene with high temporal precision and responds with asynchronous streaming events as a result. It also often supports a simultaneous output of an intensity image. Nevertheless, the raw events typically involve a large amount of noise due to the high sensitivity of the sensor, while capturing fast-moving objects at low frame rates results in blurry images. These deficiencies significantly degrade human observation and machine processing. Fortunately, the two information sources are inherently complementary — events with microsecond-level temporal resolution, which are triggered by the edges of objects recorded in a latent sharp image, can supply rich motion details missing from the blurry one. In this work, we bring the two types of data together and introduce a simple yet effective unifying algorithm to jointly reconstruct blur-free images and noise-robust events in an iterative coarse-to-fine fashion. Specifically, an event-regularized prior offers precise high-frequency structures and dynamic features for blind deblurring, while image gradients serve as a kind of faithful supervision in regulating neuromorphic noise removal. Comprehensively evaluated on real and synthetic samples, such a synergy delivers superior reconstruction quality for both images with severe motion blur and raw event streams with a storm of noise, and also exhibits greater robustness to challenging realistic scenarios such as varying levels of illumination, contrast and motion magnitude. Meanwhile, it can be driven by much fewer events and holds a competitive edge at computational time overhead, rendering itself preferable as available computing resources are limited. Our solution gives impetus to the improvement of both sensing data and paves the way for highly accurate neuromorphic reasoning and analysis.

Abstract:
To promote the application of object detectors in real scenes, out-of-distribution object detection (OOD-OD) is proposed to distinguish whether detected objects belong to the ones that are unseen during training or not. One of the key challenges is that detectors lack unknown data for supervision, and as a result, can produce overconfident detection results on OOD data. Thus, this task requires to synthesize OOD data for training, which achieves the goal of enhancing the ability of localizing and discriminating OOD objects. In this paper, we propose a novel method, i.e., PCA-Driven dynamic prototype enhancement, to explore exploiting Principal Component Analysis (PCA) to extract simulative OOD data for training and obtain dynamic prototypes that are related to the current input and are helpful for boosting the discrimination ability. Concretely, the last few principal components of the backbone features are utilized to calculate an OOD map that involves plentiful information that deviates from the correlation distribution of the input. The OOD map is further used to extract simulative OOD data for training, which alleviates the impact of lacking unknown data. Besides, for in-distribution (ID) data, the category-level semantic information of objects between the backbone features and the high-level features should be kept consistent. To this end, we utilize the residual principal components to extract dynamic prototypes that reflect the semantic information of the current backbone features. Next, we define a contrastive loss to leverage these prototypes to enlarge the semantic gap between the simulative OOD data and the features from the residual principal components, which improves the ability of discriminating OOD objects. In the experiments, we separately verify our method on OOD-OD and incremental object detection. The significant performance gains demonstrate the superiorities of our method.

Abstract:
Graph convolutional networks (GCN) have recently been studied to exploit the graph topology of the human body for skeleton-based action recognition. However, most of these methods unfortunately aggregate messages via an inflexible pattern for various action samples, lacking the awareness of intra-class variety and the suitableness for skeleton sequences, which often contain redundant or even detrimental connections. In this paper, we propose a novel Deformable Graph Convolutional Network (DeGCN) to adaptively capture the most informative joints. The proposed DeGCN learns the deformable sampling locations on both spatial and temporal graphs, enabling the model to perceive discriminative receptive fields. Notably, considering human action is inherently continuous, the corresponding temporal features are defined in a continuous latent space. Furthermore, we design an innovative multi-branch framework, which not only strikes a better trade-off between accuracy and model size, but also elevates the effect of ensemble between the joint and bone modalities remarkably. Extensive experiments show that our proposed method achieves state-of-the-art performances on three widely used datasets, NTU RGB+D, NTU RGB+D 120, and NW-UCLA.

Abstract:
Current semi-supervised video object segmentation (VOS) methods often employ the entire features of one frame to predict object masks and update memory. This introduces significant redundant computations. To reduce redundancy, we introduce a Region Aware Video Object Segmentation (RAVOS) approach, which predicts regions of interest (ROIs) for efficient object segmentation and memory storage. RAVOS includes a fast object motion tracker to predict object ROIs in the next frame. For efficient segmentation, object features are extracted based on the ROIs, and an object decoder is designed for object-level segmentation. For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects. In addition to RAVOS, we also propose a large-scale occluded VOS dataset, dubbed OVOS, to benchmark the performance of VOS models under occlusions. Evaluation on DAVIS and YouTube-VOS benchmarks and our new OVOS dataset show that our method achieves state-of-the-art performance with significantly faster inference time, e.g., 86.1~\mathcal J \& \mathcal F at 42 FPS on DAVIS and 84.4~\mathcal J \& \mathcal F at 23 FPS on YouTube-VOS. Project page: ravos.netlify.app.

Abstract:
Previous monocular 3D detection works focus on the single frame input in both training and inference. In real-world applications, temporal and motion information naturally exists in monocular video. It is valuable for 3D detection but under-explored in monocular works. In this paper, we propose a straightforward and effective method for temporal feature fusion, which exhibits low computation cost and excellent transferability, making it conveniently applicable to various monocular models. Specifically, with the help of optical flow, we transform the backbone features produced by prior frames and fuse them into the current frame. We introduce the scene feature propagating mechanism, which accumulates history scene features without extra time-consuming. In this process, occluded areas are removed via forward-backward scene consistency. Our method naturally introduces valuable temporal features, facilitating 3D reasoning in monocular 3D detection. Furthermore, accumulated history scene features via scene propagating mitigate heavy computation overheads for video processing. Experiments are conducted on variant baselines, which demonstrate that the proposed method is model-agonistic and can bring significant improvement to multiple types of single-frame methods.

Abstract:
Temporal phase unwrapping based on single auxiliary binary coded pattern has been proven to be effective for high-speed 3D measurement. However, in traditional spatial binary coding, it often leads to an imbalance between the number of periodic divisions and codewords. To meet this challenge, a large codewords orthogonal spatial binary coding method is proposed in this paper. By expanding spatial multiplexing from 1D to 2D orthogonal direction, it goes beyond the traditional 8 codewords to 27 codewords at three-level periodic division. In addition, a novel full-period connected domain segmentation technique based on local localization is proposed to avoid the time-consuming global iterative erosion and complex anomaly detection in traditional methods. For the decoding process, a purely spatial codewords recognition and a spatial-temporal hybrid codewords recognition methods are established to better suppress the percentage offset caused by static defocusing and dynamic motion, respectively. Obviating the need for intricate symbol recognition, the decoding process in our proposed method encompasses a straightforward analysis of statistical distribution. Building upon the development of special spatial binary coding, we have achieved a well-balance between low periodic division and large codewords for the first time. The experimental results verify the feasibility and validity of our proposed whole image processing method in both static and dynamic measurements.

Abstract:
Existing cross-domain classification and detection methods usually apply a consistency constraint between the target sample and its self-augmentation for unsupervised learning without considering the essential source knowledge. In this paper, we propose a Source-guided Target Feature Reconstruction (STFR) module for cross-domain visual tasks, which applies source visual words to reconstruct the target features. Since the reconstructed target features contain the source knowledge, they can be treated as a bridge to connect the source and target domains. Therefore, using them for consistency learning can enhance the target representation and reduce the domain bias. Technically, source visual words are selected and updated according to the source feature distribution, and applied to reconstruct the given target feature via a weighted combination strategy. After that, consistency constraints are built between the reconstructed and original target features for domain alignment. Furthermore, STFR is connected with the optimal transportation algorithm theoretically, which explains the rationality of the proposed module. Extensive experiments on nine benchmarks and two cross-domain visual tasks prove the effectiveness of the proposed STFR module, e.g., 1) cross-domain image classification: obtaining average accuracy of 91.0%, 73.9%, and 87.4% on Office-31, Office-Home, and VisDA-2017, respectively; 2) cross-domain object detection: obtaining mAP of 44.50% on Cityscapes \rightarrow Foggy Cityscapes, AP on car of 78.10% on Cityscapes \rightarrow KITTI, MR ^-2 of 8.63%, 12.27%, 22.10%, and 40.58% on COCOPersons \rightarrow Caltech, CityPersons \rightarrow Caltech, COCOPersons \rightarrow CityPersons, and Caltech \rightarrow CityPersons, respectively.

Abstract:
Within the tensor singular value decomposition (T-SVD) framework, existing robust low-rank tensor completion approaches have made great achievements in various areas of science and engineering. Nevertheless, these methods involve the T-SVD based low-rank approximation, which suffers from high computational costs when dealing with large-scale tensor data. Moreover, most of them are only applicable to third-order tensors. Against these issues, in this article, two efficient low-rank tensor approximation approaches fusing random projection techniques are first devised under the order-d ( d\geq 3 ) T-SVD framework. Theoretical results on error bounds for the proposed randomized algorithms are provided. On this basis, we then further investigate the robust high-order tensor completion problem, in which a double nonconvex model along with its corresponding fast optimization algorithms with convergence guarantees are developed. Experimental results on large-scale synthetic and real tensor data illustrate that the proposed method outperforms other state-of-the-art approaches in terms of both computational efficiency and estimated precision.

Abstract:
Few-shot image generation aims to generate images of high quality and great diversity with limited data. However, it is difficult for modern GANs to avoid overfitting when trained on only a few images. The discriminator can easily remember all the training samples and guide the generator to replicate them, leading to severe diversity degradation. Several methods have been proposed to relieve overfitting by adapting GANs pre-trained on large source domains to target domains using limited real samples. This work presents masked discrimination to realize few-shot GAN adaptation, which is the first feature-level augmentation method for generative tasks. Random masks are applied to features extracted by the discriminator from input images. We aim to encourage the discriminator to judge various images that share partially common features with training samples as realistic. Correspondingly, the generator is guided to generate diverse images instead of replicating training samples. In addition, we employ a cross-domain consistency loss for the discriminator to keep relative distances between generated samples in its feature space. It strengthens global image discrimination and guides adapted GANs to preserve more information learned from source domains for higher image quality, resulting in better cross-domain correspondence. The effectiveness of our approach is demonstrated both qualitatively and quantitatively with higher quality and greater diversity on a series of few-shot image generation tasks than prior methods.

Abstract:
Detecting infrared small targets under cluttered background is mainly challenged by dim textures, low contrast and varying shapes. This paper proposes an approach to facilitate infrared small target detection by learning contrast-enhanced shape-biased representations. The approach cascades a contrast-shape encoder and a shape-reconstructable decoder to learn discriminative representations that can effectively identify target objects. The contrast-shape encoder applies a stem of central difference convolutions and a few large-kernel convolutions to extract shape-preserving features from input infrared images. This specific design in convolutions can effectively overcome the challenges of low contrast and varying shapes in a unified way. Meanwhile, the shape-reconstructable decoder accepts the edge map of input infrared image and is learned by simultaneously optimizing two shape-related consistencies: the internal one decodes the encoder representations by upsampling reconstruction and constraints segmentation consistency, whilst the external one cascades three gated ResNet blocks to hierarchically fuse edge maps and decoder representations and constrains contour consistency. This decoding way can bypass the challenge of dim texture and varying shapes. In our approach, the encoder and decoder are learned in an end-to-end manner, and the resulting shape-biased encoder representations are suitable for identifying infrared small targets. Extensive experimental evaluations are conducted on public benchmarks and the results demonstrate the effectiveness of our approach.

Abstract:
In this paper, we propose a graph-represented image distribution similarity (GRIDS) index for full-reference (FR) image quality assessment (IQA), which can measure the perceptual distance between distorted and reference images by assessing the disparities between their distribution patterns under a graph-based representation. First, we transform the input image into a graph-based representation, which is proven to be a versatile and effective choice for capturing visual perception features. This is achieved through the automatic generation of a vision graph from the given image content, leading to holistic perceptual associations for irregular image regions. Second, to reflect the perceived image distribution, we decompose the undirected graph into cliques and then calculate the product of the potential functions for the cliques to obtain the joint probability distribution of the undirected graph. Finally, we compare the distances between the graph feature distributions of the distorted and reference images at different stages; thus, we combine the distortion distribution measurements derived from different graph model depths to determine the perceived quality of the distorted images. The empirical results obtained from an extensive array of experiments underscore the competitive nature of our proposed method, which achieves performance on par with that of the state-of-the-art methods, demonstrating its exceptional predictive accuracy and ability to maintain consistent and monotonic behaviour in image quality prediction tasks. The source code is publicly available at the following website https://github.com/Land5cape/GRIDS.

Affiliations: College of Computer Science and Technology, Qingdao University, Qingdao, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China; Institute of Biomedical Informatics, Department of Computer Science, University of Kentucky, Lexington, KY, USA

Abstract:
Multi-view subspace clustering (MVSC) has drawn significant attention in recent study. In this paper, we propose a novel approach to MVSC. First, the new method is capable of preserving high-order neighbor information of the data, which provides essential and complicated underlying relationships of the data that is not straightforwardly preserved by the first-order neighbors. Second, we design log-based nonconvex approximations to both tensor rank and tensor sparsity, which are effective and more accurate than the convex approximations. For the associated shrinkage problems, we provide elegant theoretical results for the closed-form solutions, for which the convergence is guaranteed by theoretical analysis. Moreover, the new approximations have some interesting properties of shrinkage effects, which are guaranteed by elegant theoretical results. Extensive experimental results confirm the effectiveness of the proposed method.

Abstract:
Depth images and thermal images contain the spatial geometry information and surface temperature information, which can act as complementary information for the RGB modality. However, the quality of the depth and thermal images is often unreliable in some challenging scenarios, which will result in the performance degradation of the two-modal based salient object detection (SOD). Meanwhile, some researchers pay attention to the triple-modal SOD task, namely the visible-depth-thermal (VDT) SOD, where they attempt to explore the complementarity of the RGB image, the depth image, and the thermal image. However, existing triple-modal SOD methods fail to perceive the quality of depth maps and thermal images, which leads to performance degradation when dealing with scenes with low-quality depth and thermal images. Therefore, in this paper, we propose a quality-aware selective fusion network (QSF-Net) to conduct VDT salient object detection, which contains three subnets including the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. Firstly, except for extracting features, the initial feature extraction subnet can generate a preliminary prediction map from each modality via a shrinkage pyramid architecture, which is equipped with the multi-scale fusion (MSF) module. Then, we design the weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions, which further constitute the pseudo label that can be used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, and then fuses the triple-modal features and refines the edge details of prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively. Extensive experiments are performed on VDT-2048 dataset, and the results show that our saliency model consistently outperforms 13 state-of-the-art methods with a large margin. Our code and results are available at https://github.com/Lx-Bao/QSFNet.

Abstract:
With human action anticipation becoming an essential tool for many practical applications, there has been an increasing trend in developing more accurate anticipation models in recent years. Most of the existing methods target standard action anticipation datasets, in which they could produce promising results by learning action-level contextual patterns. However, the over-simplified scenarios of standard datasets often do not hold in reality, which hinders them from being applied to real-world applications. To address this, we propose a scene-graph-based novel model SEAD that learns the action anticipation at the high semantic level rather than focusing on the action level. The proposed model is composed of two main modules, 1) the scene prediction module, which predicts future scene graphs using a grammar dictionary, and 2) the action anticipation module, which is responsible for predicting future actions with an LSTM network by taking as input the observed and predicted scene graphs. We evaluate our model on two real-world video datasets (Charades and Home Action Genome) as well as a standard action anticipation dataset (CAD-120) to verify its efficacy. The experimental results show that SEAD is able to outperform existing methods by large margins on the two real-world datasets and can also yield stable predictions on the standard dataset at the same time. In particular, our proposed model surpasses the state-of-the-art methods with mean average precision improvements consistently higher than 65% on the Charades dataset and an average improvement of 40.6% on the Home Action Genome dataset.

Abstract:
Advances in multisource remote sensing have allowed for the development of more comprehensive observation. The adoption of deep convolutional neural networks (CNN) naturally includes spatial-spectral information, which has achieved promising performance in multisource data classification. However, challenges are still found with the extraction of spatial distribution and spectrum relationships, which eventually limit the classification performance. To solve the issue, a spatial-spectral perception network (S2PNet) is proposed to extract the advantages of different data sources and the cross information between data sources in a targeted manner. Specifically, the spatial perception network is developed to build the spatial distribution relationship from high-resolution images, while the spectral perception network extracts the spectrum relationship from spectral images. For perceiving cross information, a memory unit is utilized to store the features from different data sources in succession. In addition, the distance loss and reconstruction loss are introduced to keep the feature integrity, and the cross-entropy loss ensures that features can distinguish different classes. The comprehensive experiments are conducted on several datasets to validate the superiority of the proposed algorithm. The proposed S2PNet outperforms the considered classifiers with an average improvement of +0.77%, +5.62%, +1.58%, and +1.79% for overall accuracy values.

Abstract:
JPEG compression adopts the quantization of Discrete Cosine Transform (DCT) coefficients for effective bit-rate reduction, whilst the quantization could lead to a significant loss of important image details. Recovering compressed JPEG images in the frequency domain has recently garnered increasing interest, complementing the multitude of restoration techniques established in the pixel domain. However, existing DCT domain methods typically suffer from limited effectiveness in handling a wide range of compression quality factors or fall short in recovering sparse quantized coefficients and the components across different colorspaces. To address these challenges, we propose a DCT domain spatial-frequential Transformer, namely DCTransformer, for JPEG quantized coefficient recovery. Specifically, a dual-branch architecture is designed to capture both spatial and frequential correlations within the collocated DCT coefficients. Moreover, we incorporate the operation of quantization matrix embedding, which effectively allows our single model to handle a wide range of quality factors, and a luminance-chrominance alignment head that produces a unified feature map to align different-sized luminance and chrominance components. Our proposed DCTransformer outperforms the current state-of-the-art JPEG artifact removal techniques, as demonstrated by our extensive experiments.

Abstract:
In this paper, novel robust principal component analysis (RPCA) methods are proposed to exploit the local structure of datasets. The proposed methods are derived by minimizing the \alpha -divergence between the sample distribution and the Gaussian density model. The \alpha - divergence is used in different frameworks to represent variants of RPCA approaches including orthogonal, non-orthogonal, and sparse methods. We show that the classical PCA is a special case of our proposed methods where the \alpha - divergence is reduced to the Kullback-Leibler (KL) divergence. It is shown in simulations that the proposed approaches recover the underlying principal components (PCs) by down-weighting the importance of structured and unstructured outliers. Furthermore, using simulated data, it is shown that the proposed methods can be applied to fMRI signal recovery and Foreground-Background (FB) separation in video analysis. Results on real world problems of FB separation as well as image reconstruction are also provided.

Abstract:
Few-shot learning (FSL) poses a significant challenge in classifying unseen classes with limited samples, primarily stemming from the scarcity of data. Although numerous generative approaches have been investigated for FSL, their generation process often results in entangled outputs, exacerbating the distribution shift inherent in FSL. Consequently, this considerably hampers the overall quality of the generated samples. Addressing this concern, we present a pioneering framework called DisGenIB, which leverages an Information Bottleneck (IB) approach for Disentangled Generation. Our framework ensures both discrimination and diversity in the generated samples, simultaneously. Specifically, we introduce a groundbreaking Information Theoretic objective that unifies disentangled representation learning and sample generation within a novel framework. In contrast to previous IB-based methods that struggle to leverage priors, our proposed DisGenIB effectively incorporates priors as invariant domain knowledge of sub-features, thereby enhancing disentanglement. This innovative approach enables us to exploit priors to their full potential and facilitates the overall disentanglement process. Moreover, we establish the theoretical foundation that reveals certain prior generative and disentanglement methods as special instances of our DisGenIB, underscoring the versatility of our proposed framework. To solidify our claims, we conduct comprehensive experiments on demanding FSL benchmarks, affirming the remarkable efficacy and superiority of DisGenIB. Furthermore, the validity of our theoretical analyses is substantiated by the experimental results. Our code is available at https://github.com/eric-hang/DisGenIB.

Abstract:
Domain adaptive object detection (DAOD) aims to infer a robust detector on the target domain with the labelled source datasets. Recent studies utilize a feature extractor shared on the source and target domains to capture the domain-invariant features and the task-relevant information with both feature-alignment constraint and source annotations. However, the feature extractor shared across domains discards partial task-relevant information of the target domain due to the domain gap and lack of target annotations, leading to compromised discrimination capabilities within target domain. To this end, we propose a novel REmainder Adaptive CompensaTion network (REACT) to adaptively compensate the extracted features with the remainder features for generating task-relevant features. The key insight is that the remainder features contain the discarded task-relevant information, so they can be adapted to compensate for the inadequate target features. Especially, REACT introduces an additional remainder branch to regain the remainder features, and then adaptively utilizes them to compensate for the discarded task-relevant information, improving discrimination on the target domain. Extensive experiments over multiple cross-domain adaptation tasks with three baselines demonstrate that our approach gains significant improvements and achieves superior performance compared with highly-optimized state-of-the-art methods.

Abstract:
Recent object re-identification (Re-ID) methods gain high efficiency via lightweight student models trained by knowledge distillation (KD). However, the huge architectural difference between lightweight students and heavy teachers causes students to have difficulties in receiving and understanding teachers’ knowledge, thus losing certain accuracy. To this end, we propose a refiner-expander-refiner (RER) structure to enlarge a student’s representational capacity and prune the student’s complexity. The expander is a multi-branch convolutional layer to expand the student’s representational capacity to understand a teacher’s knowledge comprehensively, which does not require any feature-dimensional adapter to avoid knowledge distortions. The two refiners are 1× 1 convolutional layers to prune the input and output channels of the expander. In addition, in order to alleviate the competition accuracy-related and pruning-related gradients, we design a common consensus gradient resetting (CCGR) method, which discards unimportant channels according to the intersection of each sample’s unimportant channel judgment. Finally, the trained RER can be simplified into a slim convolutional layer via re-parameterization to speed up inference. As a result, we propose an expanding and refining hybrid compressing (ERHC) method. Extensive experiments show that our ERHC has superior inference speed and accuracy, e.g., on the VeRi-776 dataset, given the ResNet101 as a teacher, ERHC saves 75.33% model parameters (MP) and 74.29% floating-point of operations (FLOPs) without sacrificing accuracy.

Abstract:
Light fields capture 3D scene information by recording light rays emitted from a scene at various orientations. They offer a more immersive perception, compared with classic 2D images, but at the cost of huge data volumes. In this paper, we design a compact neural network representation for the light field compression task. In the same vein as the deep image prior, the neural network takes randomly initialized noise as input and is trained in a supervised manner in order to best reconstruct the target light field Sub-Aperture Images (SAIs). The network is composed of two types of complementary kernels: descriptive kernels (descriptors) that store scene description information learned during training, and modulatory kernels (modulators) that control the rendering of different SAIs from the queried perspectives. To further enhance compactness of the network meanwhile retain high quality of the decoded light field, we propose modulator allocation and apply kernel tensor decomposition techniques, followed by non-uniform quantization and lossless entropy coding. Extensive experiments demonstrate that our method outperforms other state-of-the-art (SOTA) methods by a significant margin in the light field compression task. Moreover, after adapting descriptors, the modulators learned from one light field can be transferred to new light fields for rendering dense views, showing the potential of the solution for view synthesis.

Abstract:
2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images “as they are”, i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for the current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural models for improving 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, including 3-dimensional point coordinates and 3-dimensional normal. We have validated our proposed framework for various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, i.e., ScanNet v2 and S3DIS, and the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection.

Abstract:
In semi-supervised learning (SSL), many approaches follow the effective self-training paradigm with consistency regularization, utilizing threshold heuristics to alleviate label noise. However, such threshold heuristics lead to the underutilization of crucial discriminative information from the excluded data. In this paper, we present OTAMatch, a novel SSL framework that reformulates pseudo-labeling as an optimal transport (OT) assignment problem and simultaneously exploits data with high confidence to mitigate the confirmation bias. Firstly, OTAMatch models the pseudo-label allocation task as a convex minimization problem, facilitating end-to-end optimization with all pseudo-labels and employing the Sinkhorn-Knopp algorithm for efficient approximation. Meanwhile, we incorporate epsilon-greedy posterior regularization and curriculum bias correction strategies to constrain the distribution of OT assignments, improving the robustness with noisy pseudo-labels. Secondly, we propose PseudoNCE, which explicitly exploits pseudo-label consistency with threshold heuristics to maximize mutual information within self-training, significantly boosting the balance of convergence speed and performance. Consequently, our proposed approach achieves competitive performance on various SSL benchmarks. Specifically, OTAMatch substantially outperforms the previous state-of-the-art SSL algorithms in realistic and challenging scenarios, exemplified by a notable 9.45% error rate reduction over SoftMatch on ImageNet with 100K-label split, underlining its robustness and effectiveness.

Affiliations: School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China; Computer Vision Department, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; Department of Electrical and Computer Engineering, National University of Singapore, Cluny Road, Singapore; Nankai International Advanced Research Institute (Shenzhen Futian), Nankai University, Shenzheng, China; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China

Abstract:
In this study, we propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules, a dynamic interactive kernel (DIK) and a weight-sharing fusion (WSF), which work together to generate effective instance-aware kernels and integrate cross-modal features. To improve the quality of depth features, we incorporate a depth similarity assessment (DSA) module prior to DIK and WSF. In addition, we further contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields a promising result, i.e., 58.0% AP with 320× 480 input size on the COME15K-E test set, which significantly surpasses the alternative frameworks. Our code and dataset will be publicly available at: https://github.com/PJLallen/CalibNet.

Abstract:
Prior methodologies have disregarded the diversities among distinct degradation types during image reconstruction, employing a uniform network model to handle multiple deteriorations. Nevertheless, we discover that prevalent degradation modalities, including sampling, blurring, and noise, can be roughly categorized into two classes. We classify the first class as spatial-agnostic dominant degradations, less affected by regional changes in image space, such as downsampling and noise degradation. The second class degradation type is intimately associated with the spatial position of the image, such as blurring, and we identify them as spatial-specific dominant degradations. We introduce a dynamic filter network integrating global and local branches to address these two degradation types. This network can greatly alleviate the practical degradation problem. Specifically, the global dynamic filtering layer can perceive the spatial-agnostic dominant degradation in different images by applying weights generated by the attention mechanism to multiple parallel standard convolution kernels, enhancing the network’s representation ability. Meanwhile, the local dynamic filtering layer converts feature maps of the image into a spatially specific dynamic filtering operator, which performs spatially specific convolution operations on the image features to handle spatial-specific dominant degradations. By effectively integrating both global and local dynamic filtering operators, our proposed method outperforms state-of-the-art blind super-resolution algorithms in both synthetic and real image datasets.

Abstract:
A common approach to solve inverse imaging problems relies on finding a maximum a posteriori (MAP) estimate of the original unknown image, by solving a minimization problem. In this context, iterative proximal algorithms are widely used, enabling to handle non-smooth functions and linear operators. Recently, these algorithms have been paired with deep learning strategies, to further improve the estimate quality. In particular, proximal neural networks (PNNs) have been introduced, obtained by unrolling a proximal algorithm as for finding a MAP estimate, but over a fixed number of iterations, with learned linear operators and parameters. As PNNs are based on optimization theory, they are very flexible, and can be adapted to any image restoration task, as soon as a proximal algorithm can solve it. They further have much lighter architectures than traditional networks. In this article we propose a unified framework to build PNNs for the Gaussian denoising task, based on both the dual-FB and the primal-dual Chambolle-Pock algorithms. We further show that accelerated inertial versions of these algorithms enable skip connections in the associated NN layers. We propose different learning strategies for our PNN framework, and investigate their robustness (Lipschitz property) and denoising efficiency. Finally, we assess the robustness of our PNNs when plugged in a forward-backward algorithm for an image deblurring problem.

Abstract:
Multi-View 3D object detection (MV3D) has made tremendous progress by leveraging multiple perspective features through surrounding cameras. Despite demonstrating promising prospects in various applications, accurately detecting objects through camera view in the 3D space is extremely difficult due to the ill-posed issue in monocular depth estimation. Recently, Graph-DETR3D presents a novel graph-based 3D-2D query paradigm in aggregating multi-view images for 3D object detection and achieves competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, it still suffers from limited depth and velocity estimation abilities due to the adoption of a single-frame input setting. To solve this problem, we introduce a unified spatial-temporal graph modeling framework to fully leverage the multi-view imagery cues under the multi-frame inputs setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into the 4D space with an effective attention mechanism to automatically perceive imagery information at both spatial and temporal levels. Moreover, considering the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection, to reduce the computational budget while sacrificing no detection accuracy, making it more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework, built on top of Graph-DETR3D. Extensive experiments on nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D.

Abstract:
Despite the large-scale adoption of Artificial Intelligence (AI) models in healthcare, there is an urgent need for trustworthy tools to rigorously backtrack the model decisions so that they behave reliably. Counterfactual explanations take a counter-intuitive approach to allow users to explore “what if” scenarios gradually becoming popular in the trustworthy field. However, most previous work on model’s counterfactual explanation cannot generate in-distribution attribution credibly, produces adversarial examples, or fails to give a confidence interval for the explanation. Hence, in this paper, we propose a novel approach that generates counterfactuals in locally smooth directed semantic embedding space, and at the same time gives an uncertainty estimate in the counterfactual generation process. Specifically, we identify low-dimensional directed semantic embedding space based on Principal Component Analysis (PCA) applied in differential generative model. Then, we propose latent space smoothing regularization to rectify counterfactual search within in-distribution, such that visually-imperceptible changes are more robust to adversarial perturbations. Moreover, we put forth an uncertainty estimation framework for evaluating counterfactual uncertainty. Extensive experiments on several challenging realistic Chest X-ray and CelebA datasets show that our approach performs consistently well and better than the existing several state-of-the-art baseline approaches.

Abstract:
Pedestrian trajectory prediction is a critical component of autonomous driving in urban environments, allowing vehicles to anticipate pedestrian movements and facilitate safer interactions. While egocentric-view-based algorithms can reduce the sensing and computation burdens of 3D scene reconstruction, accurately predicting pedestrian trajectories and interpreting their intentions from this perspective requires a better understanding of the coupled vehicle (camera) and pedestrian motions, which has not been adequately addressed by existing models. In this paper, we present a novel egocentric pedestrian trajectory prediction approach that uses a two-tower structure and multi-modal inputs. One tower, the vehicle module, receives only the initial pedestrian position and ego-vehicle actions and speed, while the other, the pedestrian module, receives additional prior pedestrian trajectory and visual features. Our proposed action-aware loss function allows the two-tower model to decompose pedestrian trajectory predictions into two parts, caused by ego-vehicle movement and pedestrian movement, respectively, even when only trained on combined ego-view motions. This decomposition increases model flexibility and provides a better estimation of pedestrian actions and intentions, enhancing overall performance. Experiments on three publicly available benchmark datasets show that our proposed model outperforms all existing algorithms in ego-view pedestrian trajectory prediction accuracy.

Abstract:
Balancing the trade-off between accuracy and speed for obtaining higher performance without sacrificing the inference time is a challenging topic for object detection task. Knowledge distillation, which serves as a kind of model compression techniques, provides a potential and feasible way to handle above efficiency and effectiveness issue through transferring the dark knowledge from the sophisticated teacher detector to the simple student one. Despite demonstrating promising solutions to make harmonies between accuracy and speed, current knowledge distillation for object detection methods still suffer from two limitations. Firstly, most of the methods are inherited or refereed from the frameworks in image classification task, and deploy an implicit manner by imitating or constraining the features from the intermediate layers or the output predictions between the teacher and student models. While little consideration has been raised to the intrinsic relevance of the classification and localization predictions in object detection task. Besides, these methods fail to investigate the relationship between detection and distillation tasks in knowledge distillation pipeline, and they train the whole network by simply integrating losses from these two different tasks through hand-crafted designation parameters. For addressing the aforementioned issues, we propose a novel Relation Knowledge Distillation by Auxiliary Learning for Object Detection (ReAL) method in this paper. Specifically, we first design a prediction relation distillation module which makes the student model directly mimic the output predictions from the teacher one, and conduct self and mutual relation distillation losses to excavate the relation information between teacher and student models. Moreover, for better devolving into the relationship between different tasks in distillation pipeline, we introduce the auxiliary learning into knowledge distillation for object detection and develop a dynamic weight adaptation strategy. Through regarding detection task as primary task and treating distillation task as auxiliary task in auxiliary learning framework, we dynamically adjust and regularize the corresponding weights of the losses for these tasks during the training process. Experiments on MS COCO dataset are conducted using various detector combinations of teacher and student models and the results show that our proposed ReAL can achieve obvious improvement on different distillation model configurations, while performing favorably against state-of-the-arts.

Abstract:
Modern visual recognition models often display overconfidence due to their reliance on complex deep neural networks and one-hot target supervision, resulting in unreliable confidence scores that necessitate calibration. While current confidence calibration techniques primarily address single-label scenarios, there is a lack of focus on more practical and generalizable multi-label contexts. This paper introduces the Multi-Label Confidence Calibration (MLCC) task, aiming to provide well-calibrated confidence scores in multi-label scenarios. Unlike single-label images, multi-label images contain multiple objects, leading to semantic confusion and further unreliability in confidence scores. Existing single-label calibration methods, based on label smoothing, fail to account for category correlations, which are crucial for addressing semantic confusion, thereby yielding sub-optimal performance. To overcome these limitations, we propose the Dynamic Correlation Learning and Regularization (DCLR) algorithm, which leverages multi-grained semantic correlations to better model semantic confusion for adaptive regularization. DCLR learns dynamic instance-level and prototype-level similarities specific to each category, using these to measure semantic correlations across different categories. With this understanding, we construct adaptive label vectors that assign higher values to categories with strong correlations, thereby facilitating more effective regularization. We establish an evaluation benchmark, re-implementing several advanced confidence calibration algorithms and applying them to leading multi-label recognition (MLR) models for fair comparison. Through extensive experiments, we demonstrate the superior performance of DCLR over existing methods in providing reliable confidence scores in multi-label scenarios.

Abstract:
Camouflaged objects often blend in with their surroundings, making the perception of a camouflaged object a more complex procedure. However, most neural-network-based methods that simulate the visual information processing pathway of creatures only roughly define the general process, which deficiently reproduces the process of identifying camouflaged objects. How to make modeled neural networks perceive camouflaged objects as effectively as creatures is a significant topic that deserves further consideration. After meticulous analysis of biological visual information processing, we propose an end-to-end prudent and comprehensive neural network, termed IdeNet, to model the critical information processing. Specifically, IdeNet divides the entire perception process into five stages: information collection, information augmentation, information filtering, information localization, and information correction and object identification. In addition, we design tailored visual information processing mechanisms for each stage, including the information augmentation module (IAM), the information filtering module (IFM), the information localization module (ILM), and the information correction module (ICM), to model the critical visual information processing and establish the inextricable association of biological behavior and visual information processing. The extensive experiments show that IdeNet outperforms state-of-the-art methods in all benchmarks, demonstrating the effectiveness of the five-stage partitioning of visual information processing pathway and the tailored visual information processing mechanisms for camouflaged object detection. Our code is publicly available at: https://github.com/whyandbecause/IdeNet.

Abstract:
Weakly supervised video anomaly detection aims to locate abnormal activities in untrimmed videos without the need for frame-level supervision. Prior work has utilized graph convolution networks or self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features. However, these approaches are limited in two aspects: 1) Multi-branch parallel architectures, while capturing multi-scale temporal dependencies, inevitably lead to increased parameter and computational costs. 2) The binarized MIL constraint only ensures the interclass separability while neglecting the fine-grained discriminability within anomalous classes. To this end, we introduce a novel WS-VAD framework that focuses on efficient temporal modeling and anomaly innerclass discriminability. We first construct a Temporal Context Aggregation (TCA) module that simultaneously captures local-global dependencies by reusing an attention matrix along with adaptive context fusion. In addition, we propose a Prompt-Enhanced Learning (PEL) module that incorporates semantic priors using knowledge-based prompts to boost the discrimination of visual features while ensuring separability across anomaly subclasses. The proposed components have been validated through extensive experiments, which demonstrate superior performance on three challenging datasets, UCF-Crime, XD-Violence and ShanghaiTech, with fewer parameters and reduced computational effort. Notably, our method can significantly improve the detection accuracy for certain anomaly subclasses and reduced the false alarm rate. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.

Abstract:
Self-training is a simple yet effective method for semi-supervised learning, during which pseudo-label selection plays an important role for handling confirmation bias. Despite its popularity, applying self-training to landmark detection faces three problems: 1) The selected confident pseudo-labels often contain data bias, which may hurt model performance; 2) It is not easy to decide a proper threshold for sample selection as the localization task can be sensitive to noisy pseudo-labels; 3) coordinate regression does not output confidence, making selection-based self-training infeasible. To address the above issues, we propose Self-Training for Landmark Detection (STLD), a method that does not require explicit pseudo-label selection. Instead, STLD constructs a task curriculum to deal with confirmation bias, which progressively transitions from more confident to less confident tasks over the rounds of self-training. Pseudo pretraining and shrink regression are two essential components for such a curriculum, where the former is the first task of the curriculum for providing a better model initialization and the latter is further added in the later rounds to directly leverage the pseudo-labels in a coarse-to-fine manner. Experiments on three facial and one medical landmark detection benchmark show that STLD outperforms the existing methods consistently in both semi- and omni-supervised settings. The code is available at https://github.com/jhb86253817/STLD.

Abstract:
Image geo-localization aims to locate a query image from source platform (e.g., drones, street vehicle) by matching it with Geo-tagged reference images from the target platforms (e.g., different satellites). Achieving cross-modal or cross-view real-time (>30fps) image localization with the guaranteed accuracy in a unified framework remains a challenge due to the huge differences in modalities and views between the two platforms. In order to solve this problem, a novel fine-grained overlap estimation based image geo-localization method is proposed in this paper, the core of which is to estimate the salient and subtle overlapping regions in image pairs to ensure correct matching. Specifically, the high-level semantic features of input images are extracted by a deep convolutional neural network. Then, a novel overlap scanning module (OSM) is presented to mine the long-range spatial and channel dependencies of semantic features in various subspaces, thereby identifying fine-grained overlapping regions. Finally, we adopt the triplet ranking loss to guide the proposed network optimization so that the matching regions are as close as possible and the most mismatched regions are as far away as possible. To demonstrate the effectiveness of our FOENet, comprehensive experiments are conducted on three cross-view benchmarks and one cross-modal benchmark. Our FOENet yields better performance in various metrics and the recall accuracy at top 1 (R@1) is significantly improved, with a maximum improvement of 70.6%. In addition, the proposed model runs fast on a single RTX 6000, reaching real-time inference speed on all datasets, with the fastest being 82.3 FPS.

Abstract:
The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have created a growing demand for cross-modal retrieval, enabling users to query semantically relevant data across different modalities. Existing methods heavily rely on class labels to bridge semantic correlations; however, collecting large-scale, well-labeled data is expensive and often impractical, making unsupervised learning more attractive and feasible. Nonetheless, unsupervised cross-modal learning faces challenges in bridging semantic correlations due to the lack of label information, leading to unreliable discrimination. In this paper, we reveal and study a novel problem: unsupervised cross-modal learning with noisy pseudo-labels. To address this issue, we propose a 2D-3D unsupervised multimodal learning framework that leverages multimodal data. Our framework consists of three key components: 1) Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised learning manner. 2) Robust Discriminative Learning (RDL) further mines the discrimination from the learned imperfect predictions after warming up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) to alleviate the influence of the uncertain samples, thus embracing robustness against noisy pseudo labels. 3) Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy to enforce SSM and RDL to produce common representations. We conduct comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, thereby demonstrating its effectiveness and superiority.

Abstract:
Existing multi-view classification algorithms usually assume that all examples have observations on all views, and the data in different views are clean. However, in real-world applications, we are often provided with data that have missing representations or contain noise on some views (i.e., missing or noise views). This may lead to significant performance degeneration, and thus many algorithms are proposed to address the incomplete view or noisy view issues. However, most of existing algorithms deal with the two issues separately, and hence may fail when both missing and noisy views exist. They are also usually not flexible in that the view or feature significance cannot be adaptively identified. Besides, the view missing patterns may vary in the training and test phases, and such difference is often ignored. To remedy these drawbacks, we propose a novel multi-view classification framework that is simultaneously robust to both incomplete and noisy views. This is achieved by integrating early fusion and late fusion in a single framework. Specifically, in our early fusion module, we propose a view-aware transformer to mask the missing views and adaptively explore the relationships between views and target tasks to deal with missing views. Considering that view missing patterns may change from the training to the test phase, we also design single-view classification and category-consistency constraints to reduce the dependence of our model on view-missing patterns. In our late fusion module, we quantify the view uncertainty in an ensemble way to estimate the noise level of that view. Then the uncertainty and prediction logits of different views are integrated to make our model robust to noisy views. The framework is trained in an end-to-end manner. Experimental results on diverse datasets demonstrate the robustness and effectiveness of our model for both incomplete and noisy views. Codes are available at https://github.com/li-yapeng/UVaT.

Abstract:
Autonomous underwater vehicles (AUVs) equipped with the intelligent underwater object detection technique is of great significance for underwater navigation. Advanced underwater object detection frameworks adopt skip connections to enhance the feature representation which further boosts the detection precision. However, we reveal two limitations of standard skip connections: 1) standard skip connections do not consider the feature heterogeneity, resulting in a sub-optimal feature fusion strategy; 2) feature redundancy exists in the skip connected features that not all the channels in the fused feature maps are equally important, the network learning should focus on the informative channels rather than the redundant ones. In this paper, we propose a novel channel-weighted skip connection network (CWSCNet) to learn multiple hyper fusion features for improving multi-scale underwater object detection. In CWSCNet, a novel feature fusion module, named channel-weighted skip connection (CWSC), is proposed to adaptively adjust the importance of different channels during feature fusion. The CWSC module removes feature heterogeneity that strengthens the compatibility of different feature maps, it also works as an effective feature selection strategy that enables CWSCNet to focus on learning channels with more object-related information. Extensive experiments on three underwater object detection datasets RUOD, URPC2017 and URPC2018 show that the proposed CWSCNet achieves comparable or state-of-the-art performances in underwater object detection.

Abstract:
Semi-supervised semantic segmentation focuses on the exploration of a small amount of labeled data and a large amount of unlabeled data, which is more in line with the demands of real-world image understanding applications. However, it is still hindered by the inability to fully and effectively leverage unlabeled images. In this paper, we reveal that cross-window consistency (CWC) is helpful in comprehensively extracting auxiliary supervision from unlabeled data. Additionally, we propose a novel CWC-driven progressive learning framework to optimize the deep network by mining weak-to-strong constraints from massive unlabeled data. More specifically, this paper presents a biased cross-window consistency (BCC) loss with an importance factor, which helps the deep network explicitly constrain confidence maps from overlapping regions in different windows to maintain semantic consistency with larger contexts. In addition, we propose a dynamic pseudo-label memory bank (DPM) to provide high-consistency and high-reliability pseudo-labels to further optimize the network. Extensive experiments on three representative datasets of urban views, medical scenarios, and satellite scenes with consistent performance gain demonstrate the superiority of our framework. Our code is released at https://jack-bo1220.github.io/project/CWC.html.

Abstract:
IR cameras are widely used for temperature measurements in various applications, including agriculture, medicine, and security. Low-cost IR cameras have the immense potential to replace expensive radiometric cameras in these applications; however, low-cost microbolometer-based IR cameras are prone to spatially variant nonuniformity and to drift in temperature measurements, which limit their usability in practical scenarios. To address these limitations, we propose a novel approach for simultaneous temperature estimation and nonuniformity correction (NUC) from multiple frames captured by low-cost microbolometer-based IR cameras. We leverage the camera’s physical image-acquisition model and incorporate it into a deep-learning architecture termed kernel prediction network (KPN), which enables us to combine multiple frames despite imperfect registration between them. We also propose a novel offset block that incorporates the ambient temperature into the model and enables us to estimate the offset of the camera, which is a key factor in temperature estimation. Our findings demonstrate that the number of frames has a significant impact on the accuracy of the temperature estimation and NUC. Moreover, introduction of the offset block results in significantly improved performance compared to vanilla KPN. The method was tested on real data collected by a low-cost IR camera mounted on an unmanned aerial vehicle, showing only a small average error of 0.27-0.54^\circ C relative to costly scientific-grade radiometric cameras. Real data collected horizontally resulted in similar errors of 0.48-0.68^\circ C . Our method provides an accurate and efficient solution for simultaneous temperature estimation and NUC, which has important implications for a wide range of practical applications.

Abstract:
Despite the great progress of unsupervised domain adaptation (UDA) with the deep neural networks, current UDA models are opaque and cannot provide promising explanations, limiting their applications in the scenarios that require safe and controllable model decisions. At present, a surge of work focuses on designing deep interpretable methods with adequate data annotations and only a few methods consider the distributional shift problem. Most existing interpretable UDA methods are post-hoc ones, which cannot facilitate the model learning process for performance enhancement. In this paper, we propose an inherently interpretable method, named Transferable Conceptual Prototype Learning (TCPL), which could simultaneously interpret and improve the processes of knowledge transfer and decision-making in UDA. To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process. With the learned transferable prototypes, a self-predictive consistent pseudo-label strategy that fuses confidence, predictions, and prototype information, is designed for selecting suitable target samples for pseudo annotations and gradually narrowing down the domain gap. Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-arts. Code is available at https://drive.google.com/file/d/1b1EHFghiF1ExD-Cn1HYg75VutfkXWp60/view?usp=sharing.

Abstract:
Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SeamnticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at https://github.com/Jieqianyu/SGN.

Abstract:
Fusion of a panchromatic (PAN) image and corresponding multispectral (MS) image is also known as pansharpening, which aims to combine abundant spatial details of PAN and spectral information of MS images. Due to the absence of high-resolution MS images, available deep-learning-based methods usually follow the paradigm of training at reduced resolution and testing at both reduced and full resolution. When taking original MS and PAN images as inputs, they always obtain sub-optimal results due to the scale variation. In this paper, we propose to explore the self-supervised representation for pansharpening by designing a cross-predictive diffusion model, named CrossDiff. It has two-stage training. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on conditional Denoising Diffusion Probabilistic Model (DDPM). While in the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS images, and only the fusion head is trained to adapt for pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. Besides, the cross-sensor experiments also verify the generalization ability of proposed self-supervised representation learners for other satellite datasets. Code is available at https://github.com/codgodtao/CrossDiff.

Abstract:
In this paper, we address a complex but practical scenario in Active Learning (AL) known as open-set AL, where the unlabeled data consists of both in-distribution (ID) and out-of-distribution (OOD) samples. Standard AL methods will fail in this scenario as OOD samples are highly likely to be regarded as uncertain samples, leading to their selection and wasting of the budget. Existing methods focus on selecting the highly likely ID samples, which tend to be easy and less informative. To this end, we introduce two criteria, namely contrastive confidence and historical divergence, which measure the possibility of being ID and the hardness of a sample, respectively. By balancing the two proposed criteria, highly informative ID samples can be selected as much as possible. Furthermore, unlike previous methods that require additional neural networks to detect the OOD samples, we propose a contrastive clustering framework that endows the classifier with the ability to identify the OOD samples and further enhances the network’s representation learning. The experimental results demonstrate that the proposed method achieves state-of-the-art performance on several benchmark datasets.

Abstract:
Instance shadow detection, crucial for applications such as photo editing and light direction estimation, has undergone significant advancements in predicting shadow instances, object instances, and their associations. The extension of this task to videos presents challenges in annotating diverse video data and addressing complexities arising from occlusion and temporary disappearances within associations. In response to these challenges, we introduce ViShadow, a semi-supervised video instance shadow detection framework that leverages both labeled image data and unlabeled video data for training. ViShadow features a two-stage training pipeline: the first stage, utilizing labeled image data, identifies shadow and object instances through contrastive learning for cross-frame pairing. The second stage employs unlabeled videos, incorporating an associated cycle consistency loss to enhance tracking ability. A retrieval mechanism is introduced to manage temporary disappearances, ensuring tracking continuity. The SOBA-VID dataset, comprising unlabeled training videos and labeled testing videos, along with the SOAP-VID metric, is introduced for the quantitative evaluation of VISD solutions. The effectiveness of ViShadow is further demonstrated through various video-level applications such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.

Abstract:
This study tackles image matching in difficult scenarios, such as scenes with significant variations or limited texture, with a strong emphasis on computational efficiency. Previous studies have attempted to address this challenge by encoding global scene contexts using Transformers. However, these approaches have high computational costs and may not capture sufficient high-level contextual information, such as spatial structures or semantic shapes. To overcome these limitations, we propose a novel image-matching method that leverages a topic-modeling strategy to capture high-level contexts in images. Our method represents each image as a multinomial distribution over topics, where each topic represents semantic structures. By incorporating these topics, we can effectively capture comprehensive context information and obtain discriminative and high-quality features. Notably, our coarse-level matching network enhances efficiency by employing attention layers only to fixed-sized topics and small-sized features. Finally, we design a dynamic feature refinement network for precise results at a finer matching stage. Through extensive experiments, we have demonstrated the superiority of our method in challenging scenarios. Specifically, our method ranks in the top 9% in the Image Matching Challenge 2023 without using ensemble techniques. Additionally, we achieve an approximately 50% reduction in computational costs compared to other Transformer-based methods. Code is available at https://github.com/TruongKhang/TopicFM.

Abstract:
Semi-supervised Action Quality Assessment (AQA) using limited labeled and massive unlabeled samples to achieve high-quality assessment is an attractive but challenging task. The main challenge relies on how to exploit solid and consistent representations of action sequences for building a bridge between labeled and unlabeled samples in the semi-supervised AQA. To address the issue, we propose a Self-supervised sub-Action Parsing Network (SAP-Net) that employs a teacher-student network structure to learn consistent semantic representations between labeled and unlabeled samples for semi-supervised AQA. We perform actor-centric region detection and generate high-quality pseudo-labels in the teacher branch and assists the student branch in learning discriminative action features. We further design a self-supervised sub-action parsing solution to locate and parse fine-grained sub-action sequences. Then, we present the group contrastive learning with pseudo-labels to capture consistent motion-oriented action features in the two branches. We evaluate our proposed SAP-Net on four public datasets: the MTL-AQA, FineDiving, Rhythmic Gymnastics, and FineFS datasets. The experiment results show that our approach outperforms state-of-the-art semi-supervised methods by a significant margin.

Abstract:
The sphere-to-plane projection of 360-degree video introduces substantial stretched redundant data, which is discarded when reprojected to the 3D sphere for display. Consequently, encoding and transmitting such redundant data is unnecessary. Highly redundant blocks can be referred to as all-zero blocks (AZBs). Detecting these AZBs in advance can reduce computational and transmission resource consumption. However, this cannot be achieved by existing AZB detection techniques due to the unawareness of the stretching redundancy. In this paper, we first derive a latitude-adaptive redundancy detection (LARD) approach to adaptively detect coefficients carrying redundancy in transformed blocks by modeling the dependency between valid frequency range and the stretching degree based on spectrum analysis. Utilizing LARD, a latitude-redundancy-aware AZB detection scheme tailored for fast 360-degree video coding (LRAS) is proposed to accelerate the encoding process. LRAS consists of three sequential stages: latitude-adaptive AZB (L-AZB) detection, latitude-adaptive genuine-AZB (LG-AZB) detection and latitude-adaptive pseudo-AZB (LP-AZB) detection. Specifically, L-AZB refers to the AZB introduced by projection. LARD is used to detect L-AZB directly. LG-AZB refers to the AZB after hard-decision quantization and zeroing redundant coefficients. A novel latitude-adaptive sum of absolute difference estimation model is built to derive the threshold for LG-AZB detection. LP-AZB refers to the AZB in terms of rate-distortion optimization considering redundancy. A latitude-adaptive rate-distortion model is established for LP-AZB detection. Experimental results show that LRAS can achieve an average total encoding time reduction of 25.85% and 20.38% under low-delay and random access configurations compared to the original HEVC encoder, with only 0.16% and 0.13% BDBR increases and 0.01dB BDPSNR loss, respectively. The transform and quantization time savings are 60.13% and 59.94% on average.

Abstract:
Despite recent advances, developing general-purpose universal denoising and artifact-removal networks remains largely an open problem: Given fixed network weights, one inherently trades-off specialization at one task (e.g., removing Poisson noise) for performance at another (e.g., removing speckle noise). In addition, training such a network is challenging due to the curse of dimensionality: As one increases the dimensions of the specification-space (i.e., the number of parameters needed to describe the noise distribution) the number of unique specifications one needs to train for grows exponentially. Uniformly sampling this space will result in a network that does well at very challenging problem specifications but poorly at easy problem specifications, where even large errors will have a small effect on the overall mean squared error. In this work we propose training denoising networks using an adaptive-sampling/active-learning strategy. Our work improves upon a recently proposed universal denoiser training strategy by extending these results to higher dimensions and by incorporating a polynomial approximation of the true specification-loss landscape. This approximation allows us to reduce training times by almost two orders of magnitude. We test our method on simulated joint Poisson-Gaussian-Speckle noise and demonstrate that with our proposed training strategy, a single blind, generalist denoiser network can achieve peak signal-to-noise ratios within a uniform bound of specialized denoiser networks across a large range of operating conditions. We also capture a small dataset of images with varying amounts of joint Poisson-Gaussian-Speckle noise and demonstrate that a universal denoiser trained using our adaptive-sampling strategy outperforms uniformly trained baselines.

Abstract:
We present a simple and effective approach to explore both local spatial-temporal contexts and non-local temporal information for video deblurring. First, we develop an effective spatial-temporal contextual transformer to explore local spatial-temporal contexts from videos. As the features extracted by the spatial-temporal contextual transformer does not model the non-local temporal information of video well, we then develop a feature propagation method to aggregate useful features from the long-range frames so that both local spatial-temporal contexts and non-local temporal information can be better utilized for video deblurring. Finally, we formulate the spatial-temporal contextual transformer with the feature propagation into a unified deep convolutional neural network (CNN) and train it in an end-to-end manner. We show that using the spatial-temporal contextual transformer with the feature propagation is able to generate useful features and makes the deep CNN model more compact and effective for video deblurring. Extensive experimental results show that the proposed method performs favorably against state-of-the-art ones on the benchmark datasets in terms of accuracy and model parameters.

Abstract:
Partial point cloud registration aims to align partial scans into a shared coordinate system. While learning-based partial point cloud registration methods have achieved remarkable progress, they often fail to take full advantage of the relative positional relationships both within (intra-) and between (inter-) point clouds. This oversight hampers their ability to accurately identify overlapping regions and search for reliable correspondences. To address these limitations, a diverse inlier consistency (DIC) method has been proposed that adaptively embeds the positional information of a reliable correspondence in the intra- and inter-point cloud. Firstly, a diverse inlier consistency-driven region perception (DICdRP) module is devised, which encodes the positional information of the selected correspondence within the intra-point cloud. This module enhances the sensitivity of all points to overlapping regions by recognizing the position of the selected correspondence. Secondly, a diverse inlier consistency-aware correspondence search (DICaCS) module is developed, which leverages relative positions in the inter-point cloud. This module studies an inter-point cloud DIC weight to supervise correspondence compatibility, allowing for precise identification of correspondences and effective outlier filtration. Thirdly, diverse information is integrated throughout our framework to achieve a more holistic and detailed registration process. Extensive experiments on object-level and scene-level datasets demonstrate the superior performance of the proposed algorithm. The code is available at https://github.com/yxzhang15/DIC.

Abstract:
Event-stream representation is the first step for many computer vision tasks using event cameras. It converts the asynchronous event-streams into a formatted structure so that conventional machine learning models can be applied easily. However, most of the state-of-the-art event-stream representations are manually designed and the quality of these representations cannot be guaranteed due to the noisy nature of event-streams. In this paper, we introduce a data-driven approach aiming at enhancing the quality of event-stream representations. Our approach commences with the introduction of a new event-stream representation based on spatial-temporal statistics, denoted as EvRep. Subsequently, we theoretically derive the intrinsic relationship between asynchronous event-streams and synchronous video frames. Building upon this theoretical relationship, we train a representation generator, RepGen, in a self-supervised learning manner accepting EvRep as input. Finally, the event-streams are converted to high-quality representations, termed as EvRepSL, by going through the learned RepGen (without the need of fine-tuning or retraining). Our methodology is rigorously validated through extensive evaluations on a variety of mainstream event-based classification and optical flow datasets (captured with various types of event cameras). The experimental results highlight not only our approach’s superior performance over existing event-stream representations but also its versatility, being agnostic to different event cameras and tasks.

Abstract:
Remote sensing anomaly detector can find the objects deviating from the background as potential targets for Earth monitoring. Given the diversity in earth anomaly types, designing a transferring model with cross-modality detection ability should be cost-effective and flexible to new earth observation sources and anomaly types. However, the current anomaly detectors aim to learn the certain background distribution, the trained model cannot be transferred to unseen images. Inspired by the fact that the deviation metric for score ranking is consistent and independent from the image distribution, this study exploits the learning target conversion from the varying background distribution to the consistent deviation metric. We theoretically prove that the large-margin condition in labeled samples ensures the transferring ability of learned deviation metric. To satisfy this condition, two large margin losses for pixel-level and feature-level deviation ranking are proposed respectively. Since the real anomalies are difficult to acquire, anomaly simulation strategies are designed to compute the model loss. With the large-margin learning for deviation metric, the trained model achieves cross-modality detection ability in five modalities—hyperspectral, visible light, synthetic aperture radar (SAR), infrared and low-light—in zero-shot manner.

Abstract:
Large-scale pre-trained vision-language models (e.g., CLIP) have shown incredible generalization performance in downstream tasks such as video-text retrieval (VTR). Traditional approaches have leveraged CLIP’s robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data. Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., 8.2% R@1 for video-to-text on MSRVTT, 12.2% R@1 for video-to-text on DiDeMo, and 10.9% R@1 for video-to-text on ActivityNet, respectively.

Abstract:
Millimeter-wave (mmWave) radar pointcloud offers attractive potential for 3D sensing, thanks to its robustness in challenging conditions such as smoke and low illumination. However, existing methods failed to simultaneously address the three main challenges in mmWave radar pointcloud reconstruction: specular information lost, low angular resolution, and severe interference. In this paper, we propose DREAM-PCD, a novel framework specifically designed for real-time 3D environment sensing that combines signal processing and deep learning methods into three well-designed components to tackle all three challenges: Non-Coherent Accumulation for dense points, Synthetic Aperture Accumulation for improved angular resolution, and Real-Denoise Multiframe network for interference removal. By leveraging causal multiple viewpoints accumulation and the “real-denoise” mechanism, DREAM-PCD significantly enhances the generalization performance and real-time capability. We also introduce RadarEyes, the largest mmWave indoor dataset with over 1,000,000 frames, featuring a unique design incorporating two orthogonal single-chip radars, Lidar, and camera, enriching dataset diversity and applications. Experimental results demonstrate that DREAM-PCD surpasses existing methods in reconstruction quality, and exhibits superior generalization and real-time capabilities, enabling high-quality real-time reconstruction of radar pointcloud under various parameters and scenarios. We believe that DREAM-PCD, along with the RadarEyes dataset, will significantly advance mmWave radar perception in future real-world applications.

Abstract:
Graph Convolutional Networks (GCN) which typically follows a neural message passing framework to model dependencies among skeletal joints has achieved high success in skeleton-based human motion prediction task. Nevertheless, how to construct a graph from a skeleton sequence and how to perform message passing on the graph are still open problems, which severely affect the performance of GCN. To solve both problems, this paper presents a Dynamic Dense Graph Convolutional Network (DD-GCN), which constructs a dense graph and implements an integrated dynamic message passing. More specifically, we construct a dense graph with 4D adjacency modeling as a comprehensive representation of motion sequence at different levels of abstraction. Based on the dense graph, we propose a dynamic message passing framework that learns dynamically from data to generate distinctive messages reflecting sample-specific relevance among nodes in the graph. Extensive experiments on benchmark Human 3.6M and CMU Mocap datasets verify the effectiveness of our DD-GCN which obviously outperforms state-of-the-art GCN-based methods, especially when using long-term and our proposed extremely long-term protocol.

Abstract:
Single image de-raining is an emerging paradigm for many outdoor computer vision applications since rain streaks can significantly degrade the visibility and render the function compromised. The introduction of deep learning (DL) has brought about substantial advancement on de-raining methods. However, most existing DL-based methods use single homogeneous network architecture to generate de-rained images in a general image restoration manner, ignoring the discrepancy between rain location detection and rain intensity estimation. We find that this discrepancy would cause feature interference and representation ability degradation problems which significantly affect de-raining performance. In this paper, we propose a novel heterogeneous de-raining architecture aiming to decouple rain location detection and rain intensity estimation (DLINet). For these two subtasks, we provide dedicated network structures according to their differential properties to meet their respective performance requirements. To coordinate the decoupled subnetworks, we develop a high-order collaborative network learning the dynamic inter-layer interactions between rain location and intensity. To effectively supervise the decoupled subnetworks during training, we propose a novel training strategy that imposes task-oriented supervision using the label learned via joint training. Extensive experiments on synthetic datasets and real-world rainy scenes demonstrate that the proposed method has great advantages over existing state-of-the-art methods.

Abstract:
With growing demand for point cloud coding, Video-based Point Cloud Compression (V-PCC) is released for dynamic point clouds, relying on mature 2D video coding techniques. However, the huge computational complexity of 2D video codec is inherited by V-PCC, thereby resulting in a notably time-consuming encoding process for the projection videos. To accelerate the compression, this paper proposes a low complexity coding unit decision algorithm for V-PCC intra coding. First, the 2D sequences (occupancy, geometry, and attribute sequences) are projected from same 3D point could frames in V-PCC. By exploring the strong correlations among them, the cross-projection information is creatively proposed for improving the predication performance of CU partition. Second, considering the disparate coding losses generated by incorrect partitioning decisions of different CUs, we develop a rate-distortion-oriented learning approach aimed at increasing the decision accuracy of the CUs, severely affecting coding performance. Third, to accommodate the particular coding architecture of V-PCC intra configuration, we further devise an overall framework, including targeted feature extraction and partitioning decision for intra and inter coding of geometry and attribute sequences. The final experimental results strongly demonstrate the effectiveness of our proposed algorithm. The time consumption of total projection sequence compression can be reduced by 57.80%, while the coding losses on Geom.BD-TotalRate (D1 and D2) and Attr.BD-TotalRate Luma component are only 0.08%, 0.33%, and 0.16%, respectively, which can be negligible. To the best of our knowledge, the proposed algorithm achieves state-of-the-art performance for accelerating the projection sequence compression in V-PCC All-Intra configuration.

Abstract:
The effective use of long-range information can yield improved network performance, which is very important for image restoration. Although local window-based models have linear complexity and can be feasibly applied to process high-resolution images, a single-scale window has a limited receptive field and is less efficient for encoding long-range context information. To address this issue, this paper presents a single-stage multiscale spatial rearrangement multilayer perceptron (MSSR-MLP) architecture that can obtain information at different scales within a local window. Specifically, we propose a simple and efficient spatial rearrangement module (SRM) that moves information outside the local window to the inside of the local window so that long-range dependencies can be modeled using only a window-based fully connected (FC) layer. The SRM can extend the local receptive field of a window-based FC layer without introducing additional parameters and FLOPs. Utilizing several spatial rearrangement modules with different step sizes, we design an efficient multiscale spatial rearrangement MLP architecture for image restoration. This design aggregates multiscale information to achieve improved restoration quality while maintaining a low computational cost. Extensive experiments conducted on several image restoration tasks demonstrate the efficiency and effectiveness of our method. For example, it requires only ~4.3% of the FLOPs needed by SwinIR for Gaussian gray image denoising, ~13.9% of the FLOPs needed by \mathrm C^2 PNet for single-image dehazing and ~18.9% of the FLOPs needed by MAXIM for single-image motion deblurring but achieves better performance on each of these restoration tasks.

Abstract:
Effectively evaluating the perceptual quality of dehazed images remains an under-explored research issue. In this paper, we propose a no-reference complex-valued convolutional neural network (CV-CNN) model to conduct automatic dehazed image quality evaluation. Specifically, a novel CV-CNN is employed that exploits the advantages of complex-valued representations, achieving better generalization capability on perceptual feature learning than real-valued ones. To learn more discriminative features to analyze the perceptual quality of dehazed images, we design a dual-stream CV-CNN architecture. The dual-stream model comprises a distortion-sensitive stream that operates on the dehazed RGB image, and a haze-aware stream on a novel dark channel difference image. The distortion-sensitive stream accounts for perceptual distortion artifacts, while the haze-aware stream addresses the possible presence of residual haze. Experimental results on three publicly available dehazed image quality assessment (DQA) databases demonstrate the effectiveness and generalization of our proposed CV-CNN DQA model as compared to state-of-the-art no-reference image quality assessment algorithms.

Abstract:
As a fundamental and challenging task in bridging language and vision domains, Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality, and its key challenge is to measure the semantic similarity across different modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) It hurts the accuracy of the representation by directly exploiting the bottom-up attention based region-level features where each region is equally treated. (2) It limits the scale of negative sample pairs by employing the mini-batch based end-to-end training mechanism. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we delicately design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via the self-attention algorithm, noted as Self-Guided Enhancement (SGE) module. The other module benefits from the pre-trained CLIP module, which provides a novel scheme to exploit and transfer the knowledge from an off-the-shelf model, noted as CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of MoCo into ITR, in which two dynamic queues are employed to enrich and enlarge the scale of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn from mini-batch based and dynamic queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of both retrieval accuracy and inference efficiency. For instance, compared with the existing best method NAAF, the metric R@1 of our USER on the MSCOCO 5K Testing set is improved by 5% and 2.4% on caption retrieval and image retrieval without any external knowledge or pre-trained model while enjoying over 60 times faster inference speed. Our source code will be released at https://github.com/zhangy0822/USER.

Abstract:
Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences of representations of different modalities, learning indistinguishable representations or explicit modality transfer. The first two approaches suffer from the loss of discriminant information while removing the modality-specific variations. The third one heavily relies on the successful modality transfer, could face catastrophic performance drop when explicit modality transfers are not possible or difficult. To tackle this problem, we proposed a compact encoder-decoder neural module (cmUNet) to learn modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet where cmUNet is connected to a standard feature extraction network which takes as inputs the modality-agnostic representations and outputs similarity scores for matching. We validated our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet showed superior performance compared to state-of-the-art methods. Furthermore, it is observed that a cross-modality matching method could be biased to extract discriminant information from partial or even wrong regions, due to incompetence of dealing with modality gaps, which subsequently leads to poor generalization. We show that robustness to occlusions can be an indicator of whether a method can well bridge the modality gap. This, to our knowledge, has been largely neglected in the previous works. Our experiments demonstrated that MarrNet exhibited excellent robustness against disguises and occlusions, and outperformed existing methods with a large margin (>10%). The proposed cmUNet is a meta-approach and can be used as a building block for various applications.

Abstract:
Oriented object detection is a challenging and relatively new problem. Most existing approaches are based on deep learning and explore Oriented Bounding Boxes (OBBs) to represent the objects. They are typically based on adaptations of traditional detectors that work with Horizontal Bounding Boxes (HBBs), which have been exploring IoU-like loss functions to regress the HBBs. However, extending this idea for OBBs is challenging due to complex formulations or requirement for customized backpropagation implementations. Furthermore, using OBBs presents limitations for irregular or roughly circular objects, since the definition of the ideal OBB is an ambiguous and ill-posed problem. In this work, we jointly tackle the problem of training, representing, and evaluating oriented detectors. We explore Gaussian distributions–called Gaussian Bounding Boxes (GBBs)–as fuzzy representations for oriented objects and propose using a similarity metric between two GBBs based on the Hellinger distance. We show that this metric leads to a differentiable closed-form expression that can be directly used as a localization loss term to train OBB object detectors. We also show that GBBs present a natural representation as elliptical regions (called EBBs), which inherently mitigate ambiguity representation for circular objects. Finally, we empirically show that the proposed similarity metric computed between two GBBs strongly correlates with the IoU between the corresponding EBBs, motivating the name Probabilistic Intersection-over-Union (ProbIoU). Our experiments show that results using ProbIoU as a regression loss are competitive with state-of-the-art alternatives without requiring additional hyperparameters or customized implementations, and that ProbIoU is a promising alternative to evaluate oriented object detectors. Our code is available at https://github.com/ProbIOU/.

Abstract:
In histopathology, the tissue slides are usually stained by common H&E stain or special stains (MAS, PAS, and PASM, etc.) to clearly show specific tissue structures. The rapid development of deep learning provides a good solution to generate virtual staining images to significantly reduce the time and labor costs associated with histochemical staining. However, most existing methods need to train a special model for every two stains, which consumes a lot of computing resources with the increasing of staining types. To address this problem, we propose an unsupervised multi-domain stain transfer method, GramGAN, which realizes the progressive transfer through cascaded Style-Guided blocks. For each Style-Guided block, we design a style encoding dictionary to characterize and store all the staining style information. In addition, we propose a Rényi entropy-based regularization term to improve the discrimination ability of different styles. The experimental results show that our method can realize accurate transferring among multiple staining styles with better performance. Furthermore, we build and publish a special stained image dataset suitable for glomeruli segmentation (including H&E staining), where the accuracy of glomeruli detection and segmentation can be significantly improved after transferring H&E-stained images to PAS-stained and PASM-stained ones by our method. The code is publicly available at: https://github.com/xianchaoguan/GramGAN.

Abstract:
Learned video compression methods have gained various interests in the video coding community. Most existing algorithms focus on exploring short-range temporal information and developing strong motion compensation. Still, the ignorance of long-range temporal information utilization constrains the potential of compression. In this paper, we are dedicated to exploiting both long- and short-range temporal information to enhance video compression performance. Specifically, for long-range temporal information exploration, we propose a temporal prior that can be continuously supplemented and updated during compression within the group of pictures (GOP). With the updating scheme, the temporal prior can provide richer mutual information between the overall prior and the current frame for the entropy model, thus facilitating Gaussian parameter prediction. As for the short-range temporal information, we propose a progressive guided motion compensation to achieve robust and accurate compensation. In particular, we design a hierarchical structure to build multi-scale compensation, and by employing optical flow guidance, we generate pixel offsets as motion information at each scale. Additionally, the compensation results at each scale will guide the next scale’s compensation, forming a flow-to-kernel and scale-by-scale stable guiding strategy. Extensive experimental results demonstrate that our method can obtain advanced rate-distortion performance compared to the state-of-the-art learned video compression approaches and the latest standard reference software in terms of PSNR and MS-SSIM. The codes are publicly available on: https://github.com/Huairui/LSTVC.

Abstract:
Feature selection (FS) has recently attracted considerable attention in many fields. Highly-overlapping classes and skewed distributions of data within classes have been found in various classification tasks. Most existing FS methods are all instance-based, which ignores the significant differences in characteristics between the particular outliers and the main body of the class, causing confusion for classifiers. In this paper, we propose a novel supervised FS method, Intrusive Outliers-based Feature Selection (IOFS), to find out what kind of outliers lead to misclassification and exploit the characteristics of such outliers. In order to accurately identify the intrusive outliers (IOs), we provide a density-mean center algorithm to obtain the appropriate representative of a class. A special distance threshold is given to obtain the candidate for IOs. Combining with several metrics, mathematical formulations are provided to evaluate the overlapping degree of the intrusive class pairs. Features with high overlapping degrees are assigned to low rankings in IOFS method. An extension of IOFS based on a small number of extreme IOs, called E-IOFS, is also proposed. Three theoretical proofs are provided for the essential theoretical basis of IOFS. Experiments comparing against various state-of-the-art methods on eleven benchmark datasets show that IOFS is rational and effective, especially on the datasets with higher overlapping classes. And E-IOFS almost always outperforms IOFS.

Abstract:
Cross-domain (CD) hyperspectral image classification (HSIC) has been significantly boosted by methods employing Few-Shot Learning (FSL) based on CNNs or GCNs. Nevertheless, the majority of current approaches disregard the prior information of spectral coordinates with limited interpretability, leading to inadequate robustness and knowledge transfer. In this paper, we propose an asymmetric encoder-decoder architecture, Spectral Coordinate Transformer (SCFormer), for the CDFSL HSIC task. Several dense Spectral Coordinate blocks (SC blocks) are embedded in the backbone of the encoder to establish feature representation with better generalization, which integrates spectral coordinates via Rotary Position Embedding (RoPE) to minimize spectral position disturbance caused by the convolution operation. Due to a large amount of hyperspectral image data and the high demand for model generalization ability in cross-domain scenarios, we design two mask patterns (Random Mask and Sequential Mask) built on unexploited spectral coordinates within the SC blocks, which are unified with the asymmetric structure to learn high-capacity models efficiently and effectively with satisfactory generalization. Besides, from the perspective of the loss function, we devise an intra-domain loss function founded on the Orthogonal Complement Space Projection (OCSP) theory to facilitate the aggregation of samples in the metric space, which promotes intra-domain consistency and increases interpretability. Finally, the strengthened class expression capacity of the intra-domain loss function contributes to the inter-domain loss function constructed by Wasserstein Distance (WD) for realizing domain alignment. Experimental results on four benchmark data sets demonstrate the superiority of the SCFormer.

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Department of Geomatics Engineering, University of Calgary, Calgary, AB, Canada

Abstract:
Weakly supervised object localization (WSOL) is a challenging and promising task that aims to localize objects solely based on the supervision of image category labels. In the absence of annotated bounding boxes, WSOL methods must employ the intrinsic properties of the image classification task pipeline to generate object localizations. In this work, we propose a WSOL method for exploring the Intrinsic Discrimination and Consistency in the image classification task pipeline, and call it as IDC. First, we develop a Triplet Metrics Based Foreground Modeling (TMFM) framework to directly predict object foreground regions using intrinsic discrimination. Unlike Class Activation Map (CAM) based methods that also rely on intrinsic discrimination, our TMFM framework alleviates the problem of only focusing on the most discriminative parts by optimizing foreground and background regions synergistically. Second, we design a Dual Geometric Transformation Consistency Constraints (DGTC2) training strategy to introduce additional supervision and regularization constraints for WSOL by leveraging intrinsic geometric transformation consistency. The proposed pixel-wise and object-wise consistency constraint losses cost-effectively provide spontaneous supervision for WSOL. Extensive experiments show that our IDC method achieves significant and consistent performance gains compared to existing state-of-the-art WSOL approaches. Code is available at: https://github.com/vignywang/IDC.

Abstract:
Video question answering (VideoQA) is challenging since it requires the model to extract and combine multi-level visual concepts from local objects to global actions from complex events for compositional reasoning. Existing works represent the video with fixed-duration clip features that make the model struggle in capturing the crucial concepts in multiple granularities. To overcome this shortcoming, we propose to represent the video with an Event Graph in a hierarchical structure whose nodes correspond to visual concepts of different levels (object, relation, scene and action) and edges indicate their spatial-temporal relationships. We further propose a H ierarchical S patial- T emporal T ransformer (HSTT) which takes nodes from the graph as visual input to realize compositional reasoning guided by the event graph. To fully exploit the spatial-temporal context delivered from the graph structure, on the one hand, we encode the nodes in the order of their semantic hierarchy (depth) and occurrence time (breadth) with our improved graph search algorithm; On the other hand, we introduce edge-guided attention to combine the spatial-temporal context among nodes according to their edge connections. HSTT then performs QA by cross-modal interactions guaranteed by the hierarchical correspondence between the multi-level event graph and the cross-level question. Experiments on the recent challenging AGQA and STAR datasets show that the proposed method clearly outperforms the existing VideoQA models by a large margin, including those pre-trained with large-scale external data. Our code is available at https://github.com/ByZ0e/HSTT.

Abstract:
Many deep learning based methods have been proposed for brain tumor segmentation. Most studies focus on deep network internal structure to improve the segmentation accuracy, while valuable external information, such as normal brain appearance, is often ignored. Inspired by the fact that radiologists often screen lesion regions with normal appearance as reference in mind, in this paper, we propose a novel deep framework for brain tumor segmentation, where normal brain images are adopted as reference to compare with tumor brain images in a learned feature space. In this way, features at tumor regions, i.e., tumor-related features, can be highlighted and enhanced for accurate tumor segmentation. It is known that routine tumor brain images are multimodal, while normal brain images are often monomodal. This causes the feature comparison a big issue, i.e., multimodal vs. monomodal. To this end, we present a new feature alignment module (FAM) to make the feature distribution of monomodal normal brain images consistent/inconsistent with multimodal tumor brain images at normal/tumor regions, making the feature comparison effective. Both public (BraTS2022) and in-house tumor brain image datasets are used to evaluate our framework. Experimental results demonstrate that for both datasets, our framework can effectively improve the segmentation accuracy and outperforms the state-of-the-art segmentation methods. Codes are available at https://github.com/hb-liu/Normal-Brain-Boost-Tumor-Segmentation.

Abstract:
We present a systematic approach for training and testing structural texture similarity metrics (STSIMs) so that they can be used to exploit texture redundancy for structurally lossless image compression. The training and testing is based on a set of image distortions that reflect the characteristics of the perturbations present in natural texture images. We conduct empirical studies to determine the perceived similarity scale across all pairs of original and distorted textures. We then introduce a data-driven approach for training the Mahalanobis formulation of STSIM based on the resulting annotated texture pairs. Experimental results demonstrate that training results in significant improvements in metric performance. We also show that the performance of the trained STSIM metrics is competitive with state of the art metrics based on convolutional neural networks, at substantially lower computational cost.

Abstract:
Domain generalization (DG) intends to train a model on multiple source domains to ensure that it can generalize well to an arbitrary unseen target domain. The acquisition of domain-invariant representations is pivotal for DG as they possess the ability to capture the inherent semantic information of the data, mitigate the influence of domain shift, and enhance the generalization capability of the model. Adopting multiple perspectives, such as the sample and the feature, proves to be effective. The sample perspective facilitates data augmentation through data manipulation techniques, whereas the feature perspective enables the extraction of meaningful generalization features. In this paper, we focus on improving the generalization ability of the model by compelling it to acquire domain-invariant representations from both the sample and feature perspectives by disentangling spurious correlations and enhancing potential correlations. 1) From the sample perspective, we develop a frequency restriction module, guiding the model to focus on the relevant correlations between object features and labels, thereby disentangling spurious correlations. 2) From the feature perspective, the simple Tail Interaction module implicitly enhances potential correlations among all samples from all source domains, facilitating the acquisition of domain-invariant representations across multiple domains for the model. The experimental results show that Convolutional Neural Networks (CNNs) or Multi-Layer Perceptrons (MLPs) with a strong baseline embedded with these two modules can achieve superior results, e.g., an average accuracy of 92.30% on Digits-DG. Source code is available at https://github.com/RubyHoho/DGeneralization.

Abstract:
Previous multi-modal transformers for RGB-D salient object detection (SOD) generally directly connect all patches from two modalities to model cross-modal correlation and perform multi-modal combination without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities as done before is uninformative due to the severe modality gap. Differently, we propose to disentangle the cross-modal complementary contexts to intra-modal self-attention to explore global complementary understanding, and spatial-aligned inter-modal attention to capture local cross-modal correlations, respectively. 2) Representation disentanglement. Unlike previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and mutually supplement modal-specific highlights. On top of this, we divide the tokens into consistent and private ones in the channel dimension to disentangle the multi-modal integration path and explicitly boost two complementary ways. By progressively propagate this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and the consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy can be plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks, and experiments on a CNN-based model and RGB-D semantic segmentation verify this generalization ability.

Abstract:
In real-world datasets, visually related images often form clusters, and these clusters can be further grouped into larger categories with more general semantics. These inherent hierarchical structures can help capture the underlying distribution of data, making it easier to learn robust hash codes that lead to better retrieval performance. However, existing methods fail to make use of this hierarchical information, which in turn prevents the accurate preservation of relationships between data points in the learned hash codes, resulting in suboptimal performance. In this paper, our focus is on applying visual hierarchical information to self-supervised hash learning and addressing three key challenges, including the construction, embedding, and exploitation of visual hierarchies. We propose a new self-supervised hashing method named Hierarchical Hyperbolic Contrastive Hashing (HHCH), making breakthroughs in three aspects. First, we propose to embed continuous hash codes into hyperbolic space for accurate semantic expression since embedding hierarchies in the hyperbolic space generates less distortion than in the hyper-sphere or Euclidean space. Second, we update the K-Means algorithm to make it run in the hyperbolic space. The proposed hierarchical hyperbolic K-Means algorithm can achieve the adaptive construction of hierarchical semantic structures. Last but not least, to exploit the hierarchical semantic structures in hyperbolic space, we propose the hierarchical contrastive learning algorithm, including hierarchical instance-wise and hierarchical prototype-wise contrastive learning. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms state-of-the-art self-supervised hashing methods. Our codes are released at https://github.com/HUST-IDSM-AI/HHCH.git.

Abstract:
In this paper, we present a simple yet effective continual learning method for blind image quality assessment (BIQA) with improved quality prediction accuracy, plasticity-stability trade-off, and task-order/-length robustness. The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability, and learn task-specific normalization parameters for plasticity. We assign each new IQA dataset (i.e., task) a prediction head, and load the corresponding normalization parameters to produce a quality score. The final quality estimate is computed by a weighted summation of predictions from all heads with a lightweight K -means gating mechanism. Extensive experiments on six IQA datasets demonstrate the advantages of the proposed method in comparison to previous training techniques for BIQA.

Abstract:
Identifying highlight moments of raw video materials is crucial for improving the efficiency of editing videos that are pervasive on internet platforms. However, the extensive work of manually labeling footage has created obstacles to applying supervised methods to videos of unseen categories. The absence of an audio modality that contains valuable cues for highlight detection in many videos also makes it difficult to use multimodal strategies. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module with k-point contrastive learning to learn significant representation activations. To connect the visual modality with the audio modality, we use the symmetric contrastive learning (SCL) module to learn the paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is simultaneously conducted during pretraining for representation enhancement. During inference, the cross-modal pretrained model can generate representations with paired visual-audio semantics given only the visual modality. The RASL module is used to output the highlight scores. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.

Abstract:
Multi-scale detection based on Feature Pyramid Networks (FPN) has been a popular approach in object detection to improve accuracy. However, using multi-layer features in the decoder of FPN methods entails performing many convolution operations on high-resolution feature maps, which consumes significant computational resources. In this paper, we propose a novel perspective for FPN in which we directly use fused single-layer features for regression and classification. Our proposed model, You Only Look One Hourglass (YOLOH), fuses multiple feature maps into one feature map in the encoder. We then use dense connections and dilated residual blocks to expand the receptive field of the fused feature map. This output not only contains information from all the feature maps, but also has a multi-scale receptive field for detection. The experimental results on the COCO dataset demonstrate that YOLOH achieves higher accuracy and better run-time performance than established detector baselines, for instance, it achieves an average precision (AP) of 50.2 on a standard 3× training schedule and achieves 40.3 AP at a speed of 32 FPS on the ResNet-50 model. We anticipate that YOLOH can serve as a reference for researchers to design real-time detection in future studies. Our code is available at https://github.com/wsb853529465/YOLOH-main.

Abstract:
Accurate 6-DoF pose estimation of surgical instruments during minimally invasive surgeries can substantially improve treatment strategies and eventual surgical outcome. Existing deep learning methods have achieved accurate results, but they require custom approaches for each object and laborious setup and training environments often stretching to extensive simulations, whilst lacking real-time computation. We propose a general-purpose approach of data acquisition for 6-DoF pose estimation tasks in X-ray systems, a novel and general purpose YOLOv5-6D pose architecture for accurate and fast object pose estimation and a complete method for surgical screw pose estimation under acquisition geometry consideration from a monocular cone-beam X-ray image. The proposed YOLOv5-6D pose model achieves competitive results on public benchmarks whilst being considerably faster at 42 FPS on GPU. In addition, the method generalizes across varying X-ray acquisition geometry and semantic image complexity to enable accurate pose estimation over different domains. Finally, the proposed approach is tested for bone-screw pose estimation for computer-aided guidance during spine surgeries. The model achieves a 92.41% by the 0.1\cdot d ADD-S metric, demonstrating a promising approach for enhancing surgical precision and patient outcomes. The code for YOLOv5-6D is publicly available at https://github.com/cviviers/YOLOv5-6D-Pose.

Abstract:
Convolutional neural networks (CNNs) have achieved significant improvement for the task of facial expression recognition. However, current training still suffers from the inconsistent learning intensities among different layers, i.e., the feature representations in the shallow layers are not sufficiently learned compared with those in deep layers. To this end, this work proposes a contrastive learning framework to align the feature semantics of shallow and deep layers, followed by an attention module for representing the multi-scale features in the weight-adaptive manner. The proposed algorithm has three main merits. First, the learning intensity, defined as the magnitude of the backpropagation gradient, of the features on the shallow layer is enhanced by cross-layer contrastive learning. Second, the latent semantics in the shallow-layer and deep-layer features are explored and aligned in the contrastive learning, and thus the fine-grained characteristics of expressions can be taken into account for the feature representation learning. Third, by integrating the multi-scale features from multiple layers with an attention module, our algorithm achieved the state-of-the-art performances, i.e. 92.21%, 89.50%, 62.82%, on three in-the-wild expression databases, i.e. RAF-DB, FERPlus, SFEW, and the second best performance, i.e. 65.29% on AffectNet dataset. Our codes will be made publicly available.

Abstract:
The success of existing cross-modal retrieval (CMR) methods heavily rely on the assumption that the annotated cross-modal correspondence is faultless. In practice, however, the correspondence of some pairs would be inevitably contaminated during data collection or annotation, thus leading to the so-called Noisy Correspondence (NC) problem. To alleviate the influence of NC, we propose a novel method termed Consistency REfining And Mining (CREAM) by revealing and exploiting the difference between correspondence and consistency. Specifically, the correspondence and the consistency only be coincident for true positive and true negative pairs, while being distinct for false positive and false negative pairs. Based on the observation, CREAM employs a collaborative learning paradigm to detect and rectify the correspondence of positives, and a negative mining approach to explore and utilize the consistency. Thanks to the consistency refining and mining strategy of CREAM, the overfitting on the false positives could be prevented and the consistency rooted in the false negatives could be exploited, thus leading to a robust CMR method. Extensive experiments verify the effectiveness of our method on three image-text benchmarks including Flickr30K, MS-COCO, and Conceptual Captions. Furthermore, we adopt our method into the graph matching task and the results demonstrate the robustness of our method against fine-grained NC problem. The code is available on https://github.com/XLearning-SCU/2024-TIP-CREAM.

Abstract:
Accurate segmentation of lesions is crucial for diagnosis and treatment of early esophageal cancer (EEC). However, neither traditional nor deep learning-based methods up to today can meet the clinical requirements, with the mean Dice score - the most important metric in medical image analysis - hardly exceeding 0.75. In this paper, we present a novel deep learning approach for segmenting EEC lesions. Our method stands out for its uniqueness, as it relies solely on a single input image from a patient, forming the so-called “You-Only-Have-One” (YOHO) framework. On one hand, this “one-image-one-network” learning ensures complete patient privacy as it does not use any images from other patients as the training data. On the other hand, it avoids nearly all generalization-related problems since each trained network is applied only to the same input image itself. In particular, we can push the training to “over-fitting” as much as possible to increase the segmentation accuracy. Our technical details include an interaction with clinical doctors to utilize their expertise, a geometry-based data augmentation over a single lesion image to generate the training dataset (the biggest novelty), and an edge-enhanced UNet. We have evaluated YOHO over an EEC dataset collected by ourselves and achieved a mean Dice score of 0.888, which is much higher as compared to the existing deep-learning methods, thus representing a significant advance toward clinical applications. The code and dataset are available at: https://github.com/lhaippp/YOHO.

Abstract:
Billions of people share images from their daily lives on social media every day. However, their biometric information (e.g., fingerprints) could be easily stolen from these images. The threat of fingerprint leakage from social media has created a strong desire to anonymize shared images while maintaining image quality, since fingerprints act as a lifelong individual biometric password. To guard the fingerprint leakage, adversarial attack that involves adding imperceptible perturbations to fingerprint images have emerged as a feasible solution. However, existing works of this kind are either weak in black-box transferability or cause the images to have an unnatural appearance. Motivated by the visual perception hierarchy (i.e., high-level perception exploits model-shared semantics that transfer well across models while low-level perception extracts primitive stimuli that result in high visual sensitivity when a suspicious stimulus is provided), we propose FingerSafe, a hierarchical perceptual protective noise injection framework to address the above mentioned problems. For black-box transferability, we inject protective noises into the fingerprint orientation field to perturb the model-shared high-level semantics (i.e., fingerprint ridges). Considering visual naturalness, we suppress the low-level local contrast stimulus by regularizing the response of the Lateral Geniculate Nucleus. Our proposed FingerSafe is the first to provide feasible fingerprint protection in both digital (up to 94.12%) and realistic scenarios (Twitter and Facebook, up to 68.75%). Our code can be found at https://github.com/nlsde-safety-team/FingerSafe.

Abstract:
The removal of outliers is crucial for establishing correspondence between two images. However, when the proportion of outliers reaches nearly 90%, the task becomes highly challenging. Existing methods face limitations in effectively utilizing geometric transformation consistency (GTC) information and incorporating geometric semantic neighboring information. To address these challenges, we propose a Multi-Stage Geometric Semantic Attention (MSGSA) network. The MSGSA network consists of three key modules: the multi-branch (MB) module, the GTC module, and the geometric semantic attention (GSA) module. The MB module, structured with a multi-branch design, facilitates diverse and robust spatial transformations. The GTC module captures transformation consistency information from the preceding stage. The GSA module categorizes input based on the prior stage’s output, enabling efficient extraction of geometric semantic information through a graph-based representation and inter-category information interaction using Transformer. Extensive experiments on the YFCC100M and SUN3D datasets demonstrate that MSGSA outperforms current state-of-the-art methods in outlier removal and camera pose estimation, particularly in scenarios with a high prevalence of outliers. Source code is available at https://github.com/shuyuanlin.

Abstract:
In recent years, fusing high spatial resolution multispectral images (HR-MSIs) and low spatial resolution hyperspectral images (LR-HSIs) has become a widely used approach for hyperspectral image super-resolution (HSI-SR). Various unsupervised HSI-SR methods based on deep image prior (DIP) have gained wide popularity thanks to no pre-training requirement. However, DIP-based methods often demonstrate mediocre performance in extracting latent information from the data. To resolve this performance deficiency, we propose a coupled spatial and spectral deep image priors (CS2DIPs) method for the fusion of an HR-MSI and an LR-HSI into an HR-HSI. Specifically, we integrate the nonnegative matrix-vector tensor factorization (NMVTF) into the DIP framework to jointly learn the abundance tensor and spectral feature matrix. The two coupled DIPs are designed to capture essential spatial and spectral features in parallel from the observed HR-MSI and LR-HSI, respectively, which are then used to guide the generation of the abundance tensor and spectral signature matrix for the fusion of the HSI-SR by mode-3 tensor product, meanwhile taking some inherent physical constraints into account. Free from any training data, the proposed CS2DIPs can effectively capture rich spatial and spectral information. As a result, it exhibits much superior performance and convergence speed over most existing DIP-based methods. Extensive experiments are provided to demonstrate its state-of-the-art overall performance including comparison with benchmark peer methods.

Abstract:
Video-based referring expression comprehension is a challenging task that requires locating the referred object in each video frame of a given video. While many existing approaches treat this task as an object-tracking problem, their performance is heavily reliant on the quality of the tracking templates. Furthermore, when there is not enough annotation data to assist in template selection, the tracking may fail. Other approaches are based on object detection, but they often use only one adjacent frame of the key frame for feature learning, which limits their ability to establish the relationship between different frames. In addition, improving the fusion of features from multiple frames and referring expressions to effectively locate the referents remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning of adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning to generate cross-modal features by calculating the similarity between video and expression, and then refining and fusing the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language similarity and language-image similarity matrices during feature generation. We evaluate our proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results.

Abstract:
Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Different from previous approaches focusing on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage the knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, where an anchor branch is first trained to provide insights into the data properties, with a target branch gaining more advanced knowledge to develop optimal features and distance metrics. Concretely, an anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, a target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that our DBL can achieve impressive and consistent improvements based on various recent state-of-the-art models in the image-text matching field, and outperform related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Beyond the above, we confirm that DBL can be seamlessly integrated into their training scenarios and achieve superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method.

Abstract:
Recently studies have shown the potential of weakly supervised multi-object tracking and segmentation, but the drawbacks of coarse pseudo mask label and limited utilization of temporal information remain to be unresolved. To address these issues, we present a framework that directly uses box label to supervise the segmentation network without resorting to pseudo mask label. In addition, we propose to fully exploit the temporal information from two perspectives. Firstly, we integrate optical flow-based pairwise consistency to ensure mask consistency across frames, thereby improving mask quality for segmentation. Secondly, we propose a temporally adjacent pair-based sampling strategy to adapt instance embedding learning for data association in tracking. We combine these techniques into an end-to-end deep model, named BoxMOTS, which requires only box annotation without mask supervision. Extensive experiments demonstrate that our model surpasses current state-of-the-art by a large margin, and produces promising results on KITTI MOTS and BDD100K MOTS. The source code is available at https://github.com/Spritea/BoxMOTS.

Abstract:
Existing multi-view graph learning methods often rely on consistent information for similar nodes within and across views, however they may lack adaptability when facing diversity challenges from noise, varied views, and complex data distributions. These challenges can be mainly categorized into: 1) View-specific diversity within intra-view from noise and incomplete information; 2) Cross-view diversity within inter-view caused by various latent semantics; 3) Cross-group diversity within inter-group due to data distribution differences. To this end, we propose a universal multi-view consensus graph learning framework that considers both original and generative graphs to balance consistency and diversity. Specifically, the proposed framework can be divided into the following four modules: i) Multi-channel graph module to extract principal node information, ensuring view-specific and cross-view consistency while mitigating view-specific and cross-view diversity within original graphs; ii) Generative module to produce cleaner and more realistic graphs, enriching graph structure while maintaining view-specific consistency and suppressing view-specific diversity; iii) Contrastive module to collaborate on generative semantics to facilitate cross-view consistency and reducing cross-view diversity within generative graphs; iv) Consensus graph module to consolidate learning a consensual graph, pursuing cross-group consistency and cross-group diversity. Extensive experimental results on real-world datasets demonstrate its effectiveness and superiority.

Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Department of Mathematical and Physical Sciences, University of Toronto, Toronto, ON, Canada; Qingdao AInnovation Technology Group Company Ltd., Qingdao, China; Wangxuan Institute of Computer Technology, Peking University, Beijing, China

Abstract:
Our work focuses on tackling the problem of fine-grained recognition with incomplete multi-modal data, which is overlooked by previous work in the literature. It is desirable to not only capture fine-grained patterns of objects but also alleviate the challenges of missing modalities for such a practical problem. In this paper, we propose to leverage a meta-learning strategy to learn model abilities of both fast modal adaptation and more importantly missing modality completion across a variety of incomplete multi-modality learning tasks. Based on that, we develop a meta-completion method, termed as MECOM, to perform multimodal fusion and explicit missing modality completion by our proposals of cross-modal attention and decoupling reconstruction. To further improve fine-grained recognition accuracy, an additional partial stream (as a counterpart of the main stream of MECOM, i.e., holistic) and the part-level features (corresponding to fine-grained objects’ parts) selection are designed, which are tailored for fine-grained nature to capture discriminative but subtle part-level patterns. Comprehensive experiments from quantitative and qualitative aspects, as well as various ablation studies, on two fine-grained multimodal datasets and one generic multimodal dataset show our superiority over competing methods. Our code is open-source and available at https://github.com/SEU-VIPGroup/MECOM.

Abstract:
An adversarial attack is typically implemented by solving a constrained optimization problem. In top-k adversarial attacks implementation for multi-label learning, the attack failure degree (AFD) and attack cost (AC) of a possible attack are major concerns. According to our experimental and theoretical analysis, existing methods are negatively impacted by the coarse measures for AFD/AC and the indiscriminate treatment for all constraints, particularly when there is no ideal solution. Hence, this study first develops a refined measure based on the Jaccard index appropriate for AFD and AC, distinguishing the failure degrees/costs of two possible attacks better than the existing indicator function-based scheme. Furthermore, we formulate novel optimization problems with the least constraint violation via new measures for AFD and AC, and theoretically demonstrate the effectiveness of weighting slack variables for constraints. Finally, a self-paced weighting strategy is proposed to assign different priorities to various constraints during optimization, resulting in larger attack gains compared to previous indiscriminate schemes. Meanwhile, our method avoids fluctuations during optimization, especially in the presence of highly conflicting constraints. Extensive experiments on four benchmark datasets validate the effectiveness of our method across different evaluation metrics.

Abstract:
Image deblurring for camera shake is a highly regarded problem in the field of computer vision. A promising solution is the patch-wise non-uniform image deblurring algorithms, where a linear transformation model is typically established between different blur kernels to re-estimate poorly estimated blur kernels. However, the linear model struggles to effectively describe the nonlinear transformation relationships between blur kernels. A key observation is that the inertial measurement unit (IMU) provides motion data of the camera, which is helpful in describing the landmarks of the blur kernel. This paper presents a new IMU-assisted method for the re-estimation of poorly estimated blur kernels. This method establishes a nonlinear transformation relationship model between blur kernels of different patches using IMU motion data. Subsequently, an optimization problem is applied to re-estimate poorly estimated blur kernels by incorporating this relationship model with neighboring well-estimated kernels. Experimental results demonstrate that this blur kernel re-estimation method outperforms existing methods.

Abstract:
Transparent materials are widely used in industrial applications, such as construction, transportation, and optics. However, the complex optical properties of these materials make it difficult to achieve precise surface form measurements, especially for bulk surface form inspection in industrial environments. Traditional structured light-based measurement methods often struggle with suboptimal signal-to-noise ratios, making them ineffective. Currently, there is a lack of efficient techniques for real-time inspection of such components. This paper proposes a single-frame measurement technique based on deflectometry for large-size transparent surfaces. It utilizes the reflective characteristics of the measured surface, making it independent of the surface’s diffuse reflection properties. This fundamentally solves the issues associated with signal-to-noise ratios. By discretizing the phase map, it separates the multiple surface reflection characteristics of transparent devices, enabling transparent device measurement. To meet the requirements of industrial dynamic measurement, this technique only needs a simple and low-cost system structure, which contains just two cameras for image capture. It does not require phase shifting to complete the measurement, making it independent of the screen and having the potential for larger surface measurement. The proposed method was used to measure a 400mm aperture automobile glass, and the results showed that it is able to achieve a measurement accuracy on the order of 10~\mu m. The method proposed in this paper overcomes the influence of surface reflection on transparent objects and significantly improves the efficiency and accuracy of large-sized transparent surface measurements by using a single-frame image measurement. Moreover, this method shows promise for broader applications, including measurements of lenses and HUD (Heads-Up Display) components, showcasing significant potential for industrial applications.

Abstract:
Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one’s intention. While many fruitful efforts have been made to human motion prediction, most approaches focus on pose-driven prediction and inferring human motion in isolation from the contextual environment, thus leaving the body location movement in the scene behind. However, real-world human movements are goal-directed and highly influenced by the spatial layout of their surrounding scenes. In this paper, instead of planning future human motion in a “dark” room, we propose a Multi-Condition Latent Diffusion network (MCLD) that reformulates the human motion prediction task as a multi-condition joint inference problem based on the given historical 3D body motion and the current 3D scene contexts. Specifically, instead of directly modeling joint distribution over the raw motion sequences, MCLD performs a conditional diffusion process within the latent embedding space, characterizing the cross-modal mapping from the past body movement and current scene context condition embeddings to the future human motion embedding. Extensive experiments on large-scale human motion prediction datasets demonstrate that our MCLD achieves significant improvements over the state-of-the-art methods on both realistic and diverse predictions.

Abstract:
Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related and high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noises in labels to a small unlabeled target domain. As such, the key issues of WS-PDA are: 1) how to sufficiently discover the knowledge from the noisy labeled source domain and the unlabeled target domain, and 2) how to successfully adapt the knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed as self-paced transfer classifier learning (SP-TCL), to address the above issues, which could be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is established upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training under the self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks. Code is available at https://github.com/mc-lan/SP-TCL.

Abstract:
Video question answering (VideoQA) requires the ability of comprehensively understanding visual contents in videos. Existing VideoQA models mainly focus on scenarios involving a single event with simple object interactions and leave event-centric scenarios involving multiple events with dynamically complex object interactions largely unexplored. These conventional VideoQA models are usually based on features extracted from the global visual signals, making it difficult to capture the object-level and event-level semantics. Although there exists a recent work utilizing a static spatio-temporal graph to explicitly model object interactions in videos, it ignores the dynamic impact of questions for graph construction and fails to exploit the implicit event-level semantic clues in questions. To overcome these limitations, we propose a Self-supervised Dynamic Graph Reasoning (SDGraphR) model for video question answering (VideoQA). Our SDGraphR model learns a question-guided spatio-temporal graph that dynamically encodes intra-frame spatial correlations and inter-frame correspondences between objects in the videos. Furthermore, the proposed SDGraphR model discovers event-level cues from questions to conduct self-supervised learning with an auxiliary event recognition task, which in turn helps to improve its VideoQA performances without using any extra annotations. We carry out extensive experiments to validate the substantial improvements of our proposed SDGraphR model over existing baselines.

Abstract:
Rotation averaging, which aims to calculate the absolute rotations of a set of cameras from a redundant set of their relative rotations, is an important and challenging topic arising in the study of structure from motion. A central problem in rotation averaging is how to alleviate the influence of noise and outliers. Addressing this problem, we investigate rotation averaging under the Cayley framework in this paper, inspired by the extra-constraint-free nature of the Cayley rotation representation. Firstly, for the relative rotation of an arbitrary pair of cameras regardless of whether it is corrupted by noise/outliers or not, a general Cayley rotation constraint equation is derived for reflecting the relationship between this relative rotation and the absolute rotations of the two cameras, according to the Cayley rotation representation. Then based on such a set of Cayley rotation constraint equations, a Cayley-based approach for Rotation Averaging is proposed, called CRA, where an adaptive regularizer is designed for further alleviating the influence of outliers. Finally, a unified iterative algorithm for minimizing some commonly-used loss functions is proposed under this approach. Experimental results on 16 real-world datasets and multiple synthetic datasets demonstrate that the proposed CRA approach achieves a better accuracy in comparison to several typical rotation averaging approaches in most cases.

Abstract:
Recently, there have been efforts to improve the performance in sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency from two distinct perspectives and learn instance discriminative representation for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize both granularity information and encode them into latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representation of instance samples. On the other hand, inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of both modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin. The source code is publicly available at https://github.com/sakura/Code.

Affiliations: Complex Laboratory of New Finance and Economics, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China; Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; College of Computer Science and the State Key Laboratory of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China

Abstract:
Recent advances in bio-inspired vision with event cameras and associated spiking neural networks (SNNs) have provided promising solutions for low-power consumption neuromorphic tasks. However, as the research of event cameras is still in its infancy, the amount of labeled event stream data is much less than that of the RGB database. The traditional method of converting static images into event streams by simulation to increase the sample size cannot simulate the characteristics of event cameras such as high temporal resolution. To take advantage of both the rich knowledge in labeled RGB images and the features of the event camera, we propose a transfer learning method from the RGB to the event domain in this paper. Specifically, we first introduce a transfer learning framework named R2ETL (RGB to Event Transfer Learning), including a novel encoding alignment module and a feature alignment module. Then, we introduce the temporal centered kernel alignment (TCKA) loss function to improve the efficiency of transfer learning. It aligns the distribution of temporal neuron states by adding a temporal learning constraint. Finally, we theoretically analyze the amount of data required by the deep neuromorphic model to prove the necessity of our method. Numerous experiments demonstrate that our proposed framework outperforms the state-of-the-art SNN and artificial neural network (ANN) models trained on event streams, including N-MNIST, CIFAR10-DVS and N-Caltech101. This indicates that the R2ETL framework is able to leverage the knowledge of labeled RGB images to help the training of SNN on event streams.

Abstract:
Graph Convolutional Networks (GCNs) are widely used for skeleton-based action recognition and achieved remarkable performance. Due to the locality of graph convolution, GCNs can only utilize short-range node dependencies but fail to model long-range node relationships. In addition, existing graph convolution based methods normally use a uniform skeleton topology for all frames, which limits the ability of feature learning. To address these issues, we present the Graph Convolution Network with Self-Attention (SelfGCN), which consists of a mixing features across self-attention and graph convolution (MFSG) module and a temporal-specific spatial self-attention (TSSA) module. The MFSG module models local and global relationships between joints by executing graph convolution and self-attention branches in parallel. Its bi-directional interactive learning strategy utilizes complementary clues in the channel dimensions and the spatial dimensions across both of these branches. The TSSA module uses self-attention to learn the spatial relationships between joints of each frame in a skeleton sequence. It also models the unique spatial features of the single frames. We conduct extensive experiments on three popular benchmark datasets, NTU RGB+D, NTU RGB+D120, and Northwestern-UCLA. The results of the experiment demonstrate that our method achieves or exceeds the record accuracies on all three benchmarks. Our project website is available at https://github.com/SunPengP/SelfGCN.

Abstract:
Multi-view clustering aims to learn discriminative representations from multi-view data. Although existing methods show impressive performance by leveraging contrastive learning to tackle the representation gap between every two views, they share the common limitation of not performing semantic alignment from a global perspective, resulting in the undermining of semantic patterns in multi-view data. This paper presents CSOT, namely Common Semantics via Optimal Transport, to boost contrastive multi-view clustering via semantic learning in a common space that integrates all views. Through optimal transport, the samples in multiple views are mapped to the joint clusters which represent the multi-view semantic patterns in the common space. With the semantic assignment derived from the optimal transport plan, we design a semantic learning module where the soft assignment vector works as a global supervision to enforce the model to learn consistent semantics among all views. Moreover, we propose a semantic-aware re-weighting strategy to treat samples differently according to their semantic significance, which improves the effectiveness of cross-view contrastive representation learning. Extensive experimental results demonstrate that CSOT achieves the state-of-the-art clustering performance.

Abstract:
Light field (LF) images enable numerous applications due to their ability to capture information for multiple views. Semantic segmentation is an essential task for LF scene understanding. However, existing supervised methods heavily rely on a large number of pixel-wise annotations. To relieve this problem, we propose a semi-supervised LF semantic segmentation method that requires only a small subset of labeled data and harnesses the LF disparity information. First, we design an unsupervised disparity estimation network, which can determine the disparity map for every view. With the estimated disparity maps, we generate pseudo-labels along with their weight maps for the peripheral views when only the labels of central views are available. We then merge the predictions from multiple views to obtain more reliable pseudo-labels for unlabeled data, and introduce a disparity-semantics consistency loss to enforce structure similarity. Moreover, we develop a comprehensive contrastive learning scheme that includes a pixel-level strategy to enhance feature representations and an object-level strategy to improve segmentation for individual objects. Our method demonstrates state-of-the-art performance on the benchmark LF semantic segmentation dataset under a variety of training settings and achieves comparable performance to supervised methods when trained under 1/2 protocol.

Abstract:
In the past decade, deep neural networks have revolutionized image denoising in achieving significant accuracy improvements by learning on datasets composed of noisy/clean image pairs. However, this strategy is extremely dependent on training data quality, which is a well-established weakness. To alleviate the requirement to learn image priors externally, single-image (a.k.a., self-supervised or zero-shot) methods perform denoising solely based on the analysis of the input noisy image without external dictionary or training dataset. This work investigates the effectiveness of linear combinations of patches for denoising under this constraint. Although conceptually very simple, we show that linear combinations of patches are enough to achieve state-of-the-art performance. The proposed parametric approach relies on quadratic risk approximation via multiple pilot images to guide the estimation of the combination weights. Experiments on images corrupted artificially with Gaussian noise as well as on real-world noisy images demonstrate that our method is on par with the very best single-image denoisers, outperforming the recent neural network-based techniques, while being much faster and fully interpretable.

Abstract:
Anchor graph has been recently proposed to accelerate multi-view graph clustering and widely applied in various large-scale applications. Different from capturing full instance relationships, these methods choose small portion anchors among each view, construct single-view anchor graphs and combine them into the unified graph. Despite its efficiency, we observe that: (i) Existing mechanism adopts a separable two-step procedure-anchor graph construction and individual graph fusion, which may degrade the clustering performance. (ii)These methods determine the number of selected anchors to be equal among all the views, which may destruct the data distribution diversity. A more flexible multi-view anchor graph fusion framework with diverse magnitudes is desired to enhance the representation ability. (iii) During the latter fusion process, current anchor graph fusion framework follows simple linearly-combined style while the intrinsic clustering structures are ignored. To address these issues, we propose a novel scalable and flexible anchor graph fusion framework for multi-view graph clustering method in this paper. Specially, the anchor graph construction and graph alignment are jointly optimized in our unified framework to boost clustering quality. Moreover, we present a novel structural alignment regularization to adaptively fuse multiple anchor graphs with different magnitudes. In addition, our proposed method inherits the linear complexity of existing anchor strategies respecting to the sample number, which is time-economical for large-scale data. Experiments conducted on various benchmark datasets demonstrate the superiority and effectiveness of the newly proposed anchor graph fusion framework against the existing state-of-the-arts over the clustering performance promotion and time expenditure. Our code is publicly available at https://github.com/wangsiwei2010/SMVAGC-SF.

Abstract:
Recently, transformer-based backbones show superior performance over the convolutional counterparts in computer vision. Due to quadratic complexity with respect to the token number in global attention, local attention is always adopted in low-level image processing with linear complexity. However, the limited receptive field is harmful to the performance. In this paper, motivated by Octave convolution, we propose a transformer-based single image super-resolution (SISR) model, which explicitly embeds dynamic frequency decomposition into the standard local transformer. All the frequency components are continuously updated and re-assigned via intra-scale attention and inter-scale interaction, respectively. Specifically, the attention in low resolution is enough for low-frequency features, which not only increases the receptive field, but also decreases the complexity. Compared with the standard local transformer, the proposed FDRTran layer simultaneously decreases FLOPs and parameters. By contrast, Octave convolution only decreases FLOPs of the standard convolution, but keeps the parameter number unchanged. In addition, the restart mechanism is proposed for every a few frequency updates, which first fuses the low and high frequency, then decomposes the features again. In this way, the features can be decomposed in multiple viewpoints by learnable parameters, which avoids the risk of early saturation for frequency representation. Furthermore, based on the FDRTran layer with restart mechanism, the proposed FDRNet is the first transformer backbone for SISR which discusses the Octave design. Sufficient experiments show our model reaches state-of-the-art performance on 6 synthetic and real datasets. The code and the models are available at https://github.com/catnip1029/FDRNet.

Abstract:
Multi-Contrast Magnetic Resonance Imaging (MCMRI) utilizes the short-time reference image to facilitate the reconstruction of the long-time target one, providing a new solution for fast MRI. Although various methods have been proposed, they still have certain limitations. 1) existing methods featuring the preset under-sampling patterns give rise to redundancy between multi-contrast images and limit their model performance; 2) most methods focus on the information in the image domain, prior knowledge in the k-space domain has not been fully explored; and 3) most networks are manually designed and lack certain physical interpretability. To address these issues, we propose a joint optimization of the under-sampling pattern and a deep-unfolding dual-domain network for accelerating MCMRI. Firstly, to reduce the redundant information and sample more contrast-specific information, we propose a new framework to learn the optimal under-sampling pattern for MCMRI. Secondly, a dual-domain model is established to reconstruct the target image in both the image domain and the k-space frequency domain. The model in the image domain introduces a spatial transformation to explicitly model the inconsistent and unaligned structures of MCMRI. The model in the k-space learns prior knowledge from the frequency domain, enabling the model to capture more global information from the input images. Thirdly, we employ the proximal gradient algorithm to optimize the proposed model and then unfold the iterative results into a deep-unfolding network, called MC-DuDoN. We evaluate the proposed MC-DuDoN on MCMRI super-resolution and reconstruction tasks. Experimental results give credence to the superiority of the current model. In particular, since our approach explicitly models the inconsistent structures, it shows robustness on spatially misaligned MCMRI. In the reconstruction task, compared with conventional masks, the learned mask restores more realistic images, even under an ultra-high acceleration ratio ( × 30 ). Code is available at https://github.com/lpcccc-cv/MC-DuDoNet.

Abstract:
Blind video quality assessment (VQA) has become an increasingly demanding problem in automatically assessing the quality of ever-growing in-the-wild videos. Although efforts have been made to measure temporal distortions, the core to distinguish between VQA and image quality assessment (IQA), the lack of modeling of how the human visual system (HVS) relates to the temporal quality of videos hinders the precise mapping of predicted temporal scores to the human perception. Inspired by the recent discovery of the temporal straightness law of natural videos in the HVS, this paper intends to model the complex temporal distortions of in-the-wild videos in a simple and uniform representation by describing the geometric properties of videos in the visual perceptual domain. A novel videolet, with perceptual representation embedding of a few consecutive frames, is designed as the basic quality measurement unit to quantify temporal distortions by measuring the angular and linear displacements from the straightness law. By combining the predicted score on each videolet, a perceptually temporal quality evaluator (PTQE) is formed to measure the temporal quality of the entire video. Experimental results demonstrate that the perceptual representation in the HVS is an efficient way of predicting subjective temporal quality. Moreover, when combined with spatial quality metrics, PTQE achieves top performance over popular in-the-wild video datasets. More importantly, PTQE requires no additional information beyond the video being assessed, making it applicable to any dataset without parameter tuning. Additionally, the generalizability of PTQE is evaluated on video frame interpolation tasks, demonstrating its potential to benefit temporal-related enhancement tasks.

Abstract:
Recently, one-stream trackers have achieved parallel feature extraction and relation modeling through the exploitation of Transformer-based architectures. This design greatly improves the performance of trackers. However, as one-stream trackers often overlook crucial tracking cues beyond the template, they prone to give unsatisfactory results against complex tracking scenarios. To tackle these challenges, we propose a multi-cue single-stream tracker, dubbed MCTrack here, which seamlessly integrates template information, historical trajectory, historical frame, and the search region for synchronized feature extraction and relation modeling. To achieve this, we employ two types of encoders to convert the template, historical frames, search region, and historical trajectory into tokens, which are then collectively fed into a Transformer architecture. To distill temporal and spatial cues, we introduce a novel adaptive update mechanism, which incorporates a thresholding component and a local multi-peak component to filter out less accurate and overly disturbed tracking cues. Empirically, MCTrack achieves leading performance on mainstream benchmark datasets, surpassing the most advanced SeqTrack by 2.0% in terms of the AO metric on GOT-10k. The code is available at https://github.com/wsumel/MCTrack.

Abstract:
Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning based method of content aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, the pre-encoding time was reduced by 53.8% by our method, while the average Bjøntegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.26%, and the mean absolute deviation of the BD-rate distribution was 0.57%.

Abstract:
Recently, there exists an increased research interest in embodied artificial intelligence (EAI), which involves an agent learning to perform a specific task when dynamically interacting with the surrounding 3D environment. There into, a new challenge is that many unseen objects may appear due to the increased number of object categories in 3D scenes. It makes developing models with strong zero-shot generalization ability to new objects necessary. Existing work tries to achieve this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, while it is too costly in practice. Inspired by recent advances in pre-trained models in 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training that can encode common sense as general prior knowledge. To further improve its performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where the task-specific knowledge is learned from iterative message passing. Our method can improve a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% w.r.t. answer accuracy on MP3D-EQA dataset that consists of many real-world scenes with a large number of new objects during testing), and achieve the new state-of-the-art performance.

Abstract:
High Dynamic Range (HDR) videos are able to represent wider ranges of contrasts and colors than Standard Dynamic Range (SDR) videos, giving more vivid experiences. Due to this, HDR videos are expected to grow into the dominant video modality of the future. However, HDR videos are incompatible with existing SDR displays, which form the majority of affordable consumer displays on the market. Because of this, HDR videos must be processed by tone-mapping them to reduced bit-depths to service a broad swath of SDR-limited video consumers. Here, we analyze the impact of tone-mapping operators on the visual quality of streaming HDR videos. To this end, we built the first large-scale subjectively annotated open-source database of compressed tone-mapped HDR videos, containing 15,000 tone-mapped sequences derived from 40 unique HDR source contents. The videos in the database were labeled with more than 750,000 subjective quality annotations, collected from more than 1,600 unique human observers. We demonstrate the usefulness of the new subjective database by benchmarking objective models of visual quality on it. We envision that the new LIVE Tone-Mapped HDR (LIVE-TMHDR) database will enable significant progress on HDR video tone mapping and quality assessment in the future. To this end, we make the database freely available to the community at https://live.ece.utexas.edu/research/LIVE_TMHDR/index.html.

Abstract:
Joint classification of hyperspectral images with hybrid modality can significantly enhance interpretation potentials, particularly when elevation information from the LiDAR sensor is integrated for outstanding performance. Recently, the transformer architecture was introduced to the HSI and LiDAR classification task, which has been verified as highly efficient. However, the existing naive transformer architectures suffer from two main drawbacks: 1) Inadequacy extraction for local spatial information and multi-scale information from HSI simultaneously. 2) The matrix calculation in the transformer consumes vast amounts of computing power. In this paper, we propose a novel Stochastic Window Transformer (SWFormer) framework to resolve these issues. First, the effective spatial and spectral feature projection networks are built independently based on hybrid-modal heterogeneous data composition using parallel feature extraction, which is conducive to excavating the perceptual features more representative along different dimensions. Furthermore, to construct local-global nonlinear feature maps more flexibly, we implement multi-scale strip convolution coupled with a transformer strategy. Moreover, in an innovative random window transformer structure, features are randomly masked to achieve sparse window pruning, alleviating the problem of information density redundancy, and reducing the parameters required for intensive attention. Finally, we designed a plug-and-play feature aggregation module that adapts domain offset between modal features adaptively to minimize semantic gaps between them and enhance the representational ability of the fusion feature. Three fiducial datasets demonstrate the effectiveness of the SWFormer in determining classification results.

Abstract:
Deep CNNs have achieved impressive improvements for night-time self-supervised depth estimation form a monocular image. However, the performance degrades considerably compared to day-time depth estimation due to significant domain gaps, low visibility, and varying illuminations between day and night images. To address these challenges, we propose a novel night-time self-supervised monocular depth estimation framework with structure regularization, i.e., SRNSD, which incorporates three aspects of constraints for better performance, including feature and depth domain adaptation, image perspective constraint, and cropped multi-scale consistency loss. Specifically, we utilize adaptations of both feature and depth output spaces for better night-time feature extraction and depth map prediction, along with high- and low-frequency decoupling operations for better depth structure and texture recovery. Meanwhile, we employ an image perspective constraint to enhance the smoothness and obtain better depth maps in areas where the luminosity jumps change. Furthermore, we introduce a simple yet effective cropped multi-scale consistency loss that utilizes consistency among different scales of depth outputs for further optimization, refining the detailed textures and structures of predicted depth. Experimental results on different benchmarks with depth ranges of 40m and 60m, including Oxford RobotCar dataset, nuScenes dataset and CARLA-EPE dataset, demonstrate the superiority of our approach over state-of-the-art night-time self-supervised depth estimation approaches across multiple metrics, proving our effectiveness.

Abstract:
Typical Convolutional Neural Networks (ConvNets) depend heavily on large amounts of image data and resort to an iterative optimization algorithm (e.g., SGD or Adam) to learn network parameters, making training very time- and resource-intensive. In this paper, we propose a new training paradigm and formulate the parameter learning of ConvNets into a prediction task: considering that there exist correlations between image datasets and their corresponding optimal network parameters of a given ConvNet, we explore if we can learn a hyper-mapping between them to capture the relations, such that we can directly predict the parameters of the network for an image dataset never seen during the training phase. To do this, we put forward a new hypernetwork-based model, called PudNet, which intends to learn a mapping between datasets and their corresponding network parameters, then predicts parameters for unseen data with only a single forward propagation. Moreover, our model benefits from a series of adaptive hyper-recurrent units sharing weights to capture the dependencies of parameters among different network layers. Extensive experiments demonstrate that our proposed method achieves good efficacy for unseen image datasets in two kinds of settings: Intra-dataset prediction and Inter-dataset prediction. Our PudNet can also well scale up to large-scale datasets, e.g., ImageNet-1K. It takes 8,967 GPU seconds to train ResNet-18 on the ImageNet-1K using GC from scratch and obtain a top-5 accuracy of 44.65%. However, our PudNet costs only 3.89 GPU seconds to predict the network parameters of ResNet-18 achieving comparable performance (44.92%), more than 2,300 times faster than the traditional training paradigm.

Abstract:
Full-reference point cloud quality assessment (FR-PCQA) aims to infer the quality of distorted point clouds with available references. Most of the existing FR-PCQA metrics ignore the fact that the human visual system (HVS) dynamically tackles visual information according to different distortion levels (i.e., distortion detection for high-quality samples and appearance perception for low-quality samples) and measure point cloud quality using unified features. To bridge the gap, in this paper, we propose a perception-guided hybrid metric (PHM) that adaptively leverages two visual strategies with respect to distortion degree to predict point cloud quality: to measure visible difference in high-quality samples, PHM takes into account the masking effect and employs texture complexity as an effective compensatory factor for absolute difference; on the other hand, PHM leverages spectral graph theory to evaluate appearance degradation in low-quality samples. Variations in geometric signals on graphs and changes in the spectral graph wavelet coefficients are utilized to characterize geometry and texture appearance degradation, respectively. Finally, the results obtained from the two components are combined in a non-linear method to produce an overall quality score of the tested point cloud. The results of the experiment on five independent databases show that PHM achieves state-of-the-art (SOTA) performance and offers significant performance improvement in multiple distortion environments. The code is publicly available at https://github.com/zhangyujie-1998/PHM.

Abstract:
With the increasing availability of multi-view data, multi-view representation learning has emerged as a prominent research area. However, collecting strictly view-aligned data is usually expensive, and learning from both aligned and unaligned data can be more practicable. Therefore, Partially View-aligned Representation Learning (PVRL) has recently attracted increasing attention. After aligning multi-view representations based on their semantic similarity, the aligned representations can be utilized to facilitate downstream tasks, such as clustering. However, existing methods may be constrained by the following limitations: 1) They learn semantic relations across views using the known correspondences, which is incomplete and the existence of false negative pairs (FNP) can significantly impact the learning effectiveness; 2) Existing strategies for alleviating the impact of FNP are too intuitive and lack a theoretical explanation of their applicable conditions; 3) They attempt to find FNP based on distance in the common space and fail to explore semantic relations between multi-view data. In this paper, we propose a Nonparametric Clustering-guided Cross-view Contrastive Learning (NC3L) for PVRL, in order to address the above issues. Firstly, we propose to estimate the similarity matrix between multi-view data in the marginal cross-view contrastive loss to approximate the similarity matrix of supervised contrastive learning (CL). Secondly, we establish the theoretical foundation for our proposed method by analyzing the error bounds of the loss function and its derivatives between our method and supervised CL. Thirdly, we propose a Deep Variational Nonparametric Clustering (DeepVNC) by designing a deep reparameterized variational inference for Dirichlet process Gaussian mixture models to construct cluster-level similarity between multi-view data and discover FNP. Additionally, we propose a reparameterization trick to improve the robustness and the performance of our proposed CL method. Extensive experiments on four widely used benchmark datasets show the superiority of our proposed method compared with state-of-the-art methods.

Abstract:
Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.

Abstract:
Low-light image enhancement aims to improve the visual quality of images captured under poor illumination. However, enhancing low-light images often introduces image artifacts, color bias, and low SNR. In this work, we propose AnlightenDiff, an anchoring diffusion model for low light image enhancement. Diffusion models can enhance the low light image to well-exposed image by iterative refinement, but require anchoring to ensure that enhanced results remain faithful to the input. We propose a Dynamical Regulated Diffusion Anchoring mechanism and Sampler to anchor the enhancement process. We also propose a Diffusion Feature Perceptual Loss tailored for diffusion based model to utilize different loss functions in image domain. AnlightenDiff demonstrates the effect of diffusion models for low-light enhancement and achieving high perceptual quality results. Our techniques show a promising future direction for applying diffusion models to image enhancement.

Abstract:
It is quite challenging to visually identify skin lesions with irregular shapes, blurred boundaries and large scale variances. Convolutional Neural Network (CNN) extracts more local features with abundant spatial information, while Transformer has the powerful ability to capture more global information but with insufficient spatial details. To overcome the difficulties in discriminating small or blurred skin lesions, we propose a Bi-directionally Fused Boundary Aware Network (BiFBA-Net). To utilize complementary features produced by CNNs and Transformers, we design a dual-encoding structure. Different from existing dual-encoders, our method designs a Bi-directional Attention Gate (Bi-AG) with two inputs and two outputs for crosswise feature fusion. Our Bi-AG accepts two kinds of features from CNN and Transformer encoders, and two attention gates are designed to generate two attention outputs that are sent back to the two encoders. Thus, we implement adequate exchanging of multi-scale information between CNN and Transformer encoders in a bi-directional and attention way. To perfectly restore feature maps, we propose a progressive decoding structure with boundary aware, containing three decoders with six supervised losses. The first decoder is a CNN network for producing more spatial details. The second one is a Partial Decoder (PD) for aggregating high-level features with more semantics. The last one is a Boundary Aware Decoder (BAD) proposed to progressively improve boundary accuracy. Our BAD uses residual structure and Reverse Attention (RA) at different scales to deeply mine structural and spatial details for refining lesion boundaries. Extensive experiments on public datasets show that our BiFBA-Net achieves higher segmentation accuracy, and has much better ability of boundary perceptions than compared methods. It also alleviates both over-segmentation of small lesions and under-segmentation of large ones.

Abstract:
For flash guided non-flash image denoising, the main challenge is to explore the consistency prior between the two modalities. Most existing methods attempt to model the flash/non-flash consistency in pixel level, which may easily lead to blurred edges. Different from these methods, we have an important finding in this paper, which reveals that the modality gap between flash and non-flash images conforms to the Laplacian distribution in gradient domain. Based on this finding, we establish a Laplacian gradient consistency (LGC) model for flash guided non-flash image denoising. This model is demonstrated to have faster convergence speed and denoising accuracy than the traditional pixel consistency model. Through solving the LGC model, we further design a deep network namely LGCNet. Different from existing image denoising networks, each component of the LGCNet strictly matches the solution of LGC model, giving the network good interpretability. The performance of the proposed LGCNet is evaluated on three different flash/non-flash image datasets, which demonstrates its superior denoising performance over many state-of-the-art methods both quantitatively and qualitatively. The intermediate features are also visualized to verify the effectiveness of the Laplacian gradient consistency prior. The source codes are available at https://github.com/JingyiXu404/LGCNet.

Abstract:
Integration of diverse visual prompts like clicks, scribbles, and boxes in interactive image segmentation significantly facilitates users’ interaction as well as improves interaction efficiency. However, existing studies primarily encode the position or pixel regions of prompts without considering the contextual areas around them, resulting in insufficient prompt feedback, which is not conducive to performance acceleration. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which allows users to flexibly input diverse visual prompts with the probabilistic prompt encoding and feature post-processing to excavate sufficient and robust prompt features for performance boosting. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues to accelerate performance improvement. On this basis, we further present a Prompt-to-Pixel Contrastive (P2C) loss to accurately align both prompt and pixel features, bridging the representation gap between them to offer consistent feature representations for mask prediction. Moreover, our approach designs a Dual-cross Merging Attention (DMA) module to implement bidirectional feature interaction between image and prompt features, generating notable features for performance improvement. A comprehensive variety of experiments on several challenging datasets demonstrates that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at https://github.com/XuZhang1211/PVPUFormer.

Abstract:
Hyperspectral images (HSIs), with hundreds of narrow spectral bands, are increasingly used for ground object classification in remote sensing. However, many HSI classification models operate pixel-by-pixel, limiting the utilization of spatial information and resulting in increased inference time for the whole image. This paper proposes SegHSI, an effective and efficient end-to-end HSI segmentation model, alongside a novel training strategy. SegHSI adopts a head-free structure with cluster attention modules and spatial-aware feedforward networks (SA-FFN) for multiscale spatial encoding. Cluster attention encodes pixels through constructed clusters within the HSI, while SA-FFN integrates depth-wise convolution to enhance spatial context. Our training strategy utilizes a student-teacher model framework that combines labeled pixel class information with consistency learning on unlabeled pixels. Experiments on three public HSI datasets demonstrate that SegHSI not only surpasses other state-of-the-art models in segmentation accuracy but also achieves inference time at the scale of seconds, even reaching sub-second speeds for full-image classification. Code is available at https://github.com/huanliu233/SegHSI.

Abstract:
Multi-object tracking (MOT) aims to estimate the bounding boxes and ID labels of objects in videos. The challenging issue in this task is to alleviate competitive learning between the detection and tracking subtasks, for which, two-stage Tracking-By-Detection (TBD) optimizes the two subtasks individually, and the single-stage Joint Detection and Tracking (JDT) adjusts the complex network architectures finely in an end-to-end pipeline. In this paper, we propose a new MOT method, i.e., Proposal Propagation via Diffusion Models, called Pro2Diff, which integrates a diffusion model into the proposal propagation in multi-object tracking, focusing on the model training process rather than complex network design. Specifically, using a generative approach, Pro2Diff generates a considerable number of noisy proposals for the tracking image sequence in the forward process, and subsequently, Pro2Diff learns the discrepancies between these noisy proposals and the actual bounding boxes of the tracked objects, gradually optimizing these noisy proposals to obtain the initial sequence of real tracked objects. By introducing the denoising diffusion process into multi-object tracking, we have made three further important findings: 1) Generative methods can effectively handle multi-object tracking tasks; 2) Without the need to modify the model structure, we propose self-conditional proposal propagation to enhance model performance effectively during inference; 3) By adjusting the numbers of proposals and iterations appropriately for different tracking sequences, the optimal performance of the model can be achieved. Extensive experimental results on MOT17 and DanceTrack datasets demonstrate that Pro2Diff outperforms current end-to-end multi-object tracking methods. We achieve 61.9 HOTA on DanceTrack and 57.6 HOTA on MOT17, reaching the competitive result of the JDT approach.

Abstract:
Currently, the general domain of video reconstruction (VR) is fragmented into different shutters spanning global shutter and rolling shutter cameras. Despite rapid progress in the state-of-the-art, existing methods overwhelmingly follow shutter-specific paradigms and cannot conceptually generalize to other shutter types, hindering the uniformity of VR models. In this paper, we propose UniVR, a versatile framework to handle various shutters through unified modeling and shared parameters. Specifically, UniVR encodes diverse shutter types into a unified space via a tractable shutter adapter, which is parameter-free and thus can be seamlessly delivered to current well-established VR architectures for cross-shutter transfer. To demonstrate its effectiveness, we conceptualize UniVR as three shutter-generic VR methods, namely Uni-SoftSplat, Uni-SuperSloMo, and Uni-RIFE. Extensive experimental results demonstrate that the pre-trained model without any fine-tuning can achieve reasonable performance even on novel shutters. After fine-tuning, new state-of-the-art performances are established that go beyond shutter-specific methods and enjoy strong generalization. The code is available at https://github.com/GitCVfb/UniVR.

Abstract:
As compared to standard dynamic range (SDR) videos, high dynamic range (HDR) content is able to represent and display much wider and more accurate ranges of brightness and color, leading to more engaging and enjoyable visual experiences. HDR also implies increases in data volume, further challenging existing limits on bandwidth consumption and on the quality of delivered content. Perceptual quality models are used to monitor and control the compression of streamed SDR content. A similar strategy should be useful for HDR content, yet there has been limited work on building HDR video quality assessment (VQA) algorithms. One reason for this is a scarcity of high-quality HDR VQA databases representative of contemporary HDR standards. Towards filling this gap, we created the first publicly available HDR VQA database dedicated to HDR10 videos, called the Laboratory for Image and Video Engineering (LIVE) HDR Database. It comprises 310 videos from 31 distinct source sequences processed by ten different compression and resolution combinations, simulating bitrate ladders used by the streaming industry. We used this data to conduct a subjective quality study, gathering more than 20,000 human quality judgments under two different illumination conditions. To demonstrate the usefulness of this new psychometric data resource, we also designed a new framework for creating HDR quality sensitive features, using a nonlinear transform to emphasize distortions occurring in spatial portions of videos that are enhanced by HDR, e.g., having darker blacks and brighter whites. We apply this new method, which we call HDRMAX, to modify the widely-deployed Video Multimethod Assessment Fusion (VMAF) model. We show that VMAF+HDRMAX provides significantly elevated performance on both HDR and SDR videos, exceeding prior state-of-the-art model performance. The database is now accessible at: https://live.ece.utexas.edu/research/LIVEHDR/LIVEHDR_index.html. The model will be made available at a later date at: https://live.ece.utexas.edu//research/Quality/index_algorithms.htm.

Abstract:
Interactive image segmentation (IIS) has been widely used in various fields, such as medicine, industry, etc. However, some core issues, such as pixel imbalance, remain unresolved so far. Different from existing methods based on pre-processing or post-processing, we analyze the cause of pixel imbalance in depth from the two perspectives of pixel number and pixel difficulty. Based on this, a novel and unified Click-pixel Cognition Fusion network with Balanced Cut (CCF-BC) is proposed in this paper. On the one hand, the Click-pixel Cognition Fusion (CCF) module, inspired by the human cognition mechanism, is designed to increase the number of click-related pixels (namely, positive pixels) being correctly segmented, where the click and visual information are fully fused by using a progressive three-tier interaction strategy. On the other hand, a general loss, Balanced Normalized Focal Loss (BNFL), is proposed. Its core is to use a group of control coefficients related to sample gradients and forces the network to pay more attention to positive and hard-to-segment pixels during training. As a result, BNFL always tends to obtain a balanced cut of positive and negative samples in the decision space. Theoretical analysis shows that the commonly used Focal and BCE losses can be regarded as special cases of BNFL. Experiment results of five well-recognized datasets have shown the superiority of the proposed CCF-BC method compared to other state-of-the-art methods. The source code is publicly available at https://github.com/lab206/CCF-BC.

Abstract:
Recently, with the assumption that samples can be reconstructed by themselves, subspace clustering (SC) methods have achieved great success. Generally, SC methods contain some parameters to be tuned, and different affinity matrices can obtain with different parameter values. In this paper, for the first time, we study a method for fusing these different affinity matrices to promote clustering performance and provide the corresponding solution from a multi-view clustering (MVC) perspective. That is, we argue that the different affinity matrices are consistent and complementary, which is similar to the fundamental assumption of MVC methods. Based on this observation, in this paper, we use least squares regression (LSR), which is a typical SC method, as an example since it can be efficiently optimized and has shown good clustering performance and we propose a novel robust least squares regression method from an MVC perspective (RLSR/MVCP). Specifically, we first utilize LSR with different parameter values to obtain different affinity matrices. Then, to fully explore the information contained in these different affinity matrices and to remove noise, we further fuse these affinity matrices into a tensor, which is constrained by the tensor low-rank constraint, i.e., the tensor nuclear norm (TNN). The two steps are combined into a framework that is solved by the augmented Lagrange multiplier (ALM) method. The experimental results on several datasets indicate that RLSR/MVCP has very encouraging clustering performance and is superior to state-of-the-art SC methods.

Abstract:
Small moving objects at far distance always occupy only one or a few pixels in image and exhibit extremely limited visual features, which bring great challenges to motion detection. Highly evolved visual systems endow flying insects with remarkable ability to pursue tiny mates and prey, providing a good template to develop image processing method for small target motion detection. The insects’ excellent sensitivity to small moving objects is believed to come from a class of specific neurons called small target motion detectors (STMDs). However, existing STMD-based methods often experience performance degradation when coping with complex natural scenes. In this paper, we propose a bio-inspired visual system with spatio-temporal feedback mechanism (called Spatio-Temporal Feedback STMD) to suppress false positive background movement while enhancing system responses to small targets. Specifically, the proposed visual system is composed of two complementary subnetworks and a feedback loop. The first subnetwork is designed to extract spatial and temporal movement patterns of cluttered background by neuronal ensemble coding. The second subnetwork is developed to capture small target motion information where its output and signal from the first subnetwork are integrated together via the feedback loop to filter out background false positives in a recurrent manner. Experimental results demonstrate that the proposed spatio-temporal feedback visual system is more competitive than existing methods in discriminating small moving targets from complex natural environments.

Abstract:
Transformer-based method has demonstrated promising performance in image super-resolution tasks, due to its long-range and global aggregation capability. However, the existing Transformer brings two critical challenges for applying it in large-area earth observation scenes: (1) redundant token representation due to most irrelevant tokens; (2) single-scale representation which ignores scale correlation modeling of similar ground observation targets. To this end, this paper proposes to adaptively eliminate the interference of irreverent tokens for a more compact self-attention calculation. Specifically, we devise a Residual Token Selective Group (RTSG) to grasp the most crucial token by dynamically selecting the top- k keys in terms of score ranking for each query. For better feature aggregation, a Multi-scale Feed-forward Layer (MFL) is developed to generate an enriched representation of multi-scale feature mixtures during feed-forward process. Moreover, we also proposed a Global Context Attention (GCA) to fully explore the most informative components, thus introducing more inductive bias to the RTSG for an accurate reconstruction. In particular, multiple cascaded RTSGs form our final Top- k Token Selective Transformer (TTST) to achieve progressive representation. Extensive experiments on simulated and real-world remote sensing datasets demonstrate our TTST could perform favorably against state-of-the-art CNN-based and Transformer-based methods, both qualitatively and quantitatively. In brief, TTST outperforms the state-of-the-art approach (HAT-L) in terms of PSNR by 0.14 dB on average, but only accounts for 47.26% and 46.97% of its computational cost and parameters. The code and pre-trained TTST will be available at https://github.com/XY-boy/TTST for validation.

Abstract:
Scene text spotting is a challenging task, especially for inverse-like scene text, which has complex layouts, e.g., mirrored, symmetrical, or retro-flexed. In this paper, we propose a unified end-to-end trainable inverse-like antagonistic text spotting framework dubbed IATS, which can effectively spot inverse-like scene texts without sacrificing general ones. Specifically, we propose an innovative reading-order estimation module (REM) that extracts reading-order information from the initial text boundary generated by an initial boundary module (IBM). To optimize and train REM, we propose a joint reading-order estimation loss ( \mathcal L_RE ) consisting of a classification loss, an orthogonality loss, and a distribution loss. With the help of IBM, we can divide the initial text boundary into two symmetric control points and iteratively refine the new text boundary using a lightweight boundary refinement module (BRM) for adapting to various shapes and scales. To alleviate the incompatibility between text detection and recognition, we propose a dynamic sampling module (DSM) with a thin-plate spline that can dynamically sample appropriate features for recognition in the detected text region. Without extra supervision, the DSM can proactively learn to sample appropriate features for text recognition through the gradient returned by the recognition module. Extensive experiments on both challenging scene text and inverse-like scene text datasets demonstrate that our method achieves superior performance both on irregular and inverse-like text spotting.

Abstract:
In the versatile video coding (VVC) standard, cross-component prediction (CCP) is introduced to utilize the correlation between luma and chroma components. To further exploit the cross-component redundancy, a bundle of novel CCP methods with more sophisticated models are studied in enhanced compression model (ECM), which is a platform targeting at the next generation video coding standard, developed by JVET. This paper presents two methods, known as local-boosting CCP (LB-CCP) and non-local CCP (NL-CCP) to improve CCP with local and non-local information. With LB-CCP, prediction samples of CCP can be filtered with neighbouring samples. Besides, neighbouring template costs are calculated to determine the range of training samples, as well as the cross-component prediction method used in the chroma fusion mode. With NL-CCP, a CCP model can be derived with samples non-adjacent to the current block. As an equivalent but simpler implementation, the CCP model can be inherited from a CCP-coded neighbouring block as a spatial CCP candidate. Moreover, the CCP model of a previous CCP-coded block can be stored in a history-based table, which can be fetched and used by the current block as a history-based CCP candidate. A NL-CCP candidate list is built with the two kinds of candidates. The encoder can select the optimal candidate and send an index to the decoder. Experimental results show that LB-CCP together with NL-CCP can provide an average Bjontegaard delta rate (BD-rate) reduction of 0.27%, 2.31%, 2.44% on Y, Cb, Cr components, respectively, with a negligible change in the running time, compared with ECM-8.0 in all intra configurations. Both LB-CCP and NL-CCP have been adopted into ECM.

Abstract:
Single image dehazing is a challenging ill-posed problem which estimates latent haze-free images from observed hazy images. Some existing deep learning based methods are devoted to improving the model performance via increasing the depth or width of convolution. The learning ability of Convolutional Neural Network (CNN) structure is still under-explored. In this paper, a Detail-Enhanced Attention Block (DEAB) consisting of Detail-Enhanced Convolution (DEConv) and Content-Guided Attention (CGA) is proposed to boost the feature learning for improving the dehazing performance. Specifically, the DEConv contains difference convolutions which can integrate prior information to complement the vanilla one and enhance the representation capacity. Then by using the re-parameterization technique, DEConv is equivalently converted into a vanilla convolution to reduce parameters and computational cost. By assigning the unique Spatial Importance Map (SIM) to every channel, CGA can attend more useful information encoded in features. In addition, a CGA-based mixup fusion scheme is presented to effectively fuse the features and aid the gradient flow. By combining above mentioned components, we propose our Detail-Enhanced Attention Network (DEA-Net) for recovering high-quality haze-free images. Extensive experimental results demonstrate the effectiveness of our DEA-Net, outperforming the state-of-the-art (SOTA) methods by boosting the PSNR index over 41 dB with only 3.653 M parameters. (The source code of our DEA-Net is available at https://github.com/cecret3350/DEA-Net.)

Abstract:
Deep learning-based hyperspectral image (HSI) classification methods have recently shown excellent performance, however, there are two shortcomings that need to be addressed. One is that deep network training requires a large number of labeled images, and the other is that deep network needs to learn a large number of parameters. They are also general problems of deep networks, especially in applications that require professional techniques to acquire and label images, such as HSI and medical images. In this paper, we propose a deep network architecture (SAFDNet) based on the stochastic adaptive Fourier decomposition (SAFD) theory. SAFD has powerful unsupervised feature extraction capabilities, so the entire deep network only requires a small number of annotated images to train the classifier. In addition, we use fewer convolution kernels in the entire deep network, which greatly reduces the number of deep network parameters. SAFD is a newly developed signal processing tool with solid mathematical foundation, which is used to construct the unsupervised deep feature extraction mechanism of SAFDNet. Experimental results on three popular HSI classification datasets show that our proposed SAFDNet outperforms other compared state-of-the-art deep learning methods in HSI classification.

Abstract:
Image restoration under adverse weather conditions (e.g., rain, snow, and haze) is a fundamental computer vision problem that has important implications for various downstream applications. Distinct from early methods that are specially designed for specific types of weather, recent works tend to simultaneously remove various adverse weather effects based on either spatial feature representation learning or semantic information embedding. Inspired by various successful applications incorporating large-scale pre-trained models (e.g., CLIP), in this paper, we explore their potential benefits for leveraging large-scale pre-trained models in this task based on both spatial feature representation learning and semantic information embedding aspects: 1) spatial feature representation learning, we design a Spatially Adaptive Residual (SAR) encoder to adaptively extract degraded areas. To facilitate training of this model, we propose a Soft Residual Distillation (CLIP-SRD) strategy to transfer spatial knowledge from CLIP between clean and adverse weather images; 2) semantic information embedding, we propose a CLIP Weather Prior (CWP) embedding module to enable the network to adaptively respond to different weather conditions. This module integrates the sample-specific weather priors extracted by the CLIP image encoder with the distribution-specific information (as learned by a set of parameters) and embeds these elements using a cross-attention mechanism. Extensive experiments demonstrate that our proposed method can achieve state-of-the-art performance under various and severe adverse weather conditions. The code will be made available.

Abstract:
Deep learning has excelled in single-image super-resolution (SISR) applications, yet the lack of interpretability in most deep learning-based SR networks hinders their applicability, especially in fields like medical imaging that require transparent computation. To address these problems, we present an interpretable frequency division SR network that operates in the image frequency domain. It comprises a frequency division module and a step-wise reconstruction method, which divides the image into different frequencies and performs reconstruction accordingly. We develop a frequency division loss function to ensure that each reconstruction module (ReM) operates solely at one image frequency. These methods establish an interpretable framework for SR networks, visualizing the image reconstruction process and reducing the black box nature of SR networks. Additionally, we revisited the subpixel layer upsampling process by deriving its inverse process and designing a displacement generation module. This interpretable upsampling process incorporates subpixel information and is similar to pre-upsampling frameworks. Furthermore, we develop a new ReM based on interpretable Hessian attention to enhance network performance. Extensive experiments demonstrate that our network, without the frequency division loss, outperforms state-of-the-art methods qualitatively and quantitatively. The inclusion of the frequency division loss enhances the network’s interpretability and robustness, and only slightly decreases the PSNR and SSIM metrics by an average of 0.48 dB and 0.0049, respectively.

Abstract:
In this paper, we propose a probabilistic regression diffusion model for head pose estimation, dubbed HeadDiff, which typically addresses the rotation uncertainty, especially when faces are captured in wild conditions. Unlike conventional image-to-pose methods which cannot explicitly establish the rotational manifold of head poses, our HeadDiff aims to ensure the pose rotation via the diffusion process and in parallel, refine the mapping process iteratively. Specifically, we initially formulate the head pose estimation problem as a reverse diffusion process, defining a paradigm for progressive denoising on the manifold, which explores the uncertainty by decomposing the large gap into intermediate steps. Moreover, our HeadDiff is equipped with an isotropic Gaussian distribution by encoding the incoherence information in our rotation representation. Finally, we learn the facial relationship of nearest neighbors with a cycle-consistent constraint for robust pose estimation versus diverse shape variations. Experimental results on multiple datasets demonstrate that our proposed method outperforms existing state-of-the-art techniques without auxiliary data.

Abstract:
Self-supervised contrastive learning has proven to be successful for skeleton-based action recognition. For contrastive learning, data transformations are found to fundamentally affect the learned representation quality. However, traditional invariant contrastive learning is detrimental to the performance on the downstream task if the transformation carries important information for the task. In this sense, it limits the application of many data transformations in the current contrastive learning pipeline. To address these issues, we propose to utilize equivariant contrastive learning, which extends invariant contrastive learning and preserves important information. By integrating equivariant and invariant contrastive learning into a hybrid approach, the model can better leverage the motion patterns exposed by data transformations and obtain a more discriminative representation space. Specifically, a self-distillation loss is first proposed for transformed data of different intensities to fully utilize invariant transformations, especially strong invariant transformations. For equivariant transformations, we explore the potential of skeleton mixing and temporal shuffling for equivariant contrastive learning. Meanwhile, we analyze the impacts of different data transformations on the feature space in terms of two novel metrics proposed in this paper, namely, consistency and diversity. In particular, we demonstrate that equivariant learning boosts performance by alleviating the dimensional collapse problem. Experimental results on several benchmarks indicate that our method outperforms existing state-of-the-art methods.

Abstract:
Anatomical and functional image fusion is an important technique in a variety of medical and biological applications. Recently, deep learning (DL)-based methods have become a mainstream direction in the field of multi-modal image fusion. However, existing DL-based fusion approaches have difficulty in effectively capturing local features and global contextual information simultaneously. In addition, the scale diversity of features, which is a crucial issue in image fusion, often lacks adequate attention in most existing works. In this paper, to address the above problems, we propose a MixFormer-based multi-scale network, termed as MM-Net, for anatomical and functional image fusion. In our method, an improved MixFormer-based backbone is introduced to sufficiently extract both local features and global contextual information at multiple scales from the source images. The features from different source images are fused at multiple scales based on a multi-source spatial attention-based cross-modality feature fusion (CMFF) module. The scale diversity of the fused features is further enriched by a series of multi-scale feature interaction (MSFI) modules and feature aggregation upsample (FAU) modules. Moreover, a loss function consisting of both spatial domain and frequency domain components is devised to train the proposed fusion model. Experimental results demonstrate that our method outperforms several state-of-the-art fusion methods on both qualitative and quantitative comparisons, and the proposed fusion model exhibits good generalization capability. The source code of our fusion method will be available at https://github.com/yuliu316316.

Abstract:
Deep unrolling-based snapshot compressive imaging (SCI) methods, which employ iterative formulas to construct interpretable iterative frameworks and embedded learnable modules, have achieved remarkable success in reconstructing 3-dimensional (3D) hyperspectral images (HSIs) from 2D measurement induced by coded aperture snapshot spectral imaging (CASSI). However, the existing deep unrolling-based methods are limited by the residuals associated with Taylor approximations and the poor representation ability of single hand-craft priors. To address these issues, we propose a novel HSI construction method named residual completion unrolling with mixed priors (RCUMP). RCUMP exploits a residual completion branch to solve the residual problem and incorporates mixed priors composed of a novel deep sparse prior and mask prior to enhance the representation ability. Our proposed CNN-based model can significantly reduce memory cost, which is an obvious improvement over previous CNN methods, and achieves better performance compared with the state-of-the-art transformer and RNN methods. In this work, our method is compared with the 9 most recent baselines on 10 scenes. The results show that our method consistently outperforms all the other methods while decreasing memory consumption by up to 80%.

Abstract:
Estimating the head pose of a person is a crucial problem for numerous applications that is yet mainly addressed as a subtask of frontal pose prediction. We present a novel method for unconstrained end-to-end head pose estimation to tackle the challenging task of full range of orientation head pose prediction. We address the issue of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This allows to efficiently learn full rotation appearance and to overcome the limitations of the current state-of-the-art. Together with new accumulated training data that provides full head pose rotation data and a geodesic loss approach for stable learning, we design an advanced model that is able to predict an extended range of head orientations. An extensive evaluation on public datasets demonstrates that our method significantly outperforms other state-of-the-art methods in an efficient and robust manner, while its advanced prediction range allows the expansion of the application area. We open-source our training and testing code along with our trained models: https://github.com/thohemp/6DRepNet360.

Abstract:
Due to the sparse single-frame annotations, current Single-Frame Temporal Action Localization (SF-TAL) methods generally employ threshold-based pseudo-label generation strategies. However, these approaches suffer from inefficient data utilization, as only parts of unlabeled frames with confidence scores surpassing a predefined threshold are selected for training. Moreover, the variability of single-frame annotations and unreliable model predictions introduce pseudo-label noise. To address these challenges, we propose two strategies by using the relationship of the video segments with their neighbors’: 1) temporal neighbor-guided soft pseudo-label generation (TNPG); and 2) semantic neighbor-guided pseudo-label refinement (SNPR). TNPG utilizes a local-global self-attention mechanism in a transformer encoder to capture temporal neighbor information while focusing on the whole video. Then the generated self-attention map is multiplied by the network predictions to propagate information between labeled and unlabeled frames, and produce soft pseudo-label for all segments. Despite this, label noise persists due to unreliable model predictions. To mitigate this, SNPR refines pseudo-labels based on the assumption that predictions should resemble their semantic nearest neighbors’. Specifically, we search for semantic nearest neighbors of each video segment by cosine similarity in the feature space. Then the refined soft pseudo-labels can be obtained by a weight combination of the original pseudo-label and the semantic nearest neighbors’. Finally, the model can be trained with the refined pseudo-labels, and the performance has been greatly improved. Comprehensive experimental results on different benchmarks show that we achieve state-of-the-art performances on THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets.

Abstract:
The goal of moving object segmentation is separating moving objects from stationary backgrounds in videos. One major challenge in this problem is how to develop a universal model for videos from various natural scenes since previous methods are often effective only in specific scenes. In this paper, we propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation. In the proposed approach, the distribution from temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for a scene-independent segmentation. Notably, the DIDL network incorporates the use of an improved product distribution layer that we have newly derived. Then, the Stochastic Bayesian Refinement (SBR) Network, which learns the spatial correlation, is proposed to improve the binary mask generated by the DIDL network. Benefiting from the scene independence of the temporal distribution and the accuracy improvement resulting from the spatial correlation, the proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters. Comprehensive experiments on standard datasets including LASIESTA, CDNet2014, BMC, SBMI2015 and 128 real world videos demonstrate the superiority of proposed approach compared to state-of-the-art methods with or without the use of deep learning networks. To the best of our knowledge, this work has high potential to be a general solution for moving object segmentation in real world environments. The code and real-world videos can be found on GitHub https://github.com/guanfangdong/LTS-UniverisalMOS.

Abstract:
Color transfer aims to change the color information of the target image according to the reference one. Many studies propose color transfer methods by analysis of color distribution and semantic relevance, which do not take the perceptual characteristics for visual quality into consideration. In this study, we propose a novel color transfer method based on the saliency information with brightness optimization. First, a saliency detection module is designed to separate the foreground regions from the background regions for images. Then a dual-branch module is introduced to implement color transfer for images. Finally, a brightness optimization operation is designed during the fusion of foreground and background regions for color transfer. Experimental results show that the proposed method can implement the color transfer for images while keeping the color consistency well. Compared with other existing studies, the proposed method can obtain significant performance improvement. The source code and pre-trained models are available at https://github.com/PlanktonQAQ/SCTNet.

Abstract:
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task focusing on semantic understanding of untrimmed long-term videos and diverse free-form questions, simultaneously emphasizing comprehensive cross-modal reasoning to yield precise answers. The canonical approaches often rely on off-the-shelf feature extractors to detour the expensive computation overhead, but often result in domain-independent modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recent emerging successful video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific ratiocination and exhibit disparities in task formulation. Toward this end, we present an entirely end-to-end solution for long-term VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations possessing high visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to harness the intrinsically or explicitly exhibited semantic correspondences. To alleviate the task formulation discrepancy problem, we propose a Cross-modal Collaborative Generation (CCG) module to reformulate VideoQA as a generative task instead of the conventional classification scheme, empowering the model with the capability for cross-modal high-semantic fusion and generation so as to rationalize and answer. Extensive experiments conducted on six publicly available VideoQA datasets underscore the superiority of our proposed method.

Abstract:
The statistical regularities of natural images, referred to as natural scene statistics, play an important role in no-reference image quality assessment. However, it has been widely acknowledged that screen content images (SCIs), which are typically computer generated, do not hold such statistics. Here we make the first attempt to learn the statistics of SCIs, based upon which the quality of SCIs can be effectively determined. The underlying mechanism of the proposed approach is based upon the mild assumption that the SCIs, which are not physically acquired, still obey certain statistics that could be understood in a learning fashion. We empirically show that the statistics deviation could be effectively leveraged in quality assessment, and the proposed method is superior when evaluated in different settings. Extensive experimental results demonstrate the Deep Feature Statistics based SCI Quality Assessment (DFSS-IQA) model delivers promising performance compared with existing NR-IQA models and shows a high generalization capability in the cross-dataset settings. The implementation of our method is publicly available at https://github.com/Baoliang93/DFSS-IQA.

Abstract:
We live in a 3D world where people interact with each other in the environment. Learning 3D posed humans therefore requires us to perceive and interpret these interactions. This paper proposes LEAPSE, a novel method that learns salient instance affordances for estimating a posed body from a single RGB image in a non-parametric manner. Existing methods mostly ignore the environment and estimate the human body independently from the surroundings. We capture the influences of non-contact and contact instances on a posed body as an adequate representation of the “environment affordances”. The proposed method learns the global relationships between 3D joints, body mesh vertices, and salient instances as environment affordances on the human body. LEAPSE achieved state-of-the-art results on the 3DPW dataset with many affordance instances, and also demonstrated excellent performance on Human3.6M dataset. We further demonstrate the benefit of our method by showing that the performance of existing weak models can be significantly improved when combined with our environment affordance module.

Abstract:
Continual zero-shot learning (CZSL) aims to develop a model that accumulates historical knowledge to recognize unseen tasks, while eliminating catastrophic forgetting for seen tasks when learning new tasks. However, existing CZSL methods, while mitigating catastrophic forgetting for old tasks, often lead to negative transfer problem for new tasks by over-focusing on accumulating old knowledge and neglecting the plasticity of the model for learning new tasks. To tackle these problems, we propose PAMK, a prototype augmented multi-teacher knowledge transfer network that strikes a trade-off between recognition stability for old tasks and generalization plasticity for new tasks. PAMK consists of a prototype augmented contrastive generation (PACG) module and a multi-teacher knowledge transfer (MKT) module. To reduce the cumulative semantic decay of the class representation embedding and mitigate catastrophic forgetting, we propose a continual prototype augmentation strategy based on relevance scores in PACG. Furthermore, by introducing the prototype augmented semantic-visual contrastive loss, PACG promotes intra-class compactness for all classes across all tasks. MKT effectively accumulates semantic knowledge learned from old tasks to recognize new tasks via the proposed multi-teacher knowledge transfer, eliminating the negative transfer problem. Extensive experiments on various CZSL settings demonstrate the superior performance of PAMK compared to state-of-the-art methods. In particular, in the practical task-free CZSL setting, PAMK achieves impressive gains of 3.28%, 3.09% and 3.71% in mean harmonic accuracy on the CUB, AWA1, and AWA2 datasets, respectively.

Abstract:
Continuous sign language recognition (CSLR) is to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR’s visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual features. Specifically, the signer-invariant gloss feature is extracted by a pre-trained gloss BERT model. Then we design a gloss prior guidance network (GPGN). It contains a novel parallel densely-connected temporal feature extraction (PDC-TFE) module for multi-resolution visual feature extraction. The PDC-TFE captures the complex temporal patterns of the glosses. The pre-trained gloss feature guides the visual feature learning through a cross-modality matching loss. We propose to formulate the cross-modality feature matching into a regularized optimal transport problem, it can be efficiently solved by a variant of the Sinkhorn algorithm. The GPGN parameters are learned by optimizing a weighted sum of the cross-modality matching loss and CTC loss. The experiment results on German and Chinese sign language benchmarks demonstrate that the proposed GPGN achieves competitive performance. The ablation study verifies the effectiveness of several critical components of the GPGN. Furthermore, the proposed pre-trained gloss BERT model and cross-modality matching can be seamlessly integrated into other RGB-cue-based CSLR methods as plug-and-play formulations to enhance the generalization ability of the visual feature extractor.

Abstract:
This paper introduces a stochastic plug-and-play (PnP) sampling algorithm that leverages variable splitting to efficiently sample from a posterior distribution. The algorithm based on split Gibbs sampling (SGS) draws inspiration from the half quadratic splitting method (HQS) and the alternating direction method of multipliers (ADMM). It divides the challenging task of posterior sampling into two simpler sampling problems. The first problem depends on the likelihood function, while the second is interpreted as a Bayesian denoising problem that can be readily carried out by a deep generative model. Specifically, for an illustrative purpose, the proposed method is implemented in this paper using state-of-the-art diffusion-based generative models. Akin to its deterministic PnP-based counterparts, the proposed method exhibits the great advantage of not requiring an explicit choice of the prior distribution, which is rather encoded into a pre-trained generative model. However, unlike optimization methods (e.g., PnP-ADMM and PnP-HQS) which generally provide only point estimates, the proposed approach allows conventional Bayesian estimators to be accompanied by confidence intervals at a reasonable additional computational cost. Experiments on commonly studied image processing problems illustrate the efficiency of the proposed sampling strategy. Its performance is compared to recent state-of-the-art optimization and sampling methods.

Abstract:
Recent restoration methods for handling real old photos have achieved significant improvements using generative networks. However, the restoration quality under the usual generative architectures is greatly affected by the encoded properties of latent space, which reflect pivotal semantic information in the recovery process. Therefore, how to find the suitable latent space and identify its semantic factors is an important issue in this challenging task. To this end, we propose a novel generative network with hyperbolic embeddings to restore old photos that suffer from multiple degradations. Specifically, we transform high-dimensional Euclidean features into a compact latent space via the hyperbolic operations. In order to enhance the hierarchical representative capability, we perform the channel mixing and group convolutions for the intermediate hyperbolic features. By using attention-based aggregation mechanism in a hyperbolic space, we can further obtain the resulting latent vectors, which are more effective in encoding the important semantic factors and improving the restoration quality. Besides, we design a diversity loss to guide each latent vector to disentangle different semantics. Extensive experiments have shown that our method is able to generate visually pleasing photos and outperforms state-of-the-art restoration methods.

Abstract:
Recent advancements in deep learning techniques have pushed forward the frontiers of real photograph denoising. However, due to the inherent pooling operations in the spatial domain, current CNN-based denoisers are biased towards focusing on low-frequency representations, while discarding the high-frequency components. This will induce a problem for suboptimal visual quality as the image denoising tasks target completely eliminating the complex noises and recovering all fine-scale and salient information. In this work, we tackle this challenge from the frequency perspective and present a new solution pipeline, coined as frequency attention denoising network (FADNet). Our key idea is to build a learning-based frequency attention framework, where the feature correlations on a broader frequency spectrum can be fully characterized, thus enhancing the representational power of the network across multiple frequency channels. Based on this, we design a cascade of adaptive instance residual modules (AIRMs). In each AIRM, we first transform the spatial-domain features into the frequency space. Then, a learning-based frequency attention framework is devised to explore the feature inter-dependencies converted in the frequency domain. Besides this, we introduce an adaptive layer by leveraging the guidance of the estimated noise map and intermediate features to meet the challenges of model generalization in the noise discrepancy. The effectiveness of our method is demonstrated on several real camera benchmark datasets, with superior denoising performance, generalization capability, and efficiency versus the state-of-the-art.

Abstract:
Novel view synthesis aims at rendering any posed images from sparse observations of the scene. Recently, neural radiance fields (NeRF) have demonstrated their effectiveness in synthesizing novel views of a bounded scene. However, most existing methods cannot be directly extended to 360° unbounded scenes where the camera orientations and scene depths are unconstrained with large variations. In this paper, we present a spherical radiance field (SRF) for efficient novel view synthesis in 360° unbounded scenes. Specifically, we represent a 3D scene as multiple concentric spheres with different radii. In particular, each sphere encodes its corresponding layered scene into implicit representations and is parameterized with an equirectangular projection image. A shallow multi-layer perceptron (MLP) is then used to infer the density and color from these sphere representations for volume rendering. Moreover, an occupancy grid is introduced to cache the density field and guide the ray sampling, which accelerates the training and rendering procedures by reducing the number of samples along the ray. Experiments show that our method can well fit 360° unbounded scenes and produces state-of-the-art results on three benchmark datasets with less than 30 minutes of training time on a 3090 GPU, surpassing Mip-NeRF 360 with a 400× speedup. In addition, our method achieves competitive performance in terms of both accuracy and efficiency on a bounded dataset. Project page: https://minglin-chen.github.io/SphericalRF

Abstract:
Recent research on few-shot fine-grained image classification (FSFG) has predominantly focused on extracting discriminative features. The limited attention paid to the role of loss functions has resulted in weaker preservation of similarity relationships between query and support instances, thereby potentially limiting the performance of FSFG. In this regard, we analyze the limitations of widely adopted cross-entropy loss and introduce a novel Angular ISotonic (AIS) loss. The AIS loss introduces an angular margin to constrain the prototypes to maintain a certain distance from a pre-set threshold. It guides the model to converge more stably, learn clearer boundaries among highly similar classes, and achieve higher accuracy faster with limited instances. Moreover, to better accommodate the feature requirements of the AIS loss and fully exploit its potential in FSFG, we propose a Multi-Layer Integration (MLI) network that captures object features from multiple perspectives to provide more comprehensive and informative representations of the input images. Extensive experiments demonstrate the effectiveness of our proposed method on four standard fine-grained benchmarks. Codes are available at: https://github.com/Legenddddd/AIS-MLI.

Abstract:
Semi-supervised learning (SSL), which aims to learn with limited labeled data and massive amounts of unlabeled data, offers a promising approach to exploit the massive amounts of satellite Earth observation images. The fundamental concept underlying most state-of-the-art SSL methods involves generating pseudo-labels for unlabeled data based on image-level predictions. However, complex remote sensing (RS) scene images frequently encounter challenges, such as interference from multiple background objects and significant intra-class differences, resulting in unreliable pseudo-labels. In this paper, we propose the SemiRS-COC, a novel semi-supervised classification method for complex RS scenes. Inspired by the idea that neighboring objects in feature space should share consistent semantic labels, SemiRS-COC utilizes the similarity between foreground objects in RS images to generate reliable object-level pseudo-labels, effectively addressing the issues of multiple background objects and significant intra-class differences in complex RS images. Specifically, we first design a Local Self-Learning Object Perception (LSLOP) mechanism, which transforms multiple background objects interference of RS images into usable annotation information, enhancing the model’s object perception capability. Furthermore, we present a Cross-Object Consistency Pseudo-Labeling (COCPL) strategy, which generates reliable object-level pseudo-labels by comparing the similarity of foreground objects across different RS images, effectively handling significant intra-class differences. Extensive experiments demonstrate that our proposed method achieves excellent performance compared to state-of-the-art methods on three widely-adopted RS datasets.

Abstract:
Visible infrared person re-identification (VI-ReID) exposes considerable challenges because of the modality gaps between the person images captured by daytime visible cameras and nighttime infrared cameras. Several fully-supervised VI-ReID methods have improved the performance with extensive labeled heterogeneous images. However, the identity of the person is difficult to obtain in real-world situations, especially at night. Limited known identities and large modality discrepancies impede the effectiveness of the model to a great extent. In this paper, we propose a novel Semi-Supervised Learning framework with Heterogeneous Distribution Consistency (HDC-SSL) for VI-ReID. Specifically, through investigating the confidence distribution of heterogeneous images, we introduce a Gaussian Mixture Model-based Pseudo Labeling (GMM-PL) method, which adaptively adjusts different thresholds for each modality to label the identity. Moreover, to facilitate the representation learning of unutilized data whose prediction is lower than the threshold, Modality Consistency Regularization (MCR) is proposed to ensure the prediction consistency of the cross-modality pedestrian images and handle the modality variance. Extensive experiments with different label settings on two VI-ReID datasets demonstrate the effectiveness of our method. Particularly, HDC-SSL achieves competitive performance with state-of-the-art fully-supervised VI-ReID methods on RegDB dataset with only 1 visible label and 1 infrared label per class.

Abstract:
Multi-focus image fusion can fuse the clear parts of two or more source images captured at the same scene with different focal lengths into an all-in-focus image. On the one hand, previous supervised learning-based multi-focus image fusion methods relying on synthetic datasets have a clear distribution shift with real scenarios. On the other hand, unsupervised learning-based multi-focus image fusion methods can well adapt to the observed images but lack the general knowledge of defocus blur that can be learned from paired data. To avoid the problems of existing methods, this paper presents a novel multi-focus image fusion model by considering both the general knowledge brought by the supervised pretrained backbone and the extrinsic priors optimized on specific testing sample to improve the performance of image fusion. To be specific, the Incremental Network Prior Adaptation (INPA) framework is proposed to incrementally integrate features extracted from the pretrained strong baselines into a tiny prior network (6.9% parameters of the backbone network) to boost the performance for test samples. We evaluate our method on both synthetic and real-world public datasets (Lytro, MFI-WHU, and Real-MFF) and show that our method outperforms existing supervised learning-based methods and unsupervised learning based methods.

Abstract:
Event cameras triggered a paradigm shift in the computer vision community delineated by their asynchronous nature, low latency, and high dynamic range. Calibration of event cameras is always essential to account for the sensor intrinsic parameters and for 3D perception. However, conventional image-based calibration techniques are not applicable due to the asynchronous, binary output of the sensor. The current standard for calibrating event cameras relies on either blinking patterns or event-based image reconstruction algorithms. These approaches are difficult to deploy in factory settings and are affected by noise and artifacts degrading the calibration performance. To bridge these limitations, we present E-Calib, a novel, fast, robust, and accurate calibration toolbox for event cameras utilizing the asymmetric circle grid, for its robustness to out-of-focus scenes. E-Calib introduces an efficient reweighted least squares (eRWLS) method for feature extraction of the calibration pattern circles with sub-pixel accuracy and robustness to noise. In addition, a modified hierarchical clustering algorithm is devised to detect the calibration grid apart from the background clutter. The proposed method is tested in a variety of rigorous experiments for different event camera models, on circle grids with different geometric properties, on varying calibration trajectories and speeds, and under challenging illumination conditions. The results show that our approach outperforms the state-of-the-art in detection success rate, reprojection error, and pose estimation accuracy.

Abstract:
This paper presents a novel and interpretable end-to-end learning framework, called the deep compensation unfolding network (DCUNet), for restoring light field (LF) images captured under low-light conditions. DCUNet is designed with a multi-stage architecture that mimics the optimization process of solving an inverse imaging problem in a data-driven fashion. The framework uses the intermediate enhanced result to estimate the illumination map, which is then employed in the unfolding process to produce a new enhanced result. Additionally, DCUNet includes a content-associated deep compensation module at each optimization stage to suppress noise and illumination map estimation errors. To properly mine and leverage the unique characteristics of LF images, this paper proposes a pseudo-explicit feature interaction module that comprehensively exploits redundant information in LF images. The experimental results on both simulated and real datasets demonstrate the superiority of our DCUNet over state-of-the-art methods, both qualitatively and quantitatively. Moreover, DCUNet preserves the essential geometric structure of enhanced LF images much better. The code is publicly available at https://github.com/lyuxianqiang/LFLL-DCU.

Abstract:
Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for enhancement, particularly when considering constraints on computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical features extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation challenges. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. During the global perception modeling, we devise an Efficient Transformer (ET) module streamlining the quadratic calculations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and compact model size, achieving 74.2% mIoU on Cityscapes and 71.1% mIoU on CamVid test datasets, with frame rates of 105FPS and 118FPS on a single 2080Ti GPU. The source codes are available at https://github.com/XU-GITHUB-curry/HAFormer.

Abstract:
Conventional image set methods typically learn from small to medium-sized image set datasets. However, when applied to large-scale image set applications such as classification and retrieval, they face two primary challenges: 1) effectively modeling complex image sets; and 2) efficiently performing tasks. To address the above issues, we propose a novel Multiple Riemannian Kernel Hashing (MRKH) method that leverages the powerful capabilities of Riemannian manifold and Hashing on effective and efficient image set representation. MRKH considers multiple heterogeneous Riemannian manifolds to represent each image set. It introduces a multiple kernel learning framework designed to effectively combine statistics from multiple manifolds, and constructs kernels by selecting a small set of anchor points, enabling efficient scalability for large-scale applications. In addition, MRKH further exploits inter- and intra-modal semantic structure to enhance discrimination. Instead of employing continuous feature to represent each image set, MRKH suggests learning hash code for each image set, thereby achieving efficient computation and storage. We present an iterative algorithm with theoretical convergence guarantee to optimize MRKH, and the computational complexity is linear with the size of dataset. Extensive experiments on five image set benchmark datasets including three large-scale ones demonstrate the proposed method outperforms state-of-the-arts in accuracy and efficiency particularly in large-scale image set classification and retrieval.

Abstract:
To reconstruct a 3D human surface from a single image, it is crucial to simultaneously consider human pose, shape, and clothing details. Recent approaches have combined parametric body models (such as SMPL), which capture body pose and shape priors, with neural implicit functions that flexibly learn clothing details. However, this combined representation introduces additional computation, e.g. signed distance calculation in 3D body feature extraction, leading to redundancy in the implicit query-and-infer process and failing to preserve the underlying body shape prior. To address these issues, we propose a novel IUVD-Feedback representation, consisting of an IUVD occupancy function and a feedback query algorithm. This representation replaces the time-consuming signed distance calculation with a simple linear transformation in the IUVD space, leveraging the SMPL UV maps. Additionally, it reduces redundant query points through a feedback mechanism, leading to more reasonable 3D body features and more effective query points, thereby preserving the parametric body prior. Moreover, the IUVD-Feedback representation can be embedded into any existing implicit human reconstruction pipeline without requiring modifications to the trained neural networks. Experiments on the THuman2.0 dataset demonstrate that the proposed IUVD-Feedback representation improves the robustness of results and achieves three times faster acceleration in the query-and-infer process. Furthermore, this representation holds potential for generative applications by leveraging its inherent semantic information from the parametric body model.

Abstract:
Recently, the transformer has achieved notable success in remote sensing (RS) change detection (CD). Its outstanding long-distance modeling ability can effectively recognize the change of interest (CoI). However, in order to obtain the precise pixel-level change regions, many methods directly integrate the stacked transformer blocks into the UNet-style structure, which causes the high computation costs. Besides, the existing methods generally consider bitemporal or differential features separately, which makes the utilization of ground semantic information still insufficient. In this paper, we propose the multiscale dual-space interactive perception network (MDIPNet) to fill these two gaps. On the one hand, we simplify the stacked multi-head transformer blocks into the single-layer single-head attention module and further introduce the lightweight parallel fusion module (LPFM) to perform the efficient information integration. On the other hand, based on the simplified attention mechanism, we propose the cross-space perception module (CSPM) to connect the bitemporal and differential feature spaces, which can help our model suppress the pseudo changes and mine the more abundant semantic consistency of CoI. Extensive experiment results on three challenging datasets and one urban expansion scene indicate that compared with the mainstream CD methods, our MDIPNet obtains the state-of-the-art (SOTA) performance while further controlling the computation costs.

Abstract:
Blind image super-resolution (SR) aims to recover a high-resolution (HR) image from its low-resolution (LR) counterpart under the assumption of unknown degradations. Many existing blind SR methods rely on supervising ground-truth kernels referred to as explicit degradation estimators. However, it is very challenging to obtain the ground-truths for different degradations kernels. Moreover, most of these methods rely on heavy backbone networks, which demand extensive computational resources. Implicit degradation estimators do not require the availability of ground truth kernels, but they see a significant performance gap with the explicit degradation estimators due to such missing information. We present a novel approach that significantly narrows such a gap by means of a lightweight architecture that implicitly learns the degradation kernel with the help of a novel loss component. The kernel is exploited by a learnable Wiener filter that performs efficient deconvolution in the Fourier domain by deriving a closed-form solution. Inspired by prompt-based learning, we also propose a novel degradation-conditioned prompt layer that exploits the estimated kernel to drive the focus on the discriminative contextual information that guides the reconstruction process in recovering the latent HR image. Extensive experiments under different degradation settings demonstrate that our model, named PL-IDENet, yields PSNR and SSIM improvements of more than 0.4dB and 1.3%, and 1.4dB and 4.8% to the best implicit and explicit blind-SR method, respectively. These results are achieved while maintaining a substantially lower number of parameters/FLOPs (i.e., 25% and 68% fewer parameters than best implicit and explicit methods, respectively).

Abstract:
In scenarios where identifying face information in the visible spectrum (VIS) is challenging due to poor lighting conditions, the use of near-infrared (NIR) and thermal (TH) cameras can provide viable alternatives. However, the unique data distribution of images captured by these cameras compared to VIS images presents challenges in matching face identities. To address these challenges, we propose a novel image transformation framework. The framework includes feature extraction from the input image, followed by a transformation network that generates target domain images with perceptual fidelity. Additionally, a reconstruction network preserves original information by reconstructing the original domain image from the extracted features. By considering the correlation between features from both domains, our framework utilizes paired data obtained from the same individual. We apply this framework to two well-established image-to-image transformation models, pix2pix and CycleGAN, known as CRC-pix2pix and CRC-CycleGAN respectively. The versatility of our approach allows extension to other models based on pix2pix or CycleGAN architectures. Our models generate high-quality images while preserving the identity information of the original face. Performance evaluation on TFW and BUAA NIR-VIS datasets demonstrates the superiority of our models in terms of generated image face matching and evaluation metrics such as SSIM, MSE, PSNR, and LPIPS. Moreover, we introduce the CQUPT-VIS-TH dataset, which enriches the paired dataset with thermal-visual face data capturing various angles and expressions.

Abstract:
This paper presents a convolutional neural network (CNN)-based enhancement to inter prediction in Versatile Video Coding (VVC). Our approach aims at improving the prediction signal of inter blocks with a residual CNN that incorporates spatial and temporal reference samples. It is motivated by the theoretical consideration that neural network-based methods have a higher degree of signal adaptivity than conventional signal processing methods and that spatially neighboring reference samples have the potential to improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. We show that adding a polyphase decomposition stage to the CNN results in a significantly better trade-off between computational complexity and coding performance. Incorporating spatial reference samples in the inter prediction process is challenging: The fact that the input of the CNN for one block may depend on the output of the CNN for preceding blocks prohibits parallel processing. We solve this by introducing a novel signal plane that contains specifically constrained reference samples, enabling parallel decoding while maintaining a high compression efficiency. Overall, experimental results show average bit rate savings of 4.07% and 3.47% for the random access (RA) and low-delay B (LB) configurations of the JVET common test conditions, respectively.

Abstract:
Pose estimation and tracking of objects is a fundamental application in 3D vision. Event cameras possess remarkable attributes such as high dynamic range, low latency, and resilience against motion blur, which enables them to address challenging high dynamic range scenes or high-speed motion. These features make event cameras an ideal complement over standard cameras for object pose estimation. In this work, we propose a line-based robust pose estimation and tracking method for planar or non-planar objects using an event camera. Firstly, we extract object lines directly from events, then provide an initial pose using a globally-optimal Branch-and-Bound approach, where 2D-3D line correspondences are not known in advance. Subsequently, we utilize event-line matching to establish correspondences between 2D events and 3D models. Furthermore, object poses are refined and continuously tracked by minimizing event-line distances. Events are assigned different weights based on these distances, employing robust estimation algorithms. To evaluate the precision of the proposed methods in object pose estimation and tracking, we have devised and established an event-based moving object dataset. Compared against state-of-the-art methods, the robustness and accuracy of our methods have been validated both on synthetic experiments and the proposed dataset. The source code is available at https://github.com/Zibin6/LOPET.

Abstract:
This work presents a new completion method that specifically designed for low-overlapping partial point cloud registration. Based on the assumption that the candidate partial point clouds to be registered belong to the same target, the proposed mutual prior based completion (MPC) method uses these candidate partial point clouds as completion reference for each other to extend their overlapping regions. Without relying on shape prior knowledge, MPC can work for different types of point clouds, such as object, room scene, and street view. The main challenge of this mutual reference approach is that partial clouds without spatial alignment cannot provide a reliable completion reference. Based on the mutual information maximization, a progressive completion structure is developed to achieve pose, feature representation and completion alignment between input point clouds. Experiments on public datasets show encouraging results. Especially for the low-overlapping cases, compared with the state-of-the-art (SOTA) models, the size of overlapping regions can be increased by about 15.0%, and the rotation and translation error can be reduced by 30.8% and 57.7% respectively.

Abstract:
The analysis and prediction of visual attention have long been crucial tasks in the fields of computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions, however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models considering text guidance. In this paper, we conduct a comprehensive study on text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding collected eye-tracking data. Based on the established SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models considering text influence, we construct a benchmark for the established SJTU-TIS database using state-of-the-art saliency models. Finally, considering the effect of text descriptions on visual attention, while most existing saliency models ignore this impact, we further propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict the image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and the pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: https://github.com/IntMeGroup/TGSal.

Abstract:
Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems, which extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. While previous studies have limitations in using only depth or height information, we find both depth and height matter and they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature is primarily focused on distinguishing between various categories of height intervals, essentially providing semantic context. This insight motivates the development of Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel’s depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also seamlessly integrated to further enhance the detection accuracy from the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset, demonstrating that CoBEV not only achieves the accuracy of the new state-of-the-art, but also significantly advances the robustness of previous methods in challenging long-distance scenarios and noisy camera disturbance, and enhances generalization by a large margin in heterologous Settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode. The source code will be made publicly available at CoBEV.

Abstract:
ConvNet-based object detection networks have achieved outstanding performance on clean images. However, many works have shown that these detectors perform poorly on corrupted images caused by noises, blurs, poor weather conditions and so on. With the development of security-sensitive applications, the detector’s practicability has raised increasing concerns. Existing approaches improve detector robustness via extra operations (image restoration or training on extra labeled data) or by applying adversarial training at the expense of performance degradation on clean images. In this paper, we present Selective Adversarial Learning with Constraints (SALC) as a universal detector training approach to simultaneously improve the detector’s precision and robustness. We first propose a unified formulation of adversarial samples for multitask adversarial learning, which significantly diversifies the obtained adversarial samples when integrated into the adversarial training of the detector. Next, we examine our findings on model bias against adversarial attacks of different strengths and differences in Batch Normalization (BN) statistics among clean images and different adversarial samples. On this basis, we propose a batch local comparison strategy with two BN branches to balance the detector’s accuracy and robustness. Furthermore, to avoid performance degradation caused by overwhelming subtask losses, we leverage task-aware ratio thresholds to control the influence of learning in each subtask. The proposed approach can be applied to various detectors without any extra labeled data, inference time costs, or model parameters. Extensive experiments show that our SALC achieves state-of-the-art results on both clean benchmarks (Pascal VOC and MS-COCO) and corruption benchmarks (Pascal VOC-C and MS-COCO-C).

Abstract:
Recent progress in blind face restoration has resulted in producing high-quality restored results for static images. However, efforts to extend these advancements to video scenarios have been minimal, partly because of the absence of benchmarks that allow for a comprehensive and fair comparison. In this work, we first present a fair evaluation benchmark, in which we first introduce a Real-world Low-Quality Face Video benchmark (RFV-LQ), evaluate several leading image-based face restoration algorithms, and conduct a thorough systematical analysis of the benefits and challenges associated with extending blind face image restoration algorithms to degraded face videos. Our analysis identifies several key issues, primarily categorized into two aspects: significant jitters in facial components and noise-shape flickering between frames. To address these issues, we propose a Temporal Consistency Network (TCN) cooperated with alignment smoothing to reduce jitters and flickers in restored videos. TCN is a flexible component that can be seamlessly plugged into the most advanced face image restoration algorithms, ensuring the quality of image-based restoration is maintained as closely as possible. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TCN and alignment smoothing operation.

Abstract:
Image imaging in the real world is based on physical imaging mechanisms. Existing super-resolution methods mainly focus on designing complex network structures to extract and fuse image features more effectively, but ignore the guiding role of physical imaging mechanisms for model design, and cannot mine features from a physical perspective. Inspired by the mechanism of physical imaging, we propose a novel network architecture called Virtual-Sensor Construction network (VSCNet) to simulate the sensor array inside the camera. Specifically, VSCNet first generates different splitting directions to distribute photons to construct virtual sensors, and then performs a multi-stage adaptive fine-tuning operation to fine-tune the number of photons on the virtual sensors to increase the photosensitive area and eliminate photon cross-talk, and finally converts the obtained photon distributions into RGB images. These operations can naturally be regarded as the virtual expansion of the camera’s sensor array in the feature space, which makes our VSCNet bridge the physical space and feature space, and uses their complementarity to mine more effective features to improve performance. Extensive experiments on various datasets show that the proposed VSCNet achieves state-of-the-art performance with fewer parameters. Moreover, we perform experiments to validate the connection between the proposed VSCNet and the physical imaging mechanism. The implementation code is available at https://github.com/GZ-T/VSCNet.

Abstract:
Scene text editing aims to replace the source text with the target text while preserving the original background. Its practical applications span various domains, such as data generation and privacy protection, highlighting its increasing importance in recent years. In this study, we propose a novel Scene Text Editing network with Explicitly-decoupled text transfer and Minimized background reconstruction, called STEEM. Unlike existing methods that usually fuse text style, text content, and background, our approach focuses on decoupling text style and content from the background and utilizes the minimized background reconstruction to reduce the impact of text replacement on the background. Specifically, the text-background separation module predicts the text mask of the scene text image, separating the source text from the background. Subsequently, the style-guided text transfer decoding module transfers the geometric and stylistic attributes of the source text to the content text, resulting in the target text. Next, the background and target text are combined to determine the minimal reconstruction area. Finally, the context-focused background reconstruction module is applied to the reconstruction area, producing the editing result. Furthermore, to ensure stable joint optimization of the four modules, a task-adaptive training optimization strategy has been devised. Experimental evaluations conducted on two popular datasets demonstrate the effectiveness of our approach. STEEM outperforms state-of-the-art methods, as evidenced by a reduction in the FID index from 29.48 to 24.67 and an increase in text recognition accuracy from 76.8% to 78.8%.

Abstract:
Accurate and automatic segmentation of medical images plays an essential role in clinical diagnosis and analysis. It has been established that integrating contextual relationships substantially enhances the representational ability of neural networks. Conventionally, Long Short-Term Memory (LSTM) and Self-Attention (SA) mechanisms have been recognized for their proficiency in capturing global dependencies within data. However, these mechanisms have typically been viewed as distinct modules without a direct linkage. This paper presents the integration of LSTM design with SA sparse coding as a key innovation. It uses linear combinations of LSTM states for SA’s query, key, and value (QKV) matrices to leverage LSTM’s capability for state compression and historical data retention. This approach aims to rectify the shortcomings of conventional sparse coding methods that overlook temporal information, thereby enhancing SA’s ability to do sparse coding and capture global dependencies. Building upon this premise, we introduce two innovative modules that weave the SA matrix into the LSTM state design in distinct manners, enabling LSTM to more adeptly model global dependencies and meld seamlessly with SA without accruing extra computational demands. Both modules are separately embedded into the U-shaped convolutional neural network architecture for handling both 2D and 3D medical images. Experimental evaluations on downstream medical image segmentation tasks reveal that our proposed modules not only excel on four extensively utilized datasets across various baselines but also enhance prediction accuracy, even on baselines that have already incorporated contextual modules. Code is available at https://github.com/yeshunlong/SALSTM.

Abstract:
Robust segmentation performance under dense fog is crucial for autonomous driving, but collecting labeled real foggy scene datasets is burdensome in the real world. To this end, existing methods have adapted models trained on labeled clear weather images to the unlabeled real foggy domain. However, these approaches require intermediate domain datasets (e.g. synthetic fog) and involve multi-stage training, making them cumbersome and less practical for real-world applications. In addition, the issue of overconfident pseudo-labels by a confidence score remains less explored in self-training for foggy scene adaptation. To resolve these issues, we propose a new framework, named DAEN, which Directly Adapts without additional datasets or multi-stage training and leverages an ENergy score in self-training. Notably, we integrate a High-order Style Matching (HSM) module into the network to match high-order statistics between clear weather features and real foggy features. HSM enables the network to implicitly learn complex fog distributions without relying on intermediate domain datasets or multi-stage training. Furthermore, we introduce Energy Score-based Pseudo-Labeling (ESPL) to mitigate the overconfidence issue of the confidence score in self-training. ESPL generates more reliable pseudo-labels through a pixel-wise energy score, thereby alleviating bias and preventing the model from assigning pseudo-labels exclusively to head classes. Extensive experiments demonstrate that DAEN achieves state-of-the-art performance on three real foggy scene datasets and exhibits a generalization ability to other adverse weather conditions. Code is available at https://github.com/jdg900/daen

Abstract:
Cost aggregation plays a critical role in existing stereo matching methods. In this paper, we revisit cost aggregation in stereo matching from disparity classification and propose a generic yet efficient Disparity Context Aggregation (DCA) module to improve the performance of CNN-based methods. Our approach is based on an insight that a coarse disparity class prior is beneficial to disparity regression. To obtain such a prior, we first classify pixels in an image into several disparity classes and treat pixels within the same class as homogeneous regions. We then generate homogeneous region representations and incorporate these representations into the cost volume to suppress irrelevant information while enhancing the matching ability for cost aggregation. With the help of homogeneous region representations, efficient and informative cost aggregation can be achieved with only a shallow 3D CNN. Our DCA module is fully-differentiable and well-compatible with different network architectures, which can be seamlessly plugged into existing networks to improve performance with small additional overheads. It is demonstrated that our DCA module can effectively exploit disparity class priors to improve the performance of cost aggregation. Based on our DCA, we design a highly accurate network named DCANet, which achieves state-of-the-art performance on several benchmarks.

Abstract:
Despite the progress made in domain adaptation, solving Unsupervised Domain Adaptation (UDA) problems with a general method under complex conditions caused by label shifts between domains remains a challenging task. In this work, we comprehensively investigate four distinct UDA settings including closed set domain adaptation, partial domain adaptation, open set domain adaptation, and universal domain adaptation, where shared common classes between source and target domains coexist alongside domain-specific private classes. The prominent challenges inherent in diverse UDA settings center around the discrimination of common/private classes and the precise measurement of domain discrepancy. To surmount these challenges effectively, we propose a novel yet effective method called Learning Instance Weighting for Unsupervised Domain Adaptation (LIWUDA), which caters to various UDA settings. Specifically, the proposed LIWUDA method constructs a weight network to assign weights to each instance based on its probability of belonging to common classes, and designs Weighted Optimal Transport (WOT) for domain alignment by leveraging instance weights. Additionally, the proposed LIWUDA method devises a Separate and Align (SA) loss to separate instances with low similarities and align instances with high similarities. To guide the learning of the weight network, Intra-domain Optimal Transport (IOT) is proposed to enforce the weights of instances in common classes to follow a uniform distribution. Through the integration of those three components, the proposed LIWUDA method demonstrates its capability to address all four UDA settings in a unified manner. Experimental evaluations conducted on four benchmark datasets substantiate the effectiveness of the proposed LIWUDA method. The code is available at https://github.com/JinjingZhu/LIWUDA.

Abstract:
Existing light field salient object detection (LFSOD) models predominantly rely on convolutional neural networks or local attention to process light field data, consequently encountering difficulties in modeling intra-slice and cross-slice long-range dependencies within focal stacks. In this paper, we ponder the feasibility of relying solely on the pure Transformer architecture to address this dilemma and propose a novel quasi-pure Transformer-based framework for LFSOD, termed TLFNet. TLFNet incorporates innovative Transformer-based fusion modules (PGFormer) along with an edge enhancement module. The PGFormer employs a perpendicular self-attention (PSA) mechanism to capture long-range dependencies along both cross-slice and intra-slice axes within the focal stack, and integrates multi-modal features using a guided feature fusion (GFF) module. To address the issue of blurry edges arising from the Transformer-based encoder-decoder architecture, the edge enhancement module combines detailed texture and body information and employs focal loss to improve the edge precision of salient objects. TLFNet is a nearly pure Transformer-based approach (with approximately 99.01% of its parameters belonging to the Transformer), while the edge enhancement module significantly boosts accuracy with only around 0.99% of parameters. Comprehensive benchmarks demonstrate that TLFNet outperforms 14 light field models and achieves new state-of-the-art performance. Last but not least, we show in this paper a new application scheme of TLFNet, by cooperating with the deep autofocus technique proposed by Herrmann et al. (2020), leading to light field salient object autofocus (LFSOA). LFSOA aims to identify and output the focal slice with a salient object in focus while keeping other irrelevant background blurred (out-of-focus), yielding an autonomous bokeh effect in photography. The code for the model and application will be publicly available at https://github.com/jiangyao-scu/TLFNet.

Abstract:
The misalignment between RGB and thermal images significantly impairs RGB-Thermal semantic segmentation accuracy. Current non-end-to-end methods treat RGB-Thermal registration independently of semantic segmentation, resulting in fusion errors, redundant computations, and poor real-time performance. Semantic segmentation accuracy directly correlates with registration precision: better registration yields more accurate segmentation. Moreover, regions with identical semantic labels, indicating the same object, tend to share similar registration offsets. Based on these correlations, we propose an end-to-end multimodal registration and segmentation method using flexible deformation fields. Our method utilizes a shared encoder for registration and semantic segmentation to reduce redundancy. Unlike traditional non-end-to-end approaches, it directly registers high-level perceptual features, thereby optimizing computational efficiency and real-time performance. Additionally, we employ a flexible deformation field to register RGB-Thermal data, addressing limitations of traditional affine transformations in handling non-coplanar and non-rigid registrations. However, the increased flexibility of deformation fields compared to affine transformations, and the sacrificing of geometric feature preservation, pose training challenges. To overcome this, we introduce a semantic alignment loss function to train the alignment module. This function calculates the semantic segmentation loss between the predictions from registered thermal features and RGB semantic labels. It shortens the gradient backpropagation path, aligning the objectives of registration and segmentation. We validate our end-to-end approach through extensive experiments, achieving significant performance enhancements. On the IR SEG dataset, our end-to-end method achieves state-of-the-art results with a mean Intersection over Union (mIoU) of 61.1% and a mean accuracy (mAcc) of 76.0%.

Abstract:
Human action recognition is an essential topic in computer vision and image processing. Graph convolutional networks (GCNs) have attracted significant attention and achieved noteworthy performance in skeleton-based human action recognition tasks. However, most of the previous graph-based works are designed to refine skeleton topology without considering the types of different joints and edges and the occurrence order of the frames. Such a limitation makes them insufficient to represent intrinsic semantic information. Differently, we proposed a dynamic semantic-based spatial-temporal graph convolution network (DS-STGCN) to address the challenge. DS-STGCN has two dynamic semantic modules for spatial and temporal contexts respectively. Specifically, the joints and edge types were encoded in the spatial module implicitly, and the occurrence order of frames was encoded in the temporal module implicitly. Extensive experiments on four datasets including NTU-RGB+D 60(120), Kinetics-400, and FineGYM show that our proposed two semantic modules can bring consistent recognition performance improvement with various backbones. Meanwhile, the proposed DS-STGCN notably surpassed state-of-the-art methods on these datasets. Notably, in the more challenging dataset, such as Kinetics-400, our model significantly outperformed other state-of-the-art GCN-based methods by a large margin. The code has been released at https://github.com/davelailai/DS-STGCN.

Abstract:
Unsupervised person re-identification (Re-ID) aims to learn semantic representations for person retrieval without using identity labels. Most existing methods generate fine-grained patch features to reduce noise in global feature clustering. However, these methods often compromise the discriminative semantic structure and overlook the semantic consistency between the patch and global features. To address these problems, we propose a Person Intrinsic Semantic Learning (PISL) framework with diffusion model for unsupervised person Re-ID. First, we design the Spatial Diffusion Model (SDM), which performs a denoising diffusion process from noisy spatial transformer parameters to semantic parameters, enabling the sampling of patches with intrinsic semantic structure. Second, we propose the Semantic Controlled Diffusion (SCD) loss to guide the denoising direction of the diffusion model, facilitating the generation of semantic patches. Third, we propose the Patch Semantic Consistency (PSC) loss to capture semantic consistency between the patch and global features, refining the pseudo-labels of global features. Comprehensive experiments on three challenging datasets show that our method surpasses current unsupervised Re-ID methods. The source code will be publicly available at https://github.com/taoxuefong/Diffusion-reid

Abstract:
High Dynamic Range (HDR) lighting plays a pivotal role in modern augmented and mixed-reality (AR/MR) applications, facilitating immersive experiences through realistic object insertion and dynamic relighting. However, the acquisition of precise HDR environment maps remains cost-prohibitive and impractical when using standard devices. To bridge this gap, this paper introduces SALENet, a novel deep network for estimating global lighting conditions from a single image, to effectively mitigate the need for resource-intensive acquisition methods. In contrast to earlier studies, we focus on exploring the inherent structural relationships within the lighting distribution. We design a hierarchical transformer-based neural network architecture with a proposed cross-attention mechanism between different resolution lighting source representations, optimizing the spatial distribution of lighting sources simultaneously for enhanced consistency. To further improve accuracy, a structure-based contrastive learning method is proposed to select positive-negative pairs based on lighting distribution similarity. By harnessing the synergy of hierarchical transformers and structure-based contrastive learning, our framework yields a significant enhancement in lighting prediction accuracy, enabling high-fidelity augmented and mixed reality to achieve cost-effectively immersive and realistic lighting effects.

Abstract:
We present two deep unfolding neural networks for the simultaneous tasks of background subtraction and foreground detection in video. Unlike conventional neural networks based on deep feature extraction, we incorporate domain-knowledge models by considering a masked variation of the robust principal component analysis problem (RPCA). With this approach, we separate video clips into low-rank and sparse components, respectively corresponding to the backgrounds and foreground masks indicating the presence of moving objects. Our models, coined ROMAN-S and ROMAN-R, map the iterations of two alternating direction of multipliers methods (ADMM) to trainable convolutional layers, and the proximal operators are mapped to non-linear activation functions with trainable thresholds. This approach leads to lightweight networks with enhanced interpretability that can be trained on limited data. In ROMAN-S, the correlation in time of successive binary masks is controlled with side-information based on \ell _1 - \ell _1 minimization. ROMAN-R enhances the foreground detection by learning a dictionary of atoms to represent the moving foreground in a high-dimensional feature space and by using reweighted- \ell _1 - \ell _1 minimization. Experiments are conducted on both synthetic and real video datasets, for which we also include an analysis of the generalization to unseen clips. Comparisons are made with existing deep unfolding RPCA neural networks, which do not use a mask formulation for the foreground, and with a 3D U-Net baseline. Results show that our proposed models outperform other deep unfolding networks, as well as the untrained optimization algorithms. ROMAN-R, in particular, is competitive with the U-Net baseline for foreground detection, with the additional advantage of providing video backgrounds and requiring substantially fewer training parameters and smaller training sets.

Abstract:
The optimization of prediction and update operators plays a prominent role in lifting-based image coding schemes. In this paper, we focus on learning the prediction and update models involved in a recent Fully Connected Neural Network (FCNN)-based lifting structure. While a straightforward approach consists in separately learning the different FCNN models by optimizing appropriate loss functions, jointly learning those models is a more challenging problem. To address this problem, we first consider a statistical model-based entropy loss function that yields a good approximation to the coding rate. Then, we develop a multi-scale optimization technique to learn all the FCNN models simultaneously. For this purpose, two loss functions defined across the different resolution levels of the proposed representation are investigated. While the first function combines standard prediction and update loss functions, the second one aims to obtain a good approximation to the rate-distortion criterion. Experimental results carried out on two standard image datasets, show the benefits of the proposed approaches in the context of lossy and lossless compression.

Abstract:
Cutmix-based data augmentation, which uses a cut-and-paste strategy, has shown remarkable generalization capabilities in deep learning. However, existing methods primarily consider global semantics with image-level constraints, which excessively reduces attention to the discriminative local context of the class and leads to a performance improvement bottleneck. Moreover, existing methods for generating augmented samples usually involve cutting and pasting rectangular or square regions, resulting in a loss of object part information. To mitigate the problem of inconsistency between the augmented image and the generated mixed label, existing methods usually require double forward propagation or rely on an external pre-trained network for object centering, which is inefficient. To overcome the above limitations, we propose LGCOAMix, an efficient context-aware and object-part-aware superpixel-based grid blending method for data augmentation. To the best of our knowledge, this is the first time that a label mixing strategy using a superpixel attention approach has been proposed for cutmix-based data augmentation. It is the first instance of learning local features from discriminative superpixel-wise regions and cross-image superpixel contrasts. Extensive experiments on various benchmark datasets show that LGCOAMix outperforms state-of-the-art cutmix-based data augmentation methods on classification tasks, and weakly supervised object location on CUB200-2011. We have demonstrated the effectiveness of LGCOAMix not only for CNN networks, but also for Transformer networks. Source codes are available at https://github.com/DanielaPlusPlus/LGCOAMix.

Abstract:
Accurate classification of nuclei communities is an important step towards timely treating the cancer spread. Graph theory provides an elegant way to represent and analyze nuclei communities within the histopathological landscape in order to perform tissue phenotyping and tumor profiling tasks. Many researchers have worked on recognizing nuclei regions within the histology images in order to grade cancerous progression. However, due to the high structural similarities between nuclei communities, defining a model that can accurately differentiate between nuclei pathological patterns still needs to be solved. To surmount this challenge, we present a novel approach, dubbed neural graph refinement, that enhances the capabilities of existing models to perform nuclei recognition tasks by employing graph representational learning and broadcasting processes. Based on the physical interaction of the nuclei, we first construct a fully connected graph in which nodes represent nuclei and adjacent nodes are connected to each other via an undirected edge. For each edge and node pair, appearance and geometric features are computed and are then utilized for generating the neural graph embeddings. These embeddings are used for diffusing contextual information to the neighboring nodes, all along a path traversing the whole graph to infer global information over an entire nuclei network and predict pathologically meaningful communities. Through rigorous evaluation of the proposed scheme across four public datasets, we showcase that learning such communities through neural graph refinement produces better results that outperform state-of-the-art methods.

Affiliations: School of Transportation and Civil Engineering, Nantong University, Nantong, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Faculty of Information Technology, the Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, the Beijing Laboratory of Smart Environmental Protection, the Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing Artificial Intelligence Institute, Beijing University of Technology, Beijing, China; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.; Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Depth image-based rendering (DIBR) techniques play an essential role in free-viewpoint videos (FVVs), which generate the virtual views from a reference 2D texture video and its associated depth information. However, the background regions occluded by the foreground in the reference view will be exposed in the synthesized view, resulting in obvious irregular holes in the synthesized view. To this end, this paper proposes a novel coarse and fine-grained fusion hierarchical network (CFFHNet) for hole filling, which fills the irregular holes produced by view synthesis using the spatial contextual correlations between the visible and hole regions. CFFHNet adopts recurrent calculation to learn the spatial contextual correlation, while the hierarchical structure and attention mechanism are introduced to guide the fine-grained fusion of cross-scale contextual features. To promote texture generation while maintaining fidelity, we equip CFFHNet with a two-stage framework involving an inference sub-network to generate the coarse synthetic result and a refinement sub-network for refinement. Meanwhile, to make the learned hole-filling model better adaptable and robust to the “foreground penetration” distortion, we trained CFFHNet by generating a batch of training samples by adding irregular holes to the foreground and background connection regions of high-quality images. Extensive experiments show the superiority of our CFFHNet over the current state-of-the-art DIBR methods. The source code will be available at https://github.com/wgc-vsfm/view-synthesis-CFFHNet.

Abstract:
The Visual Multimethod Assessment Fusion (VMAF) algorithm has recently emerged as a state-of-the-art approach to video quality prediction, that now pervades the streaming and social media industry. However, since VMAF requires the evaluation of a heterogeneous set of quality models, it is computationally expensive. Given other advances in hardware-accelerated encoding, quality assessment is emerging as a significant bottleneck in video compression pipelines. Towards alleviating this burden, we propose a novel Fusion of Unified Quality Evaluators (FUNQUE) framework, by enabling computation sharing and by using a transform that is sensitive to visual perception to boost accuracy. Further, we expand the FUNQUE framework to define a collection of improved low-complexity fused-feature models that advance the state-of-the-art of video quality performance with respect to both accuracy, by 4.2% to 5.3%, and computational efficiency, by factors of 3.8 to 11 times!

Abstract:
How to model the effect of reflection is crucial for single image reflection removal (SIRR) task. Modern SIRR methods usually simplify the reflection formulation with the assumption of linear combination of a transmission layer and a reflection layer. However, the large variations in image content and the real-world picture-taking conditions often result in far more complex reflection. In this paper, we introduce a new screen-blur combination based on two important factors, namely the intensity and the blurriness of reflection, to better characterize the reflection formulation in SIRR. Specifically, we present Screen-blur Reflection Networks (SRNet), which executes the screen-blur formulation in its network design and adapts to the complex reflection on real scenes. Technically, SRNet consists of three components: a blended image generator, a reflection estimator and a reflection removal module. The image generator exploits the screen-blur combination to synthesize the training blended images. The reflection estimator learns the reflection layer and a blur degree that measures the level of blurriness for reflection. The reflection removal module further uses the blended image, blur degree and reflection layer to filter out the transmission layer in a cascaded manner. Superior results on three different SIRR methods are reported when generating the training data on the principle of the screen-blur combination. Moreover, extensive experiments on six datasets quantitatively and qualitatively demonstrate the efficacy of SRNet over the state-of-the-art methods.

Abstract:
Pneumonia, a respiratory disease often caused by bacterial infection in the distal lung, requires rapid and accurate identification, especially in settings such as critical care. Initiating or de-escalating antimicrobials should ideally be guided by the quantification of pathogenic bacteria for effective treatment. Optical endomicroscopy is an emerging technology with the potential to expedite bacterial detection in the distal lung by enabling in vivo and in situ optical tissue characterisation. With advancements in detector technology, optical endomicroscopy can utilize fluorescence lifetime imaging (FLIM) to help detect events that were previously challenging or impossible to identify using fluorescence intensity imaging. In this paper, we propose an iterative Bayesian approach for bacterial detection in FLIM. We model the FLIM image as a linear combination of background intensity, Gaussian noise, and additive outliers (labelled bacteria). While previous bacteria detection methods model anomalous pixels as bacteria, here the FLIM outliers are modelled as circularly symmetric Gaussian-shaped objects, based on their discrete shape observed through visual analysis and the physical nature of the imaging modality. A Hierarchical Bayesian model is used to solve the bacterial detection problem where prior distributions are assigned to unknown parameters. A Metropolis-Hastings within Gibbs sampler draws samples from the posterior distribution. The proposed method’s detection performance is initially measured using synthetic images, and shows significant improvement over existing approaches. Further analysis is conducted on real optical endomicroscopy FLIM images annotated by trained personnel. The experiments show the proposed approach outperforms existing methods by a margin of +16.85% ( F_1 ) for detection accuracy.

Abstract:
Remarkable success of the existing Near-InfraRed and VISible (NIR-VIS) approaches owes to sufficient labeled training data. However, collecting and tagging data from different domains is a time-consuming and expensive task. In this paper, we tackle the NIR-VIS face recognition problem in a semi-supervised manner, termed as semi-supervised NIR-VIS Heterogeneous Face Recognition (NIR-VIS-sHFR). To cope with this problem, we propose a novel pseudo Label association and Prototype-based invariant Learning (LPL), consisting of three key components, i.e., Cross-domain pseudo Label Association (CLA), Intra-domain Compact Representation learning (ICR), and Prototype-based Inter-domain Invariant learning (PII). Firstly, the CLA iteratively builds inter-domain association graphs for pseudo-label association, subsequently facilitating cross-domain model development based on the generated pseudo-labels. Furthermore, the ICR is proposed to achieve the separation of in-domain features from different clusters and the aggregation of features from the same cluster, by performing cluster adaptation learning with prototype-based initialization. Finally, with the cross-domain pseudo-label training data produced by CLA, the PII explores potential domain-invariant and identity-related features, which employs cross-domain prototypes with identity-associated momentum updating to effectively guide inter-domain instances learning. The semi-supervised LPL method achieves comparable performance to recent supervised learning methods on multiple challenging NIR-VIS datasets, which demonstrates that the LPL is capable of learning robust cross-domain representations even without identity label information.

Abstract:
When we recognize images with the help of Artificial Neural Networks (ANNs), we often wonder how they make decisions. A widely accepted solution is to point out local features as decisive evidence. A question then arises: Can local features in the latent space of an ANN explain the model output to some extent? In this work, we propose a modularized framework named MemeNet that can construct a reliable surrogate from a Convolutional Neural Network (CNN) without changing its perception. Inspired by the idea of time series classification, this framework recognizes images in two steps. First, local representations named memes are extracted from the activation map of a CNN model. Then an image is transformed into a series of understandable features. Experimental results show that MemeNet can achieve accuracy comparable to most models’ through a set of reliable features and a simple classifier. Thus, it is a promising interface to use the internal dynamics of CNN, which represents a novel approach to constructing reliable models.

Abstract:
Weakly supervised point cloud semantic segmentation methods that require 1% or fewer labels with the aim of realizing almost the same performance as fully supervised approaches have recently attracted extensive research attention. A typical solution in this framework is to use self-training or pseudo-labeling to mine the supervision from the point cloud itself while ignoring the critical information from images. In fact, cameras widely exist in LiDAR scenarios, and this complementary information seems to be highly important for 3D applications. In this paper, we propose a novel cross-modality weakly supervised method for 3D segmentation that incorporates complementary information from unlabeled images. We design a dual-branch network equipped with an active labeling strategy to maximize the power of tiny parts of labels and to directly realize 2D-to-3D knowledge transfer. Afterward, we establish a cross-modal self-training framework, which iterates between parameter updating and pseudolabel estimation. In the training phase, we propose cross-modal association learning to mine complementary supervision from images by reinforcing the cycle consistency between 3D points and 2D superpixels. In the pseudolabel estimation phase, a pseudolabel self-rectification mechanism is derived to filter noisy labels, thus providing more accurate labels for the networks to be fully trained. The extensive experimental results demonstrate that our method even outperforms the state-of-the-art fully supervised competitors with less than 1% actively selected annotations.

Abstract:
To significantly enhance the performance of point cloud semantic segmentation, this manuscript presents a novel method for constructing large-scale networks and offers an effective lightweighting technique. First, a latent point feature processing (LPFP) module is utilized to interconnect base networks such as PointNet++ and Point Transformer. This intermediate module serves both as a feature information transfer and a ground truth supervision function. Furthermore, in order to alleviate the increase in computational costs brought by constructing large-scale networks and better adapt to the demand for terminal deployment, a novel point cloud lightweighting method for semantic segmentation network (PCLN) is proposed to compress the network by transferring multidimensional feature information of large-scale networks. Specifically, at different stages of the large-scale network, the structure and attention information of the point features are selectively transferred to guide the compressed network to train in the direction of the large-scale network. This paper also solves the problem of representing global structure information of large-scale point clouds through feature sampling and aggregation. Extensive experiments on public datasets and real-world data demonstrate that the proposed method can significantly improve the performance of different base networks and outperform the state-of-the-art.

Abstract:
Existing Cross-Domain Few-Shot Learning (CDFSL) methods require access to source domain data to train a model in the pre-training phase. However, due to increasing concerns about data privacy and the desire to reduce data transmission and training costs, it is necessary to develop a CDFSL solution without accessing source data. For this reason, this paper explores a Source-Free CDFSL (SF-CDFSL) problem, in which CDFSL is addressed through the use of existing pretrained models instead of training a model with source data, avoiding accessing source data. However, due to the lack of source data, we face two key challenges: effectively tackling CDFSL with limited labeled target samples, and the impossibility of addressing domain disparities by aligning source and target domain distributions. This paper proposes an Enhanced Information Maximization with Distance-Aware Contrastive Learning (IM-DCL) method to address these challenges. Firstly, we introduce the transductive mechanism for learning the query set. Secondly, information maximization (IM) is explored to map target samples into both individual certainty and global diversity predictions, helping the source model better fit the target data distribution. However, IM fails to learn the decision boundary of the target task. This motivates us to introduce a novel approach called Distance-Aware Contrastive Learning (DCL), in which we consider the entire feature set as both positive and negative sets, akin to Schrödinger’s concept of a dual state. Instead of a rigid separation between positive and negative sets, we employ a weighted distance calculation among features to establish a soft classification of the positive and negative sets for the entire feature set. We explore three types of negative weights to enhance the performance of CDFSL. Furthermore, we address issues related to IM by incorporating contrastive constraints between object features and their corresponding positive and negative sets. Evaluations of the 4 datasets in the BSCD-FSL benchmark indicate that the proposed IM-DCL, without accessing the source domain, demonstrates superiority over existing methods, especially in the distant domain task. Additionally, the ablation study and performance analysis confirmed the ability of IM-DCL to handle SF-CDFSL. The code will be made public at https://github.com/xuhuali-mxj/IM-DCL.

Abstract:
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications, its current dominant tasks focus on online detecting anomalies, which can be roughly interpreted as the binary or multiple event classification. However, such a setup that builds relationships between complicated anomalous events and single labels, e.g., “vandalism”, is superficial, since single labels are deficient to characterize anomalous events. In reality, users tend to search a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and positive but few researches focus on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos by cross-modalities, e.g., language descriptions and synchronous audios. Unlike the current video retrieval where videos are assumed to be temporally well-trimmed with short duration, VAR is devised to retrieve long untrimmed videos which may be partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks and design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between video-text fine-grained representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on two benchmarks reveal the challenges of VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR.

Abstract:
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks. Inspired by the characteristics of the human visual system, existing methods typically use a combination of global and local representations (i.e., multi-scale features) to achieve superior performance. However, most of them adopt simple linear fusion of multi-scale features, and neglect their possibly complex relationship and interaction. In contrast, humans typically first form a global impression to locate important regions and then focus on local details in those regions. We therefore propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions, named as TOPIQ. Our approach to IQA involves the design of a heuristic coarse-to-fine network (CFANet) that leverages multi-scale features and progressively propagates multi-level semantic information to low-level representations in a top-down manner. A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features guided by higher level features. This mechanism emphasizes active semantic regions for low-level distortions, thereby improving performance. TOPIQ can be used for both Full-Reference (FR) and No-Reference (NR) IQA. We use ResNet50 as its backbone and demonstrate that TOPIQ achieves better or competitive performance on most public FR and NR benchmarks compared with state-of-the-art methods based on vision transformers, while being much more efficient (with only ～ 13% FLOPS of the current best FR method). Codes are released at https://github.com/chaofengc/IQA-PyTorch.

Abstract:
Residual coding has gained prevalence in lossless compression, where a lossy layer is initially employed and the reconstruction errors (i.e., residues) are then losslessly compressed. The underlying principle of the residual coding revolves around the exploration of priors based on context modeling. Herein, we propose a residual coding framework for 3D medical images, involving the off-the-shelf video codec as the lossy layer and a Bilateral Context Modeling based Network (BCM-Net) as the residual layer. The BCM-Net is proposed to achieve efficient lossless compression of residues through exploring intra-slice and inter-slice bilateral contexts. In particular, a symmetry-based intra-slice context extraction (SICE) module is proposed to mine bilateral intra-slice correlations rooted in the inherent anatomical symmetry of 3D medical images. Moreover, a bi-directional inter-slice context extraction (BICE) module is designed to explore bilateral inter-slice correlations from bi-directional references, thereby yielding representative inter-slice context. Experiments on popular 3D medical image datasets demonstrate that the proposed method can outperform existing state-of-the-art methods owing to efficient redundancy reduction. Our code will be available on GitHub for future research.

Abstract:
High quality image reconstruction from undersampled k -space data is key to accelerating MR scanning. Current deep learning methods are limited by the small receptive fields in reconstruction networks, which restrict the exploitation of long-range information, and impede the mitigation of full-image artifacts, particularly in 3D reconstruction tasks. Additionally, the substantial computational demands of 3D reconstruction considerably hinder advancements in related fields. To tackle these challenges, we propose the following: 1) A novel convolution operator named Faster Fourier Convolution (FasterFC), aims at providing an adaptable broad receptive field for spatial domain reconstruction networks with fast computational speed. 2) A split-slice strategy that substantially reduces the computational load of 3D reconstruction, enabling high-resolution, multi-coil, 3D MR image reconstruction while fully utilizing inter-layer and intra-layer information. 3) A single-to-group algorithm that efficiently utilizes scan-specific and data-driven priors to enhance k -space interpolation effects. 4) A multi-stage, multi-coil, 3D fast MRI method, called the faster Fourier convolution based single-to-group network (FAS-Net), comprising a single-to-group k -space interpolation algorithm and a FasterFC-based image domain reconstruction module, significantly minimizes the computational demands of 3D reconstruction through split-slice strategy. Experimental evaluations conducted on the NYU fastMRI and Stanford MRI Data datasets reveal that the FasterFC significantly enhances the quality of both 2D and 3D reconstruction results. Moreover, FAS-Net, characterized as a method that can achieve high-resolution (320, 320, 256), multi-coil, (8 coils), 3D fast MRI, exhibits superior reconstruction performance compared to other state-of-the-art 2D and 3D methods.

Abstract:
Light sources are usually described by luminous intensity, color temperature and color rendering index, while the harshness of the light source is often overlooked. It is also referred to as the softness or light quality and can be determined by analyzing the shadows it produces. Shadow detection is often addressed in image analysis research, but the methods usually provide solutions for removing shadows. In this research, a method is proposed for isolating shadows from test images, applying thresholding for segmentation of the shadow into umbral and penumbral regions, calculating the shadow metric, and finally calculating the final numerical evaluation of the quality of the light source, i.e., the degree of its softness or harshness. The method proved to be efficient in analyzing halogen, LED and xenon light sources in conjunction with light-shaping attachments that soften the light. The method is not influenced by the intensity of the light source and focuses exclusively on the quality of the light. The method provides a single value to describe this light property and has the potential to become a standardized method for describing the quality of light sources and light-shaping attachments.

Abstract:
Domain Generalization (DG) aims to learn a generalizable model on the unseen target domain by only training on the multiple observed source domains. Although a variety of DG methods have focused on extracting domain-invariant features, the domain-specific class-relevant features have attracted attention and been argued to benefit generalization to the unseen target domain. To take into account the class-relevant domain-specific information, in this paper we propose an Information theory iNspired diSentanglement and pURification modEl (INSURE) to explicitly disentangle the latent features to obtain sufficient and compact (necessary) class-relevant feature for generalization to the unseen domain. Specifically, we first propose an information theory inspired loss function to ensure the disentangled class-relevant features contain sufficient class label information and the other disentangled auxiliary feature has sufficient domain information. We further propose a paired purification loss function to let the auxiliary feature discard all the class-relevant information and thus the class-relevant feature will contain sufficient and compact (necessary) class-relevant information. Moreover, instead of using multiple encoders, we propose to use a learnable binary mask as our disentangler to make the disentanglement more efficient and make the disentangled features complementary to each other. We conduct extensive experiments on five widely used DG benchmark datasets including PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. The proposed INSURE achieves state-of-the-art performance. We also empirically show that domain-specific class-relevant features are beneficial for domain generalization. The code is available at https://github.com/yuxi120407/INSURE.

Abstract:
The fusion of magnetic resonance imaging and positron emission tomography can combine biological anatomical information and physiological metabolic information, which is of great significance for the clinical diagnosis and localization of lesions. In this paper, we propose a novel adaptive linear fusion method for multi-dimensional features of brain magnetic resonance and positron emission tomography images based on a convolutional neural network, termed as MdAFuse. First, in the feature extraction stage, three-dimensional feature extraction modules are constructed to extract coarse, fine, and multi-scale information features from the source image. Second, at the fusion stage, the affine mapping function of multi-dimensional features is established to maintain a constant geometric relationship between the features, which can effectively utilize structural information from a feature map to achieve a better reconstruction effect. Furthermore, our MdAFuse comprises a key feature visualization enhancement algorithm designed to observe the dynamic growth of brain lesions, which can facilitate the early diagnosis and treatment of brain tumors. Extensive experimental results demonstrate that our method is superior to existing fusion methods in terms of visual perception and nine kinds of objective image fusion metrics. Specifically, in the results of MR-PET fusion, the SSIM (Structural Similarity) and VIF (Visual Information Fidelity) metrics show improvements of 5.61% and 13.76%, respectively, compared to the current state-of-the-art algorithm. Our project is publicly available at: https://github.com/22385wjy/MdAFuse.

Abstract:
Part-level 3D shape representations are crucial to shape reasoning and understanding. Two key sub-tasks are: 1) shape abstraction, creating primitive-based object parts; and 2) shape segmentation, finding partition-based object parts. However, for 3D object point clouds, most advanced methods produce parts relying on task-specific priors, such as similarity metrics and primitive geometries, resulting in misleading parts that deviate from semantics. To address prior limitations, we establish a foundation for joint shape abstraction and shape segmentation as formal linear transformations within a shared latent space, encapsulating essential dual-purpose membership information linking points and object parts for mutual reinforcement. We demonstrate that the transformations are underpinned by a derivation based on k-means, Non-negative Matrix Factorization (NMF), and the attention mechanism. As a result, we introduce Latent Membership Pursuit (LMP) for joint optimization of shape abstraction and segmentation. LMP utilizes a shared latent representation of object part membership to autonomously identify common object parts in both tasks without any supervision and priors. Furthermore, we adapt deformable superquadrics (DSQs) for primitives to capture variable part-level geometric and semantic information. Experiments on benchmark datasets validate that our approach enables mutual learning of shape abstraction and segmentation, and promotes consistent interpretations of 3D object shapes across instances and even categories in a fully unsupervised manner.

Abstract:
For capturing dynamic scenes with ultra-fast motion, neuromorphic cameras with extremely high temporal resolution have demonstrated their great capability and potential. Different from the event cameras that only record relative changes in light intensity, spike camera fires a stream of spikes according to a full-time accumulation of photons so that it can recover the texture details for both static areas and dynamic areas. Recently, color spike camera has been invented to record color information of dynamic scenes using a color filter array (CFA). However, demosaicing for color spike cameras is an open and challenging problem. In this paper, we develop a demosaicing network, called CSpkNet, to reconstruct dynamic color visual signals from the spike stream captured by the color spike camera. Firstly, we develop a light inference module to convert binary spike streams to intensity estimates. In particular, a feature-based channel attention module is proposed to reduce the noises caused by quantization errors. Secondly, considering both the Bayer configuration and object motion, we propose a motion-guided filtering module to estimate the missing pixels of each color channel, without undesired motion blur. Finally, we design a refinement module to improve the intensity and details, utilizing the color correlation. Experimental results demonstrate that CSpkNet can reconstruct color images from the Bayer-pattern spike stream with promising visual quality.

Abstract:
Medical image segmentation and registration are two fundamental and highly related tasks. However, current works focus on the mutual promotion between the two at the loss function level, ignoring the feature information generated by the encoder-decoder network during the task-specific feature mapping process and the potential inter-task feature relationship. This paper proposes a unified multi-task joint learning framework based on bi-fusion of structure and deformation at multi-scale, called BFM-Net, which simultaneously achieves the segmentation results and deformation field in a single-step estimation. BFM-Net consists of a segmentation subnetwork (SegNet), a registration subnetwork (RegNet), and the multi-task connection module (MTC). The MTC module is used to transfer the latent feature representation between segmentation and registration at multi-scale and link different tasks at the network architecture level, including the spatial attention fusion module (SAF), the multi-scale spatial attention fusion module (MSAF) and the velocity field fusion module (VFF). Extensive experiments on MR, CT and ultrasound images demonstrate the effectiveness of our approach. The MTC module can increase the Dice scores of segmentation and registration by 3.2%, 1.6%, 2.2%, and 6.2%, 4.5%, 3.0%, respectively. Compared with six state-of-the-art algorithms for segmentation and registration, BFM-Net can achieve superior performance in various modal images, fully demonstrating its effectiveness and generalization.

Abstract:
Freezing of gait (FoG) is a common disabling symptom of Parkinson’s disease (PD). It is clinically characterized by sudden and transient walking interruptions for specific human body parts, and it presents the localization in time and space. Due to the difficulty in extracting global fine-grained features from lengthy videos, developing an automated five-point FoG scoring system is quite challenging. Therefore, we propose a novel video-based automated five-classification FoG assessment method with a causality-enhanced multiple-instance-learning graph convolutional network (GCN). This method involves developing a temporal segmentation GCN to segment each video into three motion stages for stage-level feature modeling, followed by a multiple-instance-learning framework to divide each stage into short clips for instance-level feature extraction. Subsequently, an uncertainty-driven multiple-instance-learning GCN is developed to capture spatial and temporal fine-grained features through GCN scheme and uncertainty learning, respectively, for acquiring global representations. Finally, a causality-enhanced graph generation strategy is proposed to exploit causal inference for mining and enhancing human structures causally related to clinical assessment, thereby extracting spatial causal features. Extensive experimental results demonstrate the excellent performance of the proposed method on five-classification FoG assessment with an accuracy of 62.72% and an acceptable accuracy of 91.32%, which is confirmed by independent testing. Additionally, it enables temporal and spatial localization of FoG events to a certain extent, facilitating reasonable clinical interpretations. In conclusion, our method provides a valuable tool for automated FoG assessment in PD, and the proposed causality-related component exhibits promising potential for extension to other general and medical fine-grained action recognition tasks.

Abstract:
In this paper, we consider decomposing an image into its cartoon and texture components. Traditional methods, which mainly rely on the gradient amplitude of images to distinguish between these components, often show limitations in decomposing small-scale, high-contrast texture patterns and large-scale, low-contrast structural components. Specifically, these methods tend to decompose the former to the cartoon image and the latter to the texture image, neglecting the scale features inherent in both components. To overcome these challenges, we introduce a new variational model which incorporates an L_0 -based total variation norm for the cartoon component and an L_2 norm for the scale space representation of the texture component. We show that the texture component has a small L_2 norm in the scale space representation. We apply a quadratic penalty function to handle the non-separable L_0 norm minimization problem. Numerical experiments are given to illustrate the efficiency and effectiveness of our approach.

Abstract:
Brain region-of-interest (ROI) segmentation with magnetic resonance (MR) images is a basic prerequisite step for brain analysis. The main problem with using deep learning for brain ROI segmentation is the lack of sufficient annotated data. To address this issue, in this paper, we propose a simple multi-atlas supervised contrastive learning framework (MAS-CL) for brain ROI segmentation with MR images in an end-to-end manner. Specifically, our MAS-CL framework mainly consists of two steps, including 1) a multi-atlas supervised contrastive learning method to learn the latent representation using a limited amount of voxel-level labeling brain MR images, and 2) brain ROI segmentation based on the pre-trained backbone using our MSA-CL method. Specifically, different from traditional contrastive learning, in our proposed method, we use multi-atlas supervised information to pre-train the backbone for learning the latent representation of input MR image, i.e., the correlation of each sample pair is defined by using the label maps of input MR image and atlas images. Then, we extend the pre-trained backbone to segment brain ROI with MR images. We perform our proposed MAS-CL framework with five segmentation methods on LONI-LPBA40, IXI, OASIS, ADNI, and CC359 datasets for brain ROI segmentation with MR images. Various experimental results suggested that our proposed MAS-CL framework can significantly improve the segmentation performance on these five datasets.

Abstract:
Multimodal remote sensing image recognition is a popular research topic in the field of remote sensing. This recognition task is mostly solved by supervised learning methods that heavily rely on manually labeled data. When the labels are absent, the recognition is challenging for the large data size, complex land-cover distribution and large modality spectrum variation. In this paper, a novel unsupervised method, named fast projected fuzzy clustering with anchor guidance (FPFC), is proposed for multimodal remote sensing imagery. Specifically, according to the spatial distribution of land covers, meaningful superpixels are obtained for denoising and generating high-quality anchor. The denoised data and anchors are projected into the optimal subspace to jointly learn the shared anchor graph as well as the shared anchor membership matrix from different modalities in an adaptively weighted manner to accelerate the clustering process. Finally, the shared anchor graph and shared anchor membership matrix are combined to derive clustering labels for all pixels. An effective alternating optimization algorithm is designed to solve the proposed formulation. This is the first attempt to propose a soft clustering method for large-scale multimodal remote sensing data. Experiments show that the proposed FPFC achieves 81.34%, 55.43% and 93.34% clustering accuracies on the three datasets and outperforms the state-of-the-art methods. The source code is released at https://github.com/ZhangYongshan/FPFC.

Abstract:
Recent attention has been devoted to the pursuit of learning semantic segmentation models exclusively from image tags, a paradigm known as image-level Weakly Supervised Semantic Segmentation (WSSS). Existing attempts adopt the Class Activation Maps (CAMs) as priors to mine object regions yet observe the imbalanced activation issue, where only the most discriminative object parts are located. In this paper, we argue that the distribution discrepancy between the discriminative and the non-discriminative parts of objects prevents the model from producing complete and precise pseudo masks as ground truths. For this purpose, we propose a Pixel-Level Domain Adaptation (PLDA) method to encourage the model in learning pixel-wise domain-invariant features. Specifically, a multi-head domain classifier trained adversarially with the feature extraction is introduced to promote the emergence of pixel features that are invariant with respect to the shift between the source (i.e., the discriminative object parts) and the target (i.e., the non-discriminative object parts) domains. In addition, we come up with a Confident Pseudo-Supervision strategy to guarantee the discriminative ability of each pixel for the segmentation task, which serves as a complement to the intra-image domain adversarial training. Our method is conceptually simple, intuitive and can be easily integrated into existing WSSS methods. Taking several strong baseline models as instances, we experimentally demonstrate the effectiveness of our approach under a wide range of settings.

Abstract:
As a promising topic in palmprint recognition, cross-dataset palmprint recognition is attracting more and more research interests. In this paper, a more difficult yet realistic scenario is studied, i.e., Single-Source Cross-Dataset Palmprint Recognition with Unseen Target dataset (S2CDPR-UT). It is aimed to generalize a palmprint feature extractor trained only on a single source dataset to multiple unseen target datasets collected by different devices or environments. To combat this challenge, we propose a novel method to improve the generalization of feature extractor for S2CDPR-UT, named Generating stylIzed FeaTures (GIFT). Firstly, the raw features are decoupled into high- and low- frequency components. Then, a feature stylization module is constructed to perturb the mean and variance of low-frequency components to generate more stylized features, which can provided more valuable knowledge. Furthermore, two diversity enhancement and consistency preservation supervisions are introduced at feature level to help to learn the model. The former is aimed to enhance the diversity of stylized features to expand the feature space. Meanwhile, the later is aimed to maintain the semantic consistency to ensure accurate palmprint recognition. Extensive experiments carried out on CASIA Multi-Spectral, XJTU-UP, and MPD palmprint databases show that our GIFT method can achieve significant improvement of performance over other methods. The codes will be released at https://github.com/HuikaiShao/GIFT.

Affiliations: School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Weiyang Campus, Xi’an, China; Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, NPU, Xi’an, China; School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an, China; School of Engineering Science, University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, Guangdong, China

Abstract:
Facial action units (AUs) focus on a comprehensive set of atomic facial muscle movements for human expression understanding. Based on supervised learning, discriminative AU representation can be achieved from local patches where the AUs are located. Unfortunately, accurate AU localization and characterization are challenged by the tremendous manual annotations, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) to learn AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometric meaningful components, together with 2D flow field AUs. By constructing the canonical neutral face, posed neutral face, and posed expressional face gradually, these components can be disentangled without supervision, therefore the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth of emotion state or AU intensity, two priori knowledge based assumptions are proposed: 1) identity consistency, which explores the identical albedos and depths of different frames in a face video, and helps to learn the camera color mode as an extra cue for canonical neutral face recovery. 2) average face, which enables the model to discover a ‘neutral facial expression’ of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design self-supervised AU representation learning method based on the definition of AUs. Substantial experiments on benchmark datasets have demonstrated the superior performance of the proposed work in comparison to other state-of-the-art approaches, as well as an outstanding capability of decomposing input face into meaningful factors for its reconstruction. The code is made available at https://github.com/Sunner4nwpu/SsupAU.

Abstract:
Fine-grained visual categorization (FGVC) aims to distinguish visual objects from multiple subcategories of the coarse-grained category. Subtle inter-class differences among various subcategories make the FGVC task more challenging. Existing methods primarily focus on learning salient visual patterns while ignoring how to capture the object’s internal structure, causing difficulty in obtaining complete discriminative regions within the object to limit FGVC performance. To address the above issue, we propose a Structure Information Mining and Object-aware Feature Enhancement (SIM-OFE) method for fine-grained visual categorization, which explores the visual object’s internal structure composition and appearance traits. Concretely, we first propose a simple yet effective hybrid perception attention module for locating visual objects based on global-scope and local-scope significance analyses. Then, a structure information mining module is proposed to model the distribution and context relation of critical regions within the object, highlighting the whole object and discriminative regions for distinguishing subtle differences. Finally, an object-aware feature enhancement module is proposed to combine global-scope and local-scope discriminative features in an attentive coupling way for powerful visual representations in fine-grained recognition. Extensive experiments on three FGVC benchmark datasets demonstrate that our proposed SIM-OFE method can achieve state-of-the-art performance.

Abstract:
Object detection methods have achieved remarkable performances when the training and testing data satisfy the assumption of i.i.d. However, the training and testing data may be collected from different domains, and the gap between the domains can significantly degrade the detectors. Test Time Adaptive Object Detection (TTA-OD) is a novel online approach that aims to adapt detectors quickly and make predictions during the testing procedure. TTA-OD is more realistic than the existing unsupervised domain adaptation and source-free unsupervised domain adaptation approaches. For example, self-driving cars need to improve their perception of new environments in the TTA-OD paradigm during driving. To address this, we propose a multi-level feature alignment (MLFA) method for TTA-OD, which is able to adapt the model online based on the steaming target domain data. For a more straightforward adaptation, we select informative foreground and background features from image feature maps and capture their distributions using probabilistic models. Our approach includes: i) global-level feature alignment to align all informative feature distributions, thereby encouraging detectors to extract domain-invariant features, and ii) cluster-level feature alignment to match feature distributions for each category cluster across different domains. Through the multi-level alignment, we can prompt detectors to extract domain-invariant features, as well as align the category-specific components of image features from distinct domains. We conduct extensive experiments to verify the effectiveness of our proposed method. Our code is accessible at https://github.com/yaboliudotug/MLFA.

Abstract:
3D reconstruction is a fundamental task in robotics and AI, providing a prerequisite for many related applications. Fringe projection profilometry is an efficient and non-contact method for generating 3D point clouds out of 2D images. However, during the actual measurement, it is inevitable to experiment with translucent objects, such as skin, marble, and fruit. Indirect illumination from these objects has substantially compromised the precision of 3D reconstruction via the contamination of 2D images. This paper presents a fast and accurate approach to correct for indirect illumination. The essential idea is to design a highly suitable network architecture founded on a precise error model that facilitates accurate error rectification. Initially, our method transforms the error generated by indirect illumination into a sine series. Based on this error model, the multilayer perceptron is more effective in error correction than traditional methods and convolutional neural networks. Our network was trained solely on simulated data but was tested on authentic images. Three sets of experiments, including two sets of comparison experiments, indicate that the designed network can efficiently rectify the error induced by indirect illumination.

Abstract:
Video anomaly detection (VAD) aims at localizing the snippets containing anomalous events in long unconstrained videos. The weakly supervised (WS) setting, where solely video-level labels are available during training, has attracted considerable attention, owing to its satisfactory trade-off between the detection performance and annotation cost. However, due to lack of snippet-level dense labels, the existing WS-VAD methods still get easily stuck on the detection errors, caused by false alarms and incomplete localization. To address this dilemma, in this paper, we propose to inject text clues of anomaly-event categories for improving WS-VAD, via a dedicated dual-branch framework. For suppressing the response of confusing normal contexts, we first present a text-guided anomaly discovering (TAG) branch based on a hierarchical matching scheme, which utilizes the label-text queries to search the discriminative anomalous snippets in a global-to-local fashion. To facilitate the completeness of anomaly-instance localization, an anomaly-conditioned text completion (ATC) branch is further designed to perform an auxiliary generative task, which intrinsically forces the model to gather sufficient event semantics from all the relevant anomalous snippets for completely reconstructing the masked description sentence. Furthermore, to encourage the cross-branch knowledge sharing, a mutual learning strategy is introduced by imposing a consistency constraint on the anomaly scores of these two branches. Extensive experimental results on two public benchmarks validate that the proposed method achieves superior performance over the competing methods.

Abstract:
Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at https://github.com/Garyson1204/HGINet.

Abstract:
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selectionstrategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods.

Abstract:
Recently, unsupervised domain adaptation (UDA) for 3D object detectors has increasingly garnered attention as a method to eliminate the prohibitive costs associated with generating extensive 3D annotations, which are crucial for effective model training. Self-training (ST) has emerged as a simple and effective technique for UDA. The major issue involved in ST-UDA for 3D object detection is refining the imprecise predictions caused by domain shift and generating accurate pseudo labels as supervisory signals. This study presents a novel ST-UDA framework to generate high-quality pseudo labels by associating predictions of 3D point cloud sequences during ego-motion according to spatial and temporal consistency, named motion-associated self-training for 3D object detection (MA-ST3D). MA-ST3D maintains a global-local pathway (GLP) architecture to generate high-quality pseudo-labels by leveraging both intra-frame and inter-frame consistencies along the spatial dimension of the LiDAR’s ego-motion. It also equips two memory modules for both global and local pathways, called global memory and local memory, to suppress the temporal fluctuation of pseudo-labels during self-training iterations. In addition, a motion-aware loss is introduced to impose discriminated regulations on pseudo labels with different motion statuses, which mitigates the harmful spread of false positive pseudo labels. Finally, our method is evaluated on three representative domain adaptation tasks on authoritative 3D benchmark datasets (i.e. Waymo, Kitti, and nuScenes). MA-ST3D achieved SOTA performance on all evaluated UDA settings and even surpassed the weakly supervised DA methods on the Kitti and NuScenes object detection benchmark.

Abstract:
Medical images often exhibit intricate structures, inhomogeneous intensity, significant noise and blurred edges, presenting challenges for medical image segmentation. Several segmentation algorithms grounded in mathematics, computer science, and medical domains have been proposed to address this matter; nevertheless, there is still considerable scope for improvement. This paper proposes a novel adaptive anatomical structure-based two-layer level set framework (AS2LS) for segmenting organs with concentric structures, such as the left ventricle and the fundus. By adaptive fitting region and edge intensity information, the AS2LS achieves high accuracy in segmenting complex medical images characterized by inhomogeneous intensity, blurred boundaries and interference from surrounding organs. Moreover, we introduce a novel two-layer level set representation based on anatomical structures, coupled with a two-stage level set evolution algorithm. Experimental results demonstrate the superior accuracy of AS2LS in comparison to representative level set methods and deep learning methods.

Abstract:
In pose estimation for objects with rotational symmetry, ambiguous poses may arise, and the symmetry axes of objects are crucial for eliminating such ambiguities. Currently, in pose estimation, reliance on manual settings of symmetry axes decreases the accuracy of pose estimation. To address this issue, this method proposes determining the orders of symmetry axes and angles between axes based on a given rotational symmetry type or polyhedron, reducing the need for manual settings of symmetry axes. Subsequently, two key axes with the highest orders are defined and localized, then three orthogonal axes are generated based on key axes, while each symmetry axis can be computed utilizing orthogonal axes. Compared to localizing symmetry axes one by one, the key-axis-based symmetry axis localization is more efficient. To support geometric and texture symmetry, the method utilizes the ADI metric for key axis localization in geometrically symmetric objects and proposes a novel metric, ADI-C, for objects with texture symmetry. Experimental results on the LM-O and HB datasets demonstrate a 9.80% reduction in symmetry axis localization error and a 1.64% improvement in pose estimation accuracy. Additionally, the method introduces a new dataset, DSRSTO, to illustrate its performance across seven types of geometrically and texturally symmetric objects. The GitHub link for the open-source tool based on this method is https://github.com/WangYuLin-SEU/KASAL.

Abstract:
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.

Abstract:
Convolutional Neural Networks (CNNs) have achieved remarkable progress in arbitrary artistic style transfer. However, the model size of existing state-of-the-art (SOTA) style transfer algorithms is immense, leading to enormous computational costs and memory demand. It makes real-time and high resolution hard for GPUs with limited memory and limits the application on mobile devices. This paper proposes a novel arbitrary artistic style transfer algorithm, KBStyle, whose model size is only 200 KB. Firstly, we design a style transfer network where the style encoder, content encoder, and corresponding decoder are custom designed to guarantee low computational cost and high shape retention. Besides, the weighted style loss function is presented to improve the performance of style migration. Then, we propose a novel knowledge distillation method (Symmetric Knowledge Distillation, SKD) for encoder-decoder-based style transfer models, which redefines the knowledge and symmetrically compresses the encoder and decoder. With the SKD, the proposed style transfer network is further compressed by 14 times to achieve the KBStyle. Experimental results demonstrate that the proposed SKD method achieves comparable results with other SOTA knowledge distillation algorithms for style transfer. Besides, the proposed KBStyle achieves high-quality stylized images. And the inference time of the KBStyle on an Nvidia TITAN RTX GPU is only 20 ms when the resolutions of the content image and style image are both 2k-resolution ( 2048× 1080 ). Moreover, the 200 KB model size of KBStyle is much smaller than the SOTA models and facilitates style transfer on mobile devices.

Abstract:
The accelerated proliferation of visual content and the rapid development of machine vision technologies bring significant challenges in delivering visual data on a gigantic scale, which shall be effectively represented to satisfy both human and machine requirements. In this work, we investigate how hierarchical representations derived from the advanced generative prior facilitate constructing an efficient scalable coding paradigm for human-machine collaborative vision. Our key insight is that by exploiting the StyleGAN prior, we can learn three-layered representations encoding hierarchical semantics, which are elaborately designed into the basic, middle, and enhanced layers, supporting machine intelligence and human visual perception in a progressive fashion. With the aim of achieving efficient compression, we propose the layer-wise scalable entropy transformer to reduce the redundancy between layers. Based on the multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized to achieve optimal machine analysis performance, human perception experience, and compression ratio. We validate the proposed paradigm’s feasibility in face image compression. Extensive qualitative and quantitative experimental results demonstrate the superiority of the proposed paradigm over the latest compression standard Versatile Video Coding (VVC) in terms of both machine analysis as well as human perception at extremely low bitrates (< 0.01 bpp), offering new insights for human-machine collaborative compression.

Affiliations: School of Engineering and Technology, University of New South Wales, Canberra, ACT, Australia; School of Information and Communication Technology, Griffith University, Nathan, QLD, Australia; Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education of China, International Research Center of Intelligent Perception and Computation, School of Artificial Intelligence, Xidian University, Xi’an, China; School of Systems and Computing, University of New South Wales, Canberra, ACT, Australia

Abstract:
This paper proposes a novel uncertainty-adjusted label transition (UALT) method for weakly supervised solar panel mapping (WS-SPM) in aerial Images. In weakly supervised learning (WSL), the noisy nature of pseudo labels (PLs) often leads to poor model performance. To address this problem, we formulate the task as a label-noise learning problem and build a statistically consistent mapping model by estimating the instance-dependent transition matrix (IDTM). We propose to estimate the IDTM with a parameterized label transition network describing the relationship between the latent clean labels and noisy PLs. A trace regularizer is employed to impose constraints on the form of IDTM for its stability. To further reduce the estimation difficulty of IDTM, we incorporate uncertainty estimation to first improve the accuracy of noisy dataset distillation and then mitigate the negative impacts of falsely distilled examples with an uncertainty-adjusted re-weighting strategy. Extensive experiments and ablation studies on two challenging aerial data sets support the validity of the proposed UALT.

Abstract:
Coded aperture snapshot spectral imaging (CASSI) is an important technique for capturing three-dimensional (3D) hyperspectral images (HSIs), and involves an inverse problem of reconstructing the 3D HSI from its corresponding coded 2D measurements. Existing model-based and learning-based methods either could not explore the implicit feature of different HSIs or require a large amount of paired data for training, resulting in low reconstruction accuracy or poor generalization performance as well as interpretability. To remedy these deficiencies, this paper proposes a novel HSI reconstruction method, which exploits the global spectral correlation from the HSI itself through a formulation of model-driven low-rank subspace representation and learns the deep prior by a data-driven self-supervised deep learning scheme. Specifically, we firstly develop a model-driven low-rank subspace representation to decompose the HSI as the product of an orthogonal basis and a spatial representation coefficient, then propose a data-driven deep guided spatial-attention network (called DGSAN) to adaptively reconstruct the implicit spatial feature of HSI by learning the deep coefficient prior (DCP), and finally embed these implicit priors into an iterative optimization framework through a self-supervised training way without requiring any training data. Thus, the proposed method shall enhance the reconstruction accuracy, generalization ability, and interpretability. Extensive experiments on several datasets and imaging systems validate the superiority of our method. The source code and data of this article will be made publicly available at https://github.com/ChenYong1993/LRSDN.

Abstract:
In this paper, we provide an in-depth assessment on the Bjøntegaard Delta. We construct a large data set of video compression performance comparisons using a diverse set of metrics including PSNR, VMAF, bitrate, and processing energies. These metrics are evaluated for visual data types such as classic perspective video, 360° video, point clouds, and screen content. As compression technology, we consider multiple hybrid video codecs as well as state-of-the-art neural network based compression methods. Using additional supporting points in-between standard points defined by parameters such as the quantization parameter, we assess the interpolation error of the Bjøntegaard-Delta (BD) calculus and its impact on the final BD value. From the analysis, we find that the BD calculus is most accurate in the standard application of rate-distortion comparisons with mean errors below 0.5 percentage points. For other applications and special cases, e.g., VMAF quality, energy considerations, or inter-codec comparisons, the errors are higher (up to 5 percentage points), but can be halved by using a higher number of supporting points. We finally come up with recommendations on how to use the BD calculus such that the validity of the resulting BD-values is maximized. Main recommendations are as follows: First, relative curve differences should be plotted and analyzed. Second, the logarithmic domain should be used for saturating metrics such as SSIM and VMAF. Third, BD values below a certain threshold indicated by the subset error should not be used to draw recommendations. Fourth, using two supporting points is sufficient to obtain rough performance estimates.

Abstract:
Text field labelling plays a key role in Key Information Extraction (KIE) from structured document images. However, existing methods ignore the field drift and outlier problems, which limit their performance and make them less robust. This paper casts the text field labelling problem into a partial graph matching problem and proposes an end-to-end trainable framework called Deep Partial Graph Matching (dPGM) for the one-shot KIE task. It represents each document as a graph and estimates the correspondence between text fields from different documents by maximizing the graph similarity of different documents. Our framework obtains a strict one-to-one correspondence by adopting a combinatorial solver module with an extra one-to-(at most)-one mapping constraint to do the exact graph matching, which leads to the robustness of the field drift problem and the outlier problem. Finally, a large one-shot KIE dataset named DKIE is collected and annotated to promote research of the KIE task. This dataset will be released to the research and industry communities. Extensive experiments on both the public and our new DKIE datasets show that our method can achieve state-of-the-art performance and is more robust than existing methods.

Abstract:
Video anomaly detection aims to find the events in a video that do not conform to the expected behavior. The prevalent methods mainly detect anomalies by snippet reconstruction or future frame prediction error. However, the error is highly dependent on the local context of the current snippet and lacks the understanding of normality. To address this issue, we propose to detect anomalous events not only by the local context, but also according to the consistency between the testing event and the knowledge about normality from the training data. Concretely, we propose a novel two-stream framework based on context recovery and knowledge retrieval, where the two streams can complement each other. For the context recovery stream, we propose a spatiotemporal U-Net which can fully utilize the motion information to predict the future frame. Furthermore, we propose a maximum local error mechanism to alleviate the problem of large recovery errors caused by complex foreground objects. For the knowledge retrieval stream, we propose an improved learnable locality-sensitive hashing, which optimizes hash functions via a Siamese network and a mutual difference loss. The knowledge about normality is encoded and stored in hash tables, and the distance between the testing event and the knowledge representation is used to reveal the probability of anomaly. Finally, we fuse the anomaly scores from the two streams to detect anomalies. Extensive experiments demonstrate the effectiveness and complementarity of the two streams, whereby the proposed two-stream framework achieves state-of-the-art performance on ShanghaiTech, Avenue and Corridor datasets among the methods without object detection. Even if compared with the methods using object detection, our method reaches competitive or better performance on the ShanghaiTech, Avenue, and Ped2 datasets.

Abstract:
Deep compressive sensing (CS) has become a prevalent technique for image acquisition and reconstruction. However, existing deep learning (DL)-based CS methods often encounter challenges such as block artifacts and information loss during iterative reconstruction, particularly at low sampling rates, resulting in a reduction of reconstructed details. To address these issues, we propose NesTD-Net, an unfolding-based architecture inspired by the NESTA algorithm, designed for image CS. NesTD-Net integrates DL modules into NESTA iterations, forming a deep network that continuously iterates to minimize the \ell _1 -norm CS problem, ensuring high-quality image CS. Utilizing a learned sampling matrix for measurements and an initialization module for initial estimate, NesTD-Net then introduces Iteration Sub-Modules derived from the NESTA algorithm (i.e., \mathbf Y_k , \mathbf Z_k , and \mathbf X_k ) during reconstruction stages to iteratively solve the \ell _1 -norm CS reconstruction. Additionally, NesTD-Net incorporates a Dual-Path Deblocking Structure (DPDS) to facilitate feature information flow and mitigate block artifacts, enhancing image detail reconstruction. Furthermore, DPDS exhibits remarkable versatility and demonstrates seamless integration with other unfolding-based methods, offering the potential to enhance their performance in image reconstruction. Experimental results demonstrate that our proposed NesTD-Net achieves better performance compared to other state-of-the-art methods in terms of image quality metrics such as SSIM and PSNR, as well as visual perception on several public benchmark datasets.

Abstract:
Recently, attempts to learn the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion have drawn much attention. One of the most challenging aspects of this task is to handle independently moving objects as they break the rigid-scene assumption. In this paper, we show for the first time that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. The proposed moving object (MO) masks, which are induced by the depth variance to shifted positional information (SPI) and are referred to as ‘SPIMO’ masks, are highly robust and consistently remove independently moving objects from the scenes, allowing for robust and consistent learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for depth discretization, improving the fine granularity and accuracy of the final aggregated depth maps. Finally, we employ existing boosting techniques in a new way that self-supervises moving object depths further. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with four- to eight-fold fewer parameters than the previous SOTA techniques that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.

Abstract:
Recently, visual food analysis has received more and more attention in the computer vision community due to its wide application scenarios, e.g., diet nutrition management, smart restaurant, and personalized diet recommendation. Considering that food images are unstructured images with complex and unfixed visual patterns, mining food-related semantic-aware regions is crucial. Furthermore, the ingredients contained in food images are semantically related to each other due to the cooking habits and have significant semantic relationships with food categories under the hierarchical food classification ontology. Therefore, modeling the long-range semantic relationships between ingredients and the categories-ingredients semantic interactions is beneficial for ingredient recognition and food analysis. Taking these factors into consideration, we propose a multi-task learning framework for food category and ingredient recognition. This framework mainly consists of a food-orient Transformer named Convolution-Enhanced Bi-Branch Adaptive Transformer (CBiAFormer) and a multi-task category-ingredient recognition network called Structural Learning and Cross-Task Interaction (SLCI). In order to capture the complex and unfixed fine-grained patterns of food images, we propose a query-aware data-adaptive attention mechanism called Bi-Branch Adaptive Attention (BiA-Attention) in CBiAFormer, which consists of a local fine-grained branch and a global coarse-grained branch to mine local and global semantic-aware regions for different input images through an adaptive candidate key/value sets assignment for each query. Additionally, a convolutional patch embedding module is proposed to extract the fine-grained features which are neglected by Transformers. To fully utilize the ingredient information, we propose SLCI, which consists of cross-layer attention to model the semantic relationships between ingredients and two cross-task interaction modules to mine the semantic interactions between categories and ingredients. Extensive experiments show that our method achieves competitive performance on three mainstream food datasets (ETH Food-101, Vireo Food-172, and ISIA Food-200). Visualization analyses of CBiAFormer and SLCI on two tasks prove the effectiveness of our method. Codes will be released upon publication. Code and models are available at https://github.com/Liuyuxinict/CBiAFormer.

Abstract:
Change detection (CD) is a fundamental and important task for monitoring the land surface dynamics in the earth observation field. Existing deep learning-based CD methods typically extract bi-temporal image features using a weight-sharing Siamese encoder network and identify change regions using a decoder network. These CD methods, however, still perform far from satisfactorily as we observe that 1) deep encoder layers focus on irrelevant background regions; and 2) the models’ confidence in the change regions is inconsistent at different decoder stages. The first problem is because deep encoder layers cannot effectively learn from imbalanced change categories using the sole output supervision, while the second problem is attributed to the lack of explicit semantic consistency preservation. To address these issues, we design a novel similarity-aware attention flow network (SAAN). SAAN incorporates a similarity-guided attention flow module with deeply supervised similarity optimization to achieve effective change detection. Specifically, we counter the first issue by explicitly guiding deep encoder layers to discover semantic relations from bi-temporal input images using deeply supervised similarity optimization. The extracted features are optimized to be semantically similar in the unchanged regions and dissimilar in the changing regions. The second drawback can be alleviated by the proposed similarity-guided attention flow module, which incorporates similarity-guided attention modules and attention flow mechanisms to guide the model to focus on discriminative channels and regions. We evaluated the effectiveness and generalization ability of the proposed method by conducting experiments on a wide range of CD tasks. The experimental results demonstrate that our method achieves excellent performance on several CD tasks, with discriminative features and semantic consistency preserved.

Abstract:
With recent deep learning based approaches showing promising results in removing noise from images, the best denoising performance has been reported in a supervised learning setup that requires a large set of paired noisy images and ground truth data for training. The strong data requirement can be mitigated by unsupervised learning techniques, however, accurate modelling of images or noise variances is still crucial for high-quality solutions. The learning problem is ill-posed for unknown noise distributions. This paper investigates the tasks of image denoising and noise variance estimation in a single, joint learning framework. To address the ill-posedness of the problem, we present deep variation prior (DVP), which states that the variation of a properly learnt denoiser with respect to the change of noise satisfies some smoothness properties, as a key criterion for good denoisers. Building upon DVP and under the assumption that the noise is zero mean and pixel-wise independent conditioned on the image, an unsupervised deep learning framework, that simultaneously learns a denoiser and estimates noise variances, is developed. Our method does not require any clean training images or an external step of noise estimation, and instead, approximates the minimum mean squared error denoisers using only a set of noisy images. With the two underlying tasks being considered in a single framework, we allow them to be optimised for each other. The experimental results show a denoising quality comparable to that of supervised learning and accurate noise variance estimates.

Abstract:
In this paper, we propose a novel framework for multi-person pose estimation and tracking on challenging scenarios. In view of occlusions and motion blurs which hinder the performance of pose tracking, we proposed to model humans as graphs and perform pose estimation and tracking by concentrating on the visible parts of human bodies which are informative about complete skeletons under incomplete observations. Specifically, the proposed framework involves three parts: (i) A Sparse Key-point Flow Estimating Module (SKFEM) and a Hierarchical Graph Distance Minimizing Module (HGMM) for estimating pixel-level and human-level motion, respectively; (ii) Pixel-level appearance consistency and human-level structural consistency are combined in measuring the visibility scores of body joints. The scores guide the pose estimator to predict complete skeletons by observing high-visibility parts, under the assumption that visible and invisible parts are inherently correlated in human part graphs. The pose estimator is iteratively fine-tuned to achieve this capability; (iii) Multiple historical frames are combined to benefit tracking which is implemented using HGMM. The proposed approach not only achieves state-of-the-art performance on PoseTrack datasets but also contributes to significant improvements in other tasks such as human-related anomaly detection.

Abstract:
We conducted a large-scale study of human perceptual quality judgments of High Dynamic Range (HDR) and Standard Dynamic Range (SDR) videos subjected to scaling and compression levels and viewed on three different display devices. While conventional expectations are that HDR quality is better than SDR quality, we have found subject preference of HDR versus SDR depends heavily on the display device, as well as on resolution scaling and bitrate. To study this question, we collected more than 23,000 quality ratings from 67 volunteers who watched 356 videos on OLED, QLED, and LCD televisions, and among many other findings, observed that HDR videos were often rated as lower quality than SDR videos at lower bitrates, particularly when viewed on LCD and QLED displays. Since it is of interest to be able to measure the quality of videos under these scenarios, e.g. to inform decisions regarding scaling, compression, and SDR vs HDR, we tested several well-known full-reference and no-reference video quality models on the new database. Towards advancing progress on this problem, we also developed a novel no-reference model called HDRPatchMAX, that uses a contrast-based analysis of classical and bit-depth features to predict quality more accurately than existing metrics.

Abstract:
We endeavor on a rarely explored task named thermal infrared video denoising. Perception in the thermal infrared significantly enhances the capabilities of machine vision. Nonetheless, noise in imaging systems is one of the factors that hampers the large-scale application of equipment. Existing thermal infrared denoising methods, primarily focusing on the image level, inadequately utilize time-domain information and insufficiently conduct investigation of system-level mixed noise, presenting the inferior ability in the video-recorded era; while video denoising methods, commonly applied to RGB cameras, exhibit uncertain effectiveness owing to substantial dissimilarities in the noise models and modalities between RGB and thermal infrared images. In sight of this, we initially revisit the imaging mechanism, while concurrently introducing a physics-inspired noise generator based on the sources and characteristics of system noise. Subsequently, a thermal infrared video denoising dataset consisting of 518 real-world videos is constructed. Lastly, we propose a denoising model called multi-domain infrared video denoising network, capable of concentrating features from the time, space, and frequency domains to restore high-fidelity videos. Extensive experiments demonstrate that the proposed method achieves state-of-the-art denoising quality and can be successfully applied to commercial cameras and downstream vision tasks, providing a new avenue for clear videography in the thermal infrared world. The dataset and code will be available.

Abstract:
With the increasing availability of cameras in vehicles, obtaining license plate (LP) information via on-board cameras has become feasible in traffic scenarios. LPs play a pivotal role in vehicle identification, making automatic LP detection (ALPD) a crucial area within traffic analysis. Recent advancements in deep learning have spurred a surge of studies in ALPD. However, the computational limitations of on-board devices hinder the performance of real-time ALPD systems for moving vehicles. Therefore, we propose a real-time frame-by-frame LP detector focusing on real-time accurate LP detection. Specifically, video frames are categorized into keyframes and non-keyframes. Keyframes are processed by a deeper network (high-level stream), while non-keyframes are handled by a lightweight network (low-level stream), significantly enhancing efficiency. To achieve accurate detection, we design a knowledge distillation strategy to boost the performance of low-level stream and a feature propagation method to introduce the temporal clues in video LP detection. Our contributions are: (1) A real-time frame-by-frame LP detector for video LP detection is proposed, achieving a competitive performance with popular one-stage LP detectors. (2) A simple feature-based knowledge distillation strategy is introduced to improve the low-level stream performance. (3) A spatial-temporal attention feature propagation method is designed to refine the features from non-keyframes guided by the memory features from keyframes, leveraging the inherent temporal correlation in videos. The ablation studies show the effectiveness of knowledge distillation strategy and feature propagation method.

Abstract:
This study aims to develop advanced and training-free full-reference image quality assessment (FR-IQA) models based on deep neural networks. Specifically, we investigate measures that allow us to perceptually compare deep network features and reveal their underlying factors. We find that distribution measures enjoy advanced perceptual awareness and test the Wasserstein distance (WSD), Jensen-Shannon divergence (JSD), and symmetric Kullback-Leibler divergence (SKLD) measures when comparing deep features acquired from various pretrained deep networks, including the Visual Geometry Group (VGG) network, SqueezeNet, MobileNet, and EfficientNet. The proposed FR-IQA models exhibit superior alignment with subjective human evaluations across diverse image quality assessment (IQA) datasets without training, demonstrating the advanced perceptual relevance of distribution measures when comparing deep network features. Additionally, we explore the applicability of deep distribution measures in image super-resolution enhancement tasks, highlighting their potential for guiding perceptual enhancements. The code is available on website. (https://github.com/Buka-Xing/Deep-network-based-distribution-measures-for-full-reference-image-quality-assessment).

Affiliations: State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi’an, Shaanxi, China; State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi’an, Shaanxi, China; State Key Laboratory of Integrated Services Networks, School of Cyber Engineering, Xidian University, Xi’an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; School of Electronic Engineering, Xidian University, Xi’an, China

Abstract:
The emergence of face forgery has raised global concerns on social security, thereby facilitating the research on automatic forgery detection. Although current forgery detectors have demonstrated promising performance in determining authenticity, their susceptibility to adversarial perturbations remains insufficiently addressed. Given the nuanced discrepancies between real and fake instances are essential in forgery detection, previous defensive paradigms based on input processing and adversarial training tend to disrupt these discrepancies. For the detectors, the learning difficulty is thus increased, and the natural accuracy is dramatically decreased. To achieve adversarial defense without changing the instances as well as the detectors, a novel defensive paradigm called Inspector is designed specifically for face forgery detectors. Specifically, Inspector defends against adversarial attacks in a coarse-to-fine manner. In the coarse defense stage, adversarial instances with evident perturbations are directly identified and filtered out. Subsequently, in the fine defense stage, the threats from adversarial instances with imperceptible perturbations are further detected and eliminated. Experimental results across different types of face forgery datasets and detectors demonstrate that our method achieves state-of-the-art performances against various types of adversarial perturbations while better preserving natural accuracy. Code is available on https://github.com/xarryon/Inspector.

Abstract:
Division of focal plane color polarization camera becomes the mainstream in polarimetric imaging for it directly captures color polarization mosaic image by one snapshot, so image demosaicking is an essential task. Current color polarization demosaicking (CPDM) methods are prone to unsatisfied results since it’s difficult to recover missed 15 or 14 pixels out of 16 pixels in color polarization mosaic images. To address this problem, a non-locally regularized convolutional sparse regularization model, which is advantaged in denoising and edge maintaining, is proposed to recall more information for CPDM task, and the CPDM task is transformed into an energy function to be solved by ADMM optimization. Finally, the optimal model generates informative and clear results. The experimental results, including reconstructed synthetic and real-world scenes, demonstrate that our proposed method outperforms the current state-of-the-art methods in terms of quantitative measurements and visual quality. The source code is available at https://github.com/roydon-luo/NLCSR-CPDM.

Abstract:
We study the visual quality judgments of human subjects on digital human avatars (sometimes referred to as “holograms” in the parlance of virtual reality [VR] and augmented reality [AR] systems) that have been subjected to distortions. We also study the ability of video quality models to predict human judgments. As streaming human avatar videos in VR or AR become increasingly common, the need for more advanced human avatar video compression protocols will be required to address the tradeoffs between faithfully transmitting high-quality visual representations while adjusting to changeable bandwidth scenarios. During transmission over the internet, the perceived quality of compressed human avatar videos can be severely impaired by visual artifacts. To optimize trade-offs between perceptual quality and data volume in practical workflows, video quality assessment (VQA) models are essential tools. However, there are very few VQA algorithms developed specifically to analyze human body avatar videos, due, at least in part, to the dearth of appropriate and comprehensive datasets of adequate size. Towards filling this gap, we introduce the LIVE-Meta Rendered Human Avatar VQA Database, which contains 720 human avatar videos processed using 20 different combinations of encoding parameters, labeled by corresponding human perceptual quality judgments that were collected in six degrees of freedom VR headsets. To demonstrate the usefulness of this new and unique video resource, we use it to study and compare the performances of a variety of state-of-the-art Full Reference and No Reference video quality prediction models, including a new model called HoloQA. As a service to the research community, we publicly releases the metadata of the new database at https://live.ece.utexas.edu/research/LIVE-Meta-rendered-human-avatar/index.html.

Abstract:
High dynamic range (HDR) video offers a more realistic visual experience than standard dynamic range (SDR) video, while introducing new challenges to both compression and transmission. Rate control is an effective technology to overcome these challenges, and ensure optimal HDR video delivery. However, the rate control algorithm in the latest video coding standard, versatile video coding (VVC), is tailored to SDR videos, and does not produce well coding results when encoding HDR videos. To address this problem, a data-driven \lambda -domain rate control algorithm is proposed for VVC HDR intra frames in this paper. First, the coding characteristics of HDR intra coding are analyzed, and a piecewise R- \lambda model is proposed to accurately determine the correlation between the rate (R) and the Lagrange parameter \lambda for HDR intra frames. Then, to optimize bit allocation at the coding tree unit (CTU)-level, a wavelet-based residual neural network (WRNN) is developed to accurately predict the parameters of the piecewise R- \lambda model for each CTU. Third, a large-scale HDR dataset is established for training WRNN, which facilitates the applications of deep learning in HDR intra coding. Extensive experimental results show that our proposed HDR intra frame rate control algorithm achieves superior coding results than the state-of-the-art algorithms. The source code of this work will be released at https://github.com/TJU-Videocoding/WRNN.git.

Abstract:
The increased interest in consumer-grade high dynamic range (HDR) images and videos in recent years has caused a proliferation of HDR deghosting algorithms. Despite numerous proposals, a fast, memory-efficient, and robust algorithm has been difficult to achieve. This paper addresses this problem by leveraging the power of attention and U-Net-based neural representations and using a conservative deghosting strategy. Given two bracketed exposures of a scene, we produce an HDR image that maximally resembles the high exposure where it is well-exposed and fuses aligned information from both exposures otherwise. We evaluate the performance of our algorithm under several different challenging scenarios, using both visual and quantitative results, and show that it matches the state-of-the-art algorithms despite using only two exposures and having significantly lower computational complexity. Furthermore, the parameters of our algorithm greatly simplify deploying its different versions for devices with a variety of computational constraints, including mobile devices.

Abstract:
Lesion localization and tracking are critical for accurate, automated medical imaging analysis. Contrast-enhanced ultrasound (CEUS) significantly enriches traditional B-mode ultrasound with contrast agents to provide high-resolution, real-time images of blood flow in tissues and organs. However, many trackers, designed primarily for natural RGB or B-mode ultrasound images, underutilize the extensive data from dual-screen enhanced images and fail to account for respiratory motion, thus facing challenges in achieving accurate target tracking. To address the existing challenges, we propose an adaptive-weighted dual mapping (ADMNet), an online tracking framework tailored for CEUS. Firstly, we introduced a novel Multimodal Atrous Attention Fusion (MAAF) module, innovatively designed to adapt the weightage between B-mode and enhanced images in dual-screen CEUS, reflecting the clinician’s dynamic focus shifts between screens. Secondly, we proposed a Respiratory Motion Compensation (RMC) module to correct motion trajectory interferences due to respiratory motion, effectively leveraging temporal information. We utilized two newly established CEUS datasets, totaling 35,082 frames, to benchmark the ADMNet against various advanced B-mode ultrasound trackers. Our extensive experiments revealed that ADMNet achieves new state-of-the-art performance in CEUS tracking. Ablation studies and visualizations further underline the effectiveness of MAAF and RMC modules, demonstrating the promising potential of ADMNet in clinical CEUS tracing, thus providing novel research avenues in this field.

Abstract:
To enhance the viewer experience of standard dynamic range (SDR) video content on high dynamic range (HDR) displays, inverse tone mapping (ITM) is employed. Objective visual quality assessment (VQA) models are needed for effective evaluation of ITM algorithms. However, there is a lack of specialized VQA models for assessing the visual quality of inversely tone-mapped HDR videos (ITM-HDR-Videos). This paper addresses both an algorithmic and a dataset gap by introducing a novel SDR referenced HDR (SD-R-HD) VQA model tailored for ITM-HDR-Videos, along with the first public dataset specifically constructed for this purpose. The innovations of the SD-R-HD VQA model include 1) utilizing available SDR video as a reference signal, 2) extracting features that characterize standard ITM operations such as global mapping and local compensation, and 3) directly modeling interframe inconsistencies introduced by ITM operations. The newly created ITM-HDR-VQA dataset comprises 200 ITM-HDR-Videos annotated with mean opinion scores, gathered over 320 man-hours of psychovisual experiments. Experimental results demonstrate that the SD-R-HD VQA model significantly outperforms existing state-of-the-art VQA models.

Abstract:
3D perception tasks, such as 3D object detection and Bird’s-Eye-View (BEV) segmentation using multi-camera images, have drawn significant attention recently. Despite the fact that accurately estimating both semantic and 3D scene layouts are crucial for this task, existing techniques often neglect the synergistic effects of semantic and depth cues, leading to the occurrence of classification and position estimation errors. Additionally, the input-independent nature of initial queries also limits the learning capacity of Transformer-based models. To tackle these challenges, we propose an input-aware Transformer framework that leverages Semantics and Depth as priors (named SDTR). Our approach involves the use of an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning process of object categorization and position estimation. Moreover, we introduce a Prior-guided Query Builder that incorporates the semantic prior into the initial queries of the Transformer, resulting in more effective input-aware queries. Extensive experiments on the nuScenes and Lyft benchmarks demonstrate the state-of-the-art performance of our method in both 3D object detection and BEV segmentation tasks.

Abstract:
Hyperspectral images (HSIs) are composed of hundreds of contiguous waveband images, offering a wealth of spatial and spectral information. However, the practical use of HSIs is often hindered by the presence of complicated noise caused by various factors such as non-uniform sensor response and dark current. Traditional methods for denoising HSIs rely on constrained optimization approaches, where selecting appropriate prior knowledge is critical for achieving satisfactory results. Nevertheless, these traditional algorithms are limited by hand-crafted priors, leaving room for improvement in their denoising performance. Recently, the supervised deep learning technique has emerged as a promising approach for HSI denoising. However, their requirement for paired training data and poor generalization ability on untrained noise distributions pose challenges in practical applications. In this paper, we design a novel algorithm by the synergism of optimization-based methods and deep learning techniques. Specifically, we introduce a plug-and-play Deep Low-rank Decomposition (DLD) model into the optimization framework. Furthermore, we propose an effective mechanism to incorporate traditional prior knowledge into the DLD model. Finally, we provide a detailed analysis of the optimization process and convergence of the proposed method. Empirical evaluations on various tasks, including hyperspectral image denoising and spectral compressive imaging, demonstrate the superiority of our approach over state-of-the-art methods.

Abstract:
In this paper, we present a novel high dynamic range (HDR)-like image generator that utilizes mutual-guided learning between multi-exposure registration and fusion, leading to promising dynamic multi-exposure image fusion. The method consists of three main components: the registration network, the fusion network, and the dual attention network which seamlessly integrates registration and fusion processes. Initially, within the registration network, the estimation of deformation fields among multi-exposure image sequences is conducted following an exposure-invariant feature extraction phase. This leads to enhanced accuracy by mitigating discrepancies across domains. Subsequently, the fusion network utilizes a progressive frequency fusion module in two distinct stages, addressing color correction and detail preservation within low and high-frequency domains, respectively. To facilitate the mutual enhancement of the registration and fusion networks, we undertake a mutual-guided learning strategy encompassing their physical connection and constraint paradigm. Firstly, a dual attention network bridges the registration and fusion networks, addressing ghosting, which is beyond the scope of registration and facilitates information exchange between input images. Secondly, a meticulously designed generative adversarial network-like iterative training schema guides the overall network framework, thereby yielding high-quality HDR-like images through mutual enhancement. Comprehensive experiments on publicly available datasets validate the superiority of our method over existing state-of-the-art approaches.

Abstract:
Multi-view 3D visual perception including 3D object detection and Birds’-eye-view (BEV) map segmentation is essential for autonomous driving. However, there has been little discussion about 3D context attention between dynamic objects and static elements with multi-view camera inputs, due to the challenging nature of recovering the 3D spatial information from images and performing effective 3D context interaction. 3D context information is expected to provide more cues to enhance 3D visual perception for autonomous driving. We thus propose a new transformer-based framework named CI3D in an attempt to implicitly model 3D context interaction between dynamic objects and static map elements. To achieve this, we use dynamic object queries and static map queries to gather information from multi-view image features, which are represented sparsely in 3D space. Moreover, a dynamic 3D position encoder is utilized to precisely generate queries’ positional embeddings. With accurate positional embeddings, the queries effectively aggregate 3D context information via a multi-head attention mechanism to model 3D context interaction. We further reveal that sparse supervision signals from the limited number of queries result in the issue of rough and vague image features. To overcome this challenge, we introduce a panoptic segmentation head as an auxiliary task and a 3D-to-2D deformable cross-attention module, greatly enhancing the robustness of spatial feature learning and sampling. Our approach has been extensively evaluated on two large-scale datasets, nuScenes and Waymo, and significantly outperforms the baseline method on both benchmarks.

Abstract:
Many studies have attempted to classify small drones in response to threats posed by the technical progress of small drones. Recently, small drones have been classified utilizing convolutional neural networks (CNNs) with micro-Doppler signature (MDS) images generated from frequency-modulated continuous-wave (FMCW) radars. This study proposes a comprehensive method for classifying small drones in real-time using high-quality MDS images and an ultra-lightweight CNN. The proposed comprehensive method comprises an MDS image generation technique, which can improve the quality of MDS images generated via FMCW radars, and the ultra-lightweight CNN with high accuracy performance despite its remarkable lightness. Experimental results show that the proposed MDS image generation technique increases the accuracy of CNNs by enhancing the MDS image quality. This is further verified using the results of uncertainty quantification. The proposed ultra-lightweight CNN significantly decreases the computational cost while achieving high accuracy. Finally, we demonstrate that the proposed comprehensive method successfully classifies small drones from far distances with high efficiency and accuracy: the maximum and average accuracies for classification are 100% and 99.21%, respectively, and the numbers of parameters, nodes, and floating-point operations of the proposed ultra-lightweight CNN are approximately 4.88 K, 21.51 K, and 31.52 M, respectively.

Abstract:
Fringe projection profilometry is a widely used technique for 3D measurement due to its high accuracy and speed. However, the accuracy significantly decreases when measuring complex texture objects, especially in the junction of different colors. This paper analyzes the causes of errors resulting from complex textures and proposes a height compensation method to revise the error by employing a dual-projector structure. Moreover, the dual-projector is capable of acquiring a pair of errors with opposite signs, which can be utilized to calculate the accurate 3D information after determining the ratio of this pair of errors. Experiments provide significant improvement in measuring complex texture objects, demonstrating the proposed method’s ability.

Abstract:
The unsupervised domain adaptation (UDA) based cross-scene remote sensing image classification has recently become an appealing research topic, since it is a valid solution to unsupervised scene classification by exploiting well-labeled data from another scene. Despite its good performance in reducing domain shifts, UDA in multisource data scenarios is hindered by several critical challenges. The first one is the heterogeneity inherent in multisource data complicates domain alignment. The second challenge is the incomplete representation of feature distribution caused by the neglect of the contribution from global information. The third challenge is the inaccuracies in alignment due to errors in establishing target domain conditional distributions. Since UDA does not guarantee the complete consistency of the distribution of the two domains, networks using simple classifiers are still affected by domain shifts, resulting in poor performance. In this paper, we propose a graph embedding interclass relation-aware adaptive network (GeIraA-Net) for unsupervised classification of multi-source remote sensing data, which facilitates knowledge transfer at the class level for two domains by leveraging aligned features to perceive inter-class relation. More specifically, a graph-based progressive hierarchical feature extraction network is constructed, capable of capturing both local and global features of multisource data, thereby consolidating comprehensive domain information within a unified feature space. To deal with the imprecise alignment of data distribution, a joint de-scrambling alignment strategy is designed to utilize the features obtained by a three-step pseudo-label generation module for more delicate domain calibration. Moreover, an adaptive inter-class topology based classifier is constructed to further improve the classification accuracy by making the classifier domain adaptive at the category level. The experimental results show that GeIraA-Net has significant advantages over the current state-of-the-art cross-scene classification methods.

Abstract:
Current video object segmentation approaches primarily rely on frame-wise appearance information to perform matching. Despite significant progress, reliable matching becomes challenging due to rapid changes of the object’s appearance over time. Moreover, previous matching mechanisms suffer from redundant computation and noise interference as the number of accumulated frames increases. In this paper, we introduce a multi-frame spatio-temporal context memory (STCM) network to exploit discriminative spatio-temporal cues in multiple adjacent frames by utilizing a multi-frame context interaction module (MCI) for memory construction. Based on the proposed MCI module, a sparse group memory reader is developed to enable efficient sparse matching during memory reading. Our proposed method is generic and achieves state-of-the-art performance with real-time speed on benchmark datasets such as DAVIS and YouTube-VOS. In addition, our model exhibits robustness to sparse videos with low frame rates.

Abstract:
Accurate segmentation of brain tumors across multiple MRI sequences is essential for diagnosis, treatment planning, and clinical decision-making. In this paper, I propose a cutting-edge framework, named multi-modal graph convolution network (M2GCNet), to explore the relationships across different MR modalities, and address the challenge of brain tumor segmentation. The core of M2GCNet is the multi-modal graph convolution module (M2GCM), a pivotal component that represents MR modalities as graphs, with nodes corresponding to image pixels and edges capturing latent relationships between pixels. This graph-based representation enables the effective utilization of both local and global contextual information. Notably, M2GCM comprises two important modules: the spatial-wise graph convolution module (SGCM), adept at capturing extensive spatial dependencies among distinct regions within an image, and the channel-wise graph convolution module (CGCM), dedicated to modelling intricate contextual dependencies among different channels within the image. Additionally, acknowledging the intrinsic correlation present among different MR modalities, a multi-modal correlation loss function is introduced. This novel loss function aims to capture specific nonlinear relationships between correlated modality pairs, enhancing the model’s ability to achieve accurate segmentation results. The experimental evaluation on two brain tumor datasets demonstrates the superiority of the proposed M2GCNet over other state-of-the-art segmentation methods. Furthermore, the proposed method paves the way for improved tumor diagnosis, multi-modal information fusion, and a deeper understanding of brain tumor pathology.

Abstract:
Subspace-based models have been extensively employed in unsupervised segmentation and completion of human motion sequence (HMS). However, existing approaches often neglect the incorporation of temporal priors embedded in HMS, resulting in suboptimal results. This paper presents a subspace variety model for HMS, along with an innovative Temporal Learning of Subspace Variety Model (TL-SVM) method for enhanced segmentation and completion in HMS. The key idea is to segment incomplete HMS into motion clusters and extracting the subspace features of each motion through the temporal learning of the subspace variety model. Subsequently, the HMS is completed based on the extracted subspace features. Thus, the main challenge is to learn the subspace variety model with temporal priors when confronted with missing entries. To tackle this, the paper develops a spatio-temporal assignment consistency (STAC) constraint for the subspace variety model, leveraging temporal priors embedded in HMS. In addition, a subspace clustering approach under the STAC constraint is proposed to learn the subspace variety model by extracting subspace features from HMS and segmenting HMS into motion clusters alternatively. The proposed subspace clustering model can also handle missing entries with theoretical guarantees. Furthermore, the missing entries of HMS are completed by minimizing the distance between each human motion frame and its corresponding subspace. Extensive experimental results, along with comparisons to state-of-the-art methods on four benchmark datasets, underscore the advantages of the proposed method.

Abstract:
Deep convolutional neural networks (CNNs) have been widely used for fundus image classification and have achieved very impressive performance. However, the explainability of CNNs is poor because of their black-box nature, which limits their application in clinical practice. In this paper, we propose a novel method to search for discriminative regions to increase the confidence of CNNs in the classification of features in specific category, thereby helping users understand which regions in an image are important for a CNN to make a particular prediction. In the proposed method, a set of superpixels is selected in an evolutionary process, such that discriminative regions can be found automatically. Many experiments are conducted to verify the effectiveness of the proposed method. The average drop and average increase obtained with the proposed method are 0 and 77.8%, respectively, in fundus image classification, indicating that the proposed method is very effective in identifying discriminative regions. Additionally, several interesting findings are reported: 1) Some superpixels, which contain the evidence used by humans to make a certain decision in practice, can be identified as discriminative regions via the proposed method; 2) The superpixels identified as discriminative regions are distributed in different locations in an image rather than focusing on regions with a specific instance; and 3) The number of discriminative superpixels obtained via the proposed method is relatively small. In other words, a CNN model can employ a small portion of the pixels in an image to increase the confidence for a specific category.

Abstract:
This paper provides a comprehensive study on features and performance of different ways to incorporate neural networks into lifting-based wavelet-like transforms, within the context of fully scalable and accessible image compression. Specifically, we explore different arrangements of lifting steps, as well as various network architectures for learned lifting operators. Moreover, we examine the impact of the number of learned lifting steps, the number of channels, the number of layers and the support of kernels in each learned lifting operator. To facilitate the study, we investigate two generic training methodologies that are simultaneously appropriate to a wide variety of lifting structures considered. Experimental results ultimately suggest that retaining fixed lifting steps from the base wavelet transform is highly beneficial. Moreover, we demonstrate that employing more learned lifting steps and more layers in each learned lifting operator do not contribute strongly to the compression performance. However, benefits can be obtained by utilizing more channels in each learned lifting operator. Ultimately, the learned wavelet-like transform proposed in this paper achieves over 25% bit-rate savings compared to JPEG 2000 with compact spatial support.

Abstract:
The recently proposed high-order tensor algebraic framework generalizes the tensor singular value decomposition (t-SVD) induced by the invertible linear transform from order-3 to order-d ( d\gt 3 ). However, the derived order-d t-SVD rank essentially ignores the implicit global discrepancy in the quantity distribution of non-zero transformed high-order singular values across the higher modes of tensors. This oversight leads to suboptimal restoration in processing real-world multi-dimensional visual datasets. To address this challenge, in this study, we look in-depth at the intrinsic properties of practical visual data tensors, and put our efforts into faithfully measuring their high-order low-rank nature. Technically, we first present a novel order-d tensor rank definition. This rank function effectively captures the aforementioned discrepancy property observed in real visual data tensors and is thus called the discrepant t-SVD rank. Subsequently, we introduce a nonconvex regularizer to facilitate the construction of the corresponding discrepant t-SVD rank minimization regime. The results show that the investigated low-rank approximation has the closed-form solution and avoids dilemmas caused by the previous convex optimization approach. Based on this new regime, we meticulously develop two models for typical restoration tasks: high-order tensor completion and high-order tensor robust principal component analysis. Numerical examples on order-4 hyperspectral videos, order-4 color videos, and order-5 light field images substantiate that our methods outperform state-of-the-art tensor-represented competitors. Finally, taking a fundamental order-3 hyperspectral tensor restoration task as an example, we further demonstrate the effectiveness of our new rank minimization regime for more practical applications. The source codes of the proposed methods are available at https://github.com/CX-He/DTSVD.git.

Abstract:
Despite the prevailing transition from single-task to multi-task approaches in video anomaly detection, we observe that many adopt sub-optimal frameworks for individual proxy tasks. Motivated by this, we contend that optimizing single-task frameworks can advance both single- and multi-task approaches. Accordingly, we leverage middle-frame prediction as the primary proxy task, and introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. This hybrid framework is built upon a bi-directional structure that seamlessly integrates both vision transformers and ConvLSTMs. Specifically, we utilize this bi-directional structure to fully analyze the temporal dimension by predicting frames in both forward and backward directions, significantly boosting the detection stability. Given the transformer’s capacity to model long-range contextual dependencies, we develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Furthermore, we devise a layer-interactive ConvLSTM bridge that facilitates the smooth flow of low-level features across layers and time-steps, thereby strengthening predictions with fine details. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions. Several experiments conducted on public benchmarks affirm the efficacy of our hybrid framework, whether used as a standalone single-task approach or integrated as a branch in a multi-task approach. These experiments also underscore the advantages of merging vision transformers and ConvLSTMs for video anomaly detection. The implementation of our hybrid framework is available at https://github.com/SHENGUODONG19951126/ConvTTrans-ConvLSTM.

Abstract:
In large-scale long-term dynamic environments, high-frequency dynamic objects inevitably lead to significant changes in the appearance of the scene at the same location at different times, which is catastrophic for place recognition (PR). Therefore, how to eliminate the influence of dynamic objects to achieve robust PR has universal practical value for mobile robots and autonomous vehicles. To this end, we suggest a novel semantically consistent LiDAR PR method based on chained cascade network, called SC_LPR, which mainly consists of a LiDAR semantic image inpainting network (LSI-Net) and a semantic pyramid Transformer-based PR network (SPT-Net). Specifically, LSI-Net is a coarse-to-fine generative adversarial network (GAN) with a gated convolutional autoencoder as the backbone. To effectively address the challenges posed by variable-scale dynamic object masks, we integrate the updated Transformer block with mask attention and gated trident block into LSI-Net. Sequentially, in order to generate a discriminative global descriptor representing the point cloud, we design an encoder with pyramid Transformer block to efficiently encode long-range dependencies and global contexts between different categories in the inpainted semantic image, followed by an augmented NetVALD, a generalized VLAD (Vector of Locally Aggregated Descriptors) layer that adaptively aggregates salient local features. Last but not least, we first attempt to create a LiDAR semantic inpainting dataset, called LSI-Dataset, to effectively validate the proposed method. Experimental comparisons show that our method not only improves semantic inpainting performance by about 6%, but also improves PR performance in dynamic environments by about 8% compared to the representative optimal baseline. LSI-Dataset will be publicly available at https://github.KD.LPR.com/.

Abstract:
In Alzheimer’s disease (AD) diagnosis, joint feature selection for predicting disease labels (classification) and estimating cognitive scores (regression) with neuroimaging data has received increasing attention. In this paper, we propose a model named Shared Manifold regularized Joint Feature Selection (SMJFS) that performs classification and regression in a unified framework for AD diagnosis. For classification, unlike the existing works that build least squares regression models which are insufficient in the ability of extracting discriminative information for classification, we design an objective function that integrates linear discriminant analysis and subspace sparsity regularization for acquiring an informative feature subset. Furthermore, the local data relationships are learned according to the samples’ transformed distances to exploit the local data structure adaptively. For regression, in contrast to previous works that overlook the correlations among cognitive scores, we learn a latent score space to capture the correlations and employ the latent space to design a regression model with \ell _2,1 -norm regularization, facilitating the feature selection in regression task. Moreover, the missing cognitive scores can be recovered in the latent space for increasing the number of available training samples. Meanwhile, to capture the correlations between the two tasks and describe the local relationships between samples, we construct an adaptive shared graph to guide the subspace learning in classification and the latent cognitive score learning in regression simultaneously. An efficient iterative optimization algorithm is proposed to solve the optimization problem. Extensive experiments on three datasets validate the discriminability of the features selected by SMJFS.

Abstract:
Combining dual-energy computed tomography (DECT) with positron emission tomography (PET) offers many potential clinical applications but typically requires expensive hardware upgrades or increases radiation doses on PET/CT scanners due to an extra X-ray CT scan. The recent PET-enabled DECT method allows DECT imaging on PET/CT without requiring a second X-ray CT scan. It combines the already existing X-ray CT image with a 511 keV \gamma -ray CT (gCT) image reconstructed from time-of-flight PET emission data. A kernelized framework has been developed for reconstructing gCT image but this method has not fully exploited the potential of prior knowledge. Use of deep neural networks may explore the power of deep learning in this application. However, common approaches require a large database for training, which is impractical for a new imaging method like PET-enabled DECT. Here, we propose a single-subject method by using neural-network representation as a deep coefficient prior to improving gCT image reconstruction without population-based pre-training. The resulting optimization problem becomes the tomographic estimation of nonlinear neural-network parameters from gCT projection data. This complicated problem can be efficiently solved by utilizing the optimization transfer strategy with quadratic surrogates. Each iteration of the proposed neural optimization transfer algorithm includes: PET activity image update; gCT image update; and least-square neural-network learning in the gCT image domain. This algorithm is guaranteed to monotonically increase the data likelihood. Results from computer simulation, real phantom data and real patient data have demonstrated that the proposed method can significantly improve gCT image quality and consequent multi-material decomposition as compared to other methods.

Abstract:
In this work, by re-examining the “matching” nature of Anomaly Detection (AD), we propose a novel AD framework that simultaneously enjoys new records of AD accuracy and dramatically high running speed. In this framework, the anomaly detection problem is solved via a cascade patch retrieval procedure that retrieves the nearest neighbors for each test image patch in a coarse-to-fine fashion. Given a test sample, the top-K most similar training images are first selected based on a robust histogram matching process. Secondly, the nearest neighbor of each test patch is retrieved over the similar geometrical locations on those “most similar images”, by using a carefully trained local metric. Finally, the anomaly score of each test image patch is calculated based on the distance to its “nearest neighbor” and the “non-background” probability. The proposed method is termed “Cascade Patch Retrieval” (CPR) in this work. Different from the previous patch-matching-based AD algorithms, CPR selects proper “targets” (reference images and patches) before “shooting” (patch-matching). On the well-acknowledged MVTec AD, BTAD and MVTec-3D AD datasets, the proposed algorithm consistently outperforms all the comparing SOTA methods by remarkable margins, measured by various AD metrics. Furthermore, CPR is extremely efficient. It runs at the speed of 113 FPS with the standard setting while its simplified version only requires less than 1 ms to process an image at the cost of a trivial accuracy drop. The code of CPR is available at https://github.com/flyinghu123/CPR.

Abstract:
We propose a non-cascaded and crosstalk-free multi-image encryption method based on optical scanning holography and 2D orthogonal compressive sensing. This approach enables the simultaneous recording and encryption of multiple plaintext images without mechanical scanning, while allows for independent retrieval of each image with exceptional quality and no crosstalk. Two features would bring about more substantial security and privacy. The one is that, by employing a sequence of pre-designed structural patterns as encryption keys at the pupil, multiple samplings can be achieved and ultimately the holographic cyphertext can be obtained. These patterns are generated using a measurement matrix processed with the generalized orthogonal one. As a result, one can accomplish the differentiation of images prior to the recording and thus neither need to pretreat the pending images nor to suppress the out-of-focus noise in the decrypted image. The other one is that, the non-cascaded architecture ensures that different plaintexts do not share sub-keys. Meanwhile, compared to 1D orthogonal compressive sensing, the 2D counterpart makes the proposed method to synchronously deal with multiple images of more complexity, while acquire significantly high-quality decrypted images and far greater encryption capacity. Further, the regularities of conversion between 1D and 2D orthogonal compressive sensing are identified, which may be instructive when to manufacture a practical multi-image cryptosystem or a single-pixel imaging equipment. A more general method or concept named synthesis pupil encoding is advanced. It may provide an effective way to combine multiple encryption methods together into a non-cascaded one. Our method possesses nonlinearity and it is also promising in multi-image asymmetric or public key cryptosystem as well as multi-user multiplexing.

Abstract:
In the field of Multi-Object Tracking (MOT), the incorporation of appearance cues into tracking-by-detection heuristic trackers using re-identification (ReID) features has posed limitations on its advancement. The existing ReID paradigm involves the extraction of coarse-grained object-level feature vectors from cropped objects at a fixed input size using a ReID model, and similarity computation through a simple normalized inner product. However, MOT requires fine-grained features from different object regions and more accurate similarity measurements to identify individuals, especially in the presence of occlusion. To address these limitations, we propose a novel feature paradigm. In this paradigm, we extract the feature map from the entire frame image to preserve object sizes and represent objects using a set of fine-grained features from different object regions. These features are sampled from adaptive patches within the object bounding box on the feature map to effectively capture local appearance cues. We introduce Mutual Ratio Similarity (MRS) to accurately measure the similarity of the most discriminative region between two objects based on the sampled patches, which proves effective in handling occlusion. Moreover, we propose absolute Intersection over Union (AIoU) to consider object sizes in feature cost computation. We integrate our paradigm with advanced motion techniques to develop a heuristic Motion-Feature joint multi-object tracker, MoFe. Within it, we reformulate the track state transition of tracklets to better model their life cycle, and firstly introduce a runtime recorder after MoFe to refine trajectories. Extensive experiments on five benchmarks, i.e., GMOT-40, BDD100k, DanceTrack, MOT17, and MOT20, demonstrate that MoFe achieves state-of-the-art performance in robustness and generalizability without any fine-tuning, and even surpasses the performance of fine-tuned ReID features.

Abstract:
Medical image segmentation is a critical task in clinical applications. Recently, the Segment Anything Model (SAM) has demonstrated potential for natural image segmentation. However, the requirement for expert labour to provide prompts, and the domain gap between natural and medical images pose significant obstacles in adapting SAM to medical images. To overcome these challenges, this paper introduces a novel prompt generation method named EviPrompt. The proposed method requires only a single reference image-annotation pair, making it a training-free solution that significantly reduces the need for extensive labelling and computational resources. First, prompts are automatically generated based on the similarity between features of the reference and target images, and evidential learning is introduced to improve reliability. Then, to mitigate the impact of the domain gap, committee voting and inference-guided in-context learning are employed, generating prompts primarily based on human prior knowledge and reducing reliance on extracted semantic information. EviPrompt represents an efficient and robust approach to medical image segmentation. We evaluate it across a broad range of tasks and modalities, confirming its efficacy. The source code is available at https://github.com/SPIresearch/EviPrompt.

Abstract:
How to compress face video is a crucial problem for a series of online applications, such as video chat/conference, live broadcasting and remote education. Compared to other natural videos, these face-centric videos owning abundant structural information can be compactly represented and high-quality reconstructed via deep generative models, such that the promising compression performance can be achieved. However, the existing generative face video compression schemes are faced with the inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To solve this drawback, we propose a 3D-Keypoint-and-2D-Motion based generative method for Face Video Compression, namely FVC-3K2M, which can well ensure perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized into separate 3D keypoints from the global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance the adaptation of various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication in the extremely limited bandwidth, e.g., 2 kbps. Compared to the state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.

Abstract:
Deep learning-based hyperspectral image (HSI) change detection (CD) approaches have a strong ability to leverage spectral-spatial-temporal information through automatic feature extraction, and currently dominate in the research field. However, their efficiency and universality are limited by the dependency on labeled data. Although the newly applied untrained networks can avoid the need for labeled data, their feature volatility from the simple difference space easily leads to inaccurate CD results. Inspired by the interesting finding that salient changes appear as bright “stripes” in a new feature space, we propose a novel unsupervised CD method that represents and models changes in stripes for HSIs (named as StripeCD), which integrates optimization modeling into an untrained network. The StripeCD method constructs a new feature space that represents change features in stripes and models them in a novel optimization manner. It consists of three main parts: 1) dual-branch untrained convolutional network, which is utilized to extract deep difference features from bitemporal HSIs and combined with a two-stage channel selection strategy to emphasize the important channels that contribute to CD. 2) multiscale forward-backward segmentation framework, which is proposed for salient change representation. It transforms deep difference features into a new feature space by exploiting the structure information of ground objects and associates salient changes with the stripe-shaped change component. 3) stripe-shaped change extraction model, which characterizes the global sparsity and local discontinuity of salient changes. It explores the intrinsic properties of deep difference features and constructs model-based constraints to better identify changed regions in a controllable manner. The proposed StripeCD method outperformed the state-of-the-art unsupervised CD approaches on three widely used datasets. In addition, the proposed StripeCD method indicates the potential for further investigation of untrained networks in facilitating reliable CD.

Abstract:
Band-to-Band Registration (BBR) is a pre-requisite image processing operation essential for specific remote sensing multispectral sensors. BBR aims to align spectral wavelength channels at sub-pixel level accuracy over each other. The paper presents a novel BBR technique utilizing Co-occurrence Scale Space (CSS) for feature point detection and Spatial Confined RANSAC (SC-RANSAC) for removing outlier matched control points. Additionally, the Segmented Affine Transformation (SAT) model reduces distortion and ensures consistent BBR. The methodology developed is evaluated with Nano-MX multispectral images onboard the Indian Nano Satellite (INS-2B) covering diverse landscapes. BBR performance using the proposed method is also verified visually at a 4X zoom level on satellite scenes dominated by cloud pixels. The band misregistration effect on the Normalized Difference Vegetation Index (NDVI) from INS-2B is analyzed and cross-validated with the closest acquisition Landsat-9 OLI NDVI map before and after BBR correction. The experimental evaluation shows that the proposed BBR approach outperforms the state-of-the-art image registration techniques.

Abstract:
Satellite video multi-label scene classification predicts semantic labels of multiple ground contents to describe a given satellite observation video, which plays an important role in applications like ocean observation, smart cities, et al. However, the lack of a high-quality and large-scale dataset prevents further improvement of the task. And existing methods on general videos have the difficulty to represent the local details of ground contents when directly applied to the satellite videos. In this paper, our contributions include (1) we develop the first publicly available and large-scale satellite video multi-label scene classification dataset. It consists of 18 classes of static and dynamic ground contents, 3549 videos, and 141960 frames. (2) we propose a baseline method with the novel Spatial and Temporal Feature Cooperative Encoding (STFCE). It exploits the relations between local spatial and temporal features, and models long-term motion information hidden in inter-frame variations. In this way, it can enhance features of local details and obtain the powerful video-scene-level feature representation, which raises the classification performance effectively. Experimental results show that our proposed STFCE outperforms 13 state-of-the-art methods with a global average precision (GAP) of 0.8106 and the careful fusion and joint learning of the spatial, temporal, and motion features are beneficial to achieve a more robust and accurate model. Moreover, benchmarking results show that the proposed dataset is very challenging and we hope it could promote further development of the satellite video multi-label scene classification task.

Abstract:
Deep learning-based image compression has made great progresses recently. However, some leading schemes use serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we propose four techniques to balance the trade-off between the complexity and performance. We first introduce the deformable residual module to remove more redundancies in the input image, thereby enhancing compression performance. Second, we design an improved checkerboard context model with two separate distribution parameter estimation networks and different probability models, which enables parallel decoding without sacrificing the performance compared to the sequential context-adaptive model. Third, we develop a three-pass knowledge distillation scheme to retrain the decoder and entropy coding, and reduce the complexity of the core decoder network, which transfers both the final and intermediate results of the teacher network to the student network to improve its performance. Fourth, we introduce L_1 regularization to make the numerical values of the latent representation more sparse, and we only encode non-zero channels in the encoding and decoding process to reduce the bit rate. This also reduces the encoding and decoding time. Experiments show that compared to the state-of-the-art learned image coding scheme, our method can be about 20 times faster in encoding and 70-90 times faster in decoding, and our R-D performance is also 2.3% higher. Our method achieves better rate-distortion performance than classical image codecs including H.266/VVC-intra (4:4:4) and some recent learned methods, as measured by both PSNR and MS-SSIM metrics on the Kodak and Tecnick-40 datasets.

Abstract:
Different from traditional source coding techniques, distributed source coding (DSC) techniques rely on independent encoding at the encoding end but joint decoding at the decoding end to compress image sources which exhibit correlation. Channel code-based DSC techniques compress correlated image sources by fully utilizing such correlation. However, in addition to this correlation information, the current symbol of most real image sources is correlated to the preceding or subsequent symbols. Such correlation information should also play an important role for improving the compression performance of channel code-based DSC schemes. To this end, we present an efficient message passing mechanism for LDPC Code-based DSC schemes (ELCDSC) to compress correlated image sources. By utilizing this message passing mechanism, we enable LDPC code-based DSC techniques to not only make full use of the inter-source correlation to assist compression, but also integrate the intra-source correlation in each message passing iteration to improve the compression performance. It is the first that enables LDPC code-based DSC techniques to achieve the utilization of both intra- and inter-source correlations in the message passing mechanism. Experimental results reveal that ELCDSC significantly enhances the compression ratio of correlated image sources, surpassing other DSC schemes.